Introducing irmin-pack
`irmin-pack` is an Irmin storage backend that we developed over the last year specifically to meet the Tezos use case. Tezos nodes were initially using an LMDB-based backend for their storage, which after only a year of activity led to 250 GB of disk space usage, with a monthly growth of 25 GB. Our goal was to dramatically reduce this disk space usage.
Part of the Irmin 2.0.0 release and still under active development, it has been successfully integrated as the storage layer of Tezos nodes and has been running in production for the last ten months with great results. It reduces disk usage by a factor of 10, while still ensuring similar performance and consistency guarantees in a memory-constrained and concurrent environment.
`irmin-pack` was presented along with Irmin v2 at the OCaml Workshop 2020; you can watch the presentation here:
General structure
`irmin-pack` exposes functors that allow the user to provide arbitrary low-level modules for handling I/O, and provides a fast key-value store interface composed of three components:

- The `pack` is used to store the data contained in the Irmin store, as blobs.
- The `dict` stores the paths where these blobs should live.
- The `index` keeps track of the blobs that are present in the repository by containing location information in the `pack`.
Each of these uses both on-disk storage, for persistence and concurrent access, and various in-memory caches, for speed.
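For orientation, here is an illustrative, much-simplified signature for these three components; the module names and function signatures below are ours, not the real `irmin-pack` API.

```ocaml
(* Illustrative only: a simplified view of the three components.
   Names and signatures are hypothetical, not the real irmin-pack API. *)
module type PACK_BACKEND = sig
  type t
  type hash
  type offset = int64

  module Pack : sig
    (* Append-only file of serialized blobs. *)
    val append : t -> string -> offset
    val read : t -> off:offset -> len:int -> string
  end

  module Dict : sig
    (* Deduplication of path names: path <-> short identifier. *)
    val index : t -> string -> int option
    val find : t -> int -> string option
  end

  module Index : sig
    (* Location of a blob in the pack, queried by hash. *)
    val find : t -> hash -> (offset * int) option
  end
end
```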
Storing the data in the pack file
The `pack` contains most of the data stored in this Irmin backend. It is an append-only file containing the serialized data stored in the Irmin repository. All three Irmin stores (see our architecture page in the tutorial to learn more) are contained in this single file.
`Content` and `Commit` serialization is straightforward through `Irmin.Type`. They are written along with their length (to allow correct reading) and hash (to enable integrity checks). The hash is used to resolve internal links inside the pack when nodes are written.
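As a rough illustration of this framing, the sketch below serializes a value with `Irmin.Type` and prefixes it with its hash and length. The `hash_of_string` helper is a placeholder for the store's hash function, and the actual on-disk layout of `irmin-pack` differs.

```ocaml
(* A minimal sketch of framing a blob for an append-only file:
   hash, then length, then the payload serialized with Irmin.Type.
   [hash_of_string] stands in for the store's hash function (BLAKE2B by
   default); the real irmin-pack layout is different. *)
let append_blob ~hash_of_string buf (ty : 'a Irmin.Type.t) (v : 'a) =
  let payload = Irmin.Type.to_bin_string ty v in
  Buffer.add_string buf (hash_of_string payload);   (* integrity check *)
  Buffer.add_string buf
    (Irmin.Type.(to_bin_string int32) (Int32.of_int (String.length payload)));
  Buffer.add_string buf payload                     (* the actual data *)
```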
Optimizing large nodes
Serializing nodes is not as simple as serializing contents. Nodes might contain an arbitrarily large number of children, and serializing them as a long list of references might harm performance: it would mean loading and writing a large amount of data for each modification, no matter how small that modification is. Similarly, browsing the tree would mean reading large blocks of data, even though only one child is needed.
For this reason, we implemented a Patricia Tree representation of internal nodes that allows us to split the child list into smaller parts that can be accessed and modified independently, while still being quickly available when needed. This reduces duplication of tree data in the Irmin store and improves disk access times.
Of course, we provide a custom hashing mechanism, so that hashing a node through this partitioning remains backwards-compatible: users who rely on hash information get the same hash whether or not the node is split.
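The idea can be sketched roughly as follows; the type below is only illustrative and far simpler than the actual inode encoding used by `irmin-pack`.

```ocaml
(* A rough illustration of splitting a large node: instead of one flat
   list of children, a node is stored either as an inline list of values
   (small nodes) or as a tree of partitions referenced by offset (large
   nodes), so that updating one child only rewrites its partition.
   The real irmin-pack inode encoding is more involved. *)
type 'v inode =
  | Values of (string * 'v) list   (* small node: children stored inline    *)
  | Tree of partition array        (* large node: children split into parts *)
and partition =
  | Empty
  | Ptr of int64                   (* offset of the sub-partition in the pack *)
```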
Optimizing internal references
In the Git model, all data are content-addressable (i.e. data are always referenced by their hash). This naturally lends itself to indexing data by hash on the disk itself (i.e. the links from `commits` to `nodes`, and from `nodes` to `nodes` or `contents`, are realized by hash).
We did not follow this approach in `irmin-pack`, for at least two reasons:

- Referencing by hash does not allow fast recovery of the children, since there is no way to find the relevant blob directly in the `pack` by providing the hash. We will go into the details of this later in this post.
- Even though hashes are simple, fixed-size objects, their size is not negligible: the default hashing function in Irmin is BLAKE2B, which produces 32-byte digests.
Instead, internal links in the `pack` file are concretized by the offsets – `int64` integers – of the children rather than by their hashes. Provided that trees are always written bottom-up (so that children already exist in the pack when their parents are written), this solves both issues above. The data handled by the backend is always immutable and the file is append-only, ensuring that the links can never be broken.
Of course, this encoding does not break the content-addressable property: one can always retrieve an arbitrary piece of data through its hash; internal links simply avoid that indirection.
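Schematically, a reference to another object can therefore take one of two forms; the type below is purely illustrative.

```ocaml
(* Purely illustrative: a reference to another object in the pack is
   either a direct 64-bit offset (compact and fast to follow) or a hash
   that must first be resolved through the index. *)
type internal_ref =
  | Direct of int64    (* offset of the child in the pack file *)
  | Indexed of string  (* hash of the child, resolved via the index *)
```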
Deduplicating the path names through the dict
In fact, the most common operations when using `irmin-pack` consist of modifying the tree's leaves rather than its shape. This is similar to the way most of us use Git: modifying the contents of files is very frequent, while renaming or adding new files is rather rare. Still, when writing a `node` in a new commit, that node must contain the path names of its children, which end up being duplicated a large number of times.
The `dict` is used to deduplicate path names so that the `pack` file can uniquely reference them using shorter identifiers. It is composed of an in-memory bidirectional hash table, allowing queries from path to identifier when serializing and referencing, and from identifier to path when deserializing and dereferencing.
To ensure persistence of the data across multiple runs and in case of crashes, the small size of the `dict` – less than 15 MB in the Tezos use case – allows us to write the bindings to a write-only, append-only file that is fully read and loaded on start-up.
We guarantee that the `dict` memory usage is bounded by providing a capacity parameter. Adding a binding is guarded by this capacity: once the limit is reached, new paths are inlined in the `pack` file instead. This scenario does not happen during normal use of `irmin-pack`, but it prevents attacks that would make the memory grow in an unbounded way.
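Here is a minimal sketch of such a capacity-bounded, bidirectional table. All names are hypothetical, and the real `dict` additionally appends every new binding to its write-only file so it can be reloaded on start-up.

```ocaml
(* A minimal sketch of a capacity-bounded, bidirectional dict.
   Names are hypothetical; the real dict also persists each new binding
   to an append-only file. *)
type dict = {
  capacity   : int;
  id_of_path : (string, int) Hashtbl.t;  (* path -> identifier *)
  path_of_id : (int, string) Hashtbl.t;  (* identifier -> path *)
}

let create capacity =
  { capacity;
    id_of_path = Hashtbl.create 1024;
    path_of_id = Hashtbl.create 1024 }

(* Returns [Some id] when the path is (or can be) interned, and [None]
   once the capacity is reached, in which case the caller inlines the
   raw path in the pack file instead. *)
let index t path =
  match Hashtbl.find_opt t.id_of_path path with
  | Some id -> Some id
  | None ->
      let id = Hashtbl.length t.id_of_path in
      if id >= t.capacity then None
      else begin
        Hashtbl.add t.id_of_path path id;
        Hashtbl.add t.path_of_id id path;
        Some id
      end

let find t id = Hashtbl.find_opt t.path_of_id id
```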
Retrieving the data in the pack by indexing
Since the `pack` file is append-only, naively reading its data would require a linear search through the whole file for each lookup. Instead, we provide an index that maps hashes of data blocks to their location in the `pack` file, along with their length. This module allows quick recovery of the values queried by hash.
It provides a simple key-value interface that actually hides the most complex part of `irmin-pack`:
```ocaml
type t

val v : readonly:bool -> path:string -> t
val find : t -> Key.t -> Value.t
val replace : t -> Key.t -> Value.t -> unit

(* ... *)
```
It has driven most of our development efforts on `irmin-pack` and is now available as a separate library, wisely named index, which you can check out on GitHub under `mirage/index` and install via opam as the `index` and `index-unix` packages.
When `index` is used inside `irmin-pack`, the keys are the hashes of the data stored in the backend, and the values are the `(offset, length)` pairs that indicate the location in the `pack` file. From now on in this post, we will stick to the `index` abstraction: `key` and `value` will refer to the keys and values as viewed by the `index`.
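For instance, such a value could be described along these lines with the `Irmin.Type` combinators; the record and field names here are ours, not the library's.

```ocaml
(* Sketch of an index value: the location of a blob in the pack file.
   The record and field names are illustrative, not the library's. *)
type location = { offset : int64; length : int }

let location_t =
  let open Irmin.Type in
  record "location" (fun offset length -> { offset; length })
  |+ field "offset" int64 (fun t -> t.offset)
  |+ field "length" int (fun t -> t.length)
  |> sealr
```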
Our index is split into two major parts. The `log` is relatively small and, most importantly, bounded; it contains the recently added bindings. The `data` is much larger, and contains older bindings.
The `log` part consists of a hash table associating keys to values. In order to ensure concurrent access, and to be able to recover after a crash, we also maintain a write-only, append-only file with the same contents, such that both always contain exactly the same data at any time.
When a new key-value binding is added to the index, the value is simply serialized along with its key and added to the `log`.
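A simplified version of this could look as follows; the names are hypothetical, and the real `index` uses a compact binary encoding rather than text.

```ocaml
(* Minimal sketch of a log: an in-memory table mirrored by an append-only
   file, so that the same bindings can be recovered after a crash.
   Names are hypothetical; the real index uses a binary encoding. *)
type log = {
  table : (string, int64 * int) Hashtbl.t;  (* hash -> (offset, length)       *)
  file  : out_channel;                      (* write-only, append-only mirror *)
}

let add t key (off, len) =
  (* Persist the binding first ... *)
  Printf.fprintf t.file "%s %Ld %d\n" key off len;
  flush t.file;
  (* ... then make it available for in-memory lookups. *)
  Hashtbl.replace t.table key (off, len)
```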
An obvious caveat of this approach is that the in-memory representation of the `log` (the hash table) is unbounded. It also grows a lot, as a Tezos node stores more than 400 million objects. Our memory constraint obviously does not allow such unbounded structures. This is where the `data` part comes in.
When the `log` size reaches a – customizable – threshold, its bindings are flushed into a `data` component, which may already contain data flushed by former `log` overflows. We call this operation a merge.
The important invariant maintained by the `merge` operation is that the `data` file must remain sorted by the hash of the bindings. This enables fast recovery of the data.
During this operation, both the `log` and the former `data` are read in sorted order – `data` is already sorted, and `log` is small and thus easy to sort in memory – and merged into a `merging_data` file. This file is atomically renamed at the end of the operation to replace the older `data` while still ensuring correct concurrent accesses.
This operation obviously needs to re-write the whole index, so its execution is very expensive. For this reason, it is performed by a separate thread in the background, so that the index remains available for regular use and the merge stays transparent to the user.
In the meantime, a `log_async` – similar to `log`, with a file and a hash table – is used to hold new bindings and ensure that the data being merged and the new data are correctly separated. At the end of the merge, the `log_async` becomes the new `log` and is cleared to be ready for the next merge.
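At its heart, this is a standard merge of two sorted sequences, which could be sketched as follows. All parameters are hypothetical; the real implementation streams the `data` file instead of loading it, handles the `log_async`, and runs in a background thread.

```ocaml
(* A much-simplified sketch of the merge: both inputs are lists of bindings
   sorted by hash; they are written in sorted order to a temporary file,
   which then atomically replaces the old data file. *)
let merge ~compare ~write_binding ~tmp_path ~data_path data log =
  let rec go xs ys =
    match xs, ys with
    | [], rest | rest, [] -> List.iter write_binding rest
    | x :: xs', y :: ys' ->
        if compare x y <= 0 then (write_binding x; go xs' ys)
        else (write_binding y; go xs ys')
  in
  go data log;
  (* The rename is atomic, so concurrent readers see either the old or the
     new data file, never a partially written one. *)
  Unix.rename tmp_path data_path
```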
Recovering the data
This design allows fast lookup of the data present in the index. Whenever `find` or `mem` is called, we first look into the `log`, which is simply a call to the corresponding `Hashtbl` function, since this data is held in memory.
If the data is not found in the `log`, the `data` file is browsed. This means that access to recent values is generally faster, because it does not require any access to the disk.
Searching in the `data` file is made efficient by the invariant that we kept during the `merge`: the file is sorted by hash. The search algorithm is an interpolation search, which is permitted by the even distribution of the hashes that we store. The theoretical complexity of interpolation search is O(log log n), which is generally better than a binary search, provided that the computation of the interpolant is cheaper than reads, which is the case here.
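As an illustration, here is interpolation search over an in-memory array of bindings sorted by an evenly distributed integer hash; the on-disk version performs the same steps over pages of the `data` file instead.

```ocaml
(* Illustrative interpolation search over an array sorted by an evenly
   distributed integer hash. The real index runs this over pages of the
   data file on disk rather than over an in-memory array. *)
let interpolation_search (arr : (int * 'a) array) (key : int) : 'a option =
  let rec search lo hi =
    if lo > hi then None
    else
      let klo = fst arr.(lo) and khi = fst arr.(hi) in
      if key < klo || key > khi then None
      else
        (* Estimate the position from the key's relative place in [klo; khi]. *)
        let pos =
          if khi = klo then lo
          else lo + (hi - lo) * (key - klo) / (khi - klo)
        in
        let k = fst arr.(pos) in
        if k = key then Some (snd arr.(pos))
        else if k < key then search (pos + 1) hi
        else search lo (pos - 1)
  in
  if Array.length arr = 0 then None else search 0 (Array.length arr - 1)
```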
This approach allows us to find the data using approximately 5-6 read steps in the file, which is good but still a source of slowdowns. For this reason, we use a fan-out module on top of the interpolation search, able to tell us the exact page in which a given key is located, in constant time, for an additional space cost of ~100 MB. We use it to find the correct page of the disk, then run the interpolation search in that page only. That approach allows us to find the correct value with a single read in the `data` file.
Conclusion
This new backend is now used by the Tezos nodes in production, and manages to reduce the storage size from 250 GB down to 25 GB, with a monthly growth rate of 2 GB, achieving a tenfold reduction.
In the meantime, it provides a single-writer, multiple-reader access pattern that enables bakers and clients to connect to the same storage while it is operated.
On the memory side, all our components are memory-bounded, and the bound is generally customizable, the largest source of memory usage being the `log` part of the `index`. While it can be reduced to fit in 1 GB of memory and run on a small VPS or a Raspberry Pi, one can easily set a higher memory limit on a more powerful machine and achieve even better time performance.