Irmin: September 2020 update

by the Irmin team on Sep 8th, 2020

This post will survey the latest design decisions and performance improvements made to irmin-pack, the Irmin storage backend used by Tezos. Tezos is an open-source blockchain technology, written in OCaml, which uses many libraries from the MirageOS ecosystem. For more context on the design of irmin-pack and how it is optimised for the Tezos use-case, you can check out our previous blog post.

This post showcases the improvements to irmin-pack since its initial deployment on Tezos:

  1. Faster read-only store instances
  2. Improved automatic flushing
  3. Staging generic serialisation operations
  4. More control over Index.merge
  5. Clearing stores

Faster read-only store instances

The Tezos use-case of Irmin requires both read-only and read-write store handles, with multiple readers and a single writer all accessing the same Irmin store concurrently. These store handles are held by different processes (with disjoint memory spaces) so the instances must use files on disk to synchronise, ensuring that the readers never miss updates from the writer. The writer instance automatically flushes its internal buffers to disk at regular intervals, allowing the readers to regularly pick up replace calls.

Until recently, each time a reader looked for a value – be it a commit, a node, or a blob – it first checked if the writer had flushed new contents to disk. This ensured that the readers always see the latest changes from the writer. However, if the writer isn't actively modifying the regions being read, the readers make one unnecessary system call per find. The higher the rate of reads, the more time is lost to these synchronisation points. This is particularly problematic in two use-cases:

  • Taking snapshots of the store. Tezos supports exporting portable snapshots of the store data. Since this operation only reads historic data in the store (traversing backwards from a given block hash), it's never necessary to synchronise with the writer.
  • Bulk writes. It's sometimes necessary for the writer to dump lots of new data to disk at once (for instance, when adding a commit to the history). In these cases, any readers will repeatedly synchronise with the disk even though they don't need to do so until the bulk operation is complete. More on this in the coming months!

To better support these use-cases, we dropped the requirement for readers to maintain strict consistency with the writer instance. Instead, readers can call an explicit sync function only when they need to see the latest concurrent updates from the writer instance.

In our benchmarks, there is a clear speed-up for find operations from readers:

[RO] Find in random order with implicit syncs
        Total time: 67.276527
        Operations per second: 148640.253086
        Mbytes per second: 6.378948
        Read amplification in syscalls: 3.919739
        Read amplification in bytes: 63.943734

[RO] Find in random order with only one call to sync
        Total time: 40.817458
        Operations per second: 244993.208543
        Mbytes per second: 10.513968
        Read amplification in syscalls: 0.919588
        Read amplification in bytes: 63.258072

Not only it is faster, we can see also that fewer system calls are used in the Read amplification in syscalls column. The benchmarks consists of reading 10,000,000 entries of 45 bytes each.

Relevant PRs: irmin #1008, index #175, index #198 and index #203.

Better flushing for the read-write instance

Irmin-pack uses an index to speed up find calls: a pack file is used to store pairs of (key, value) and an index records the address in pack where a key is stored. A read-write instance has to write both the index and the pack file, for a read-only instance to find a value. Moreover, the order in which the data is flushed to disk for the two files is important: the address for the pair (key, value) cannot be written before the pair itself. Otherwise the read-only instance can read an address for a non existing (key, value) pair. But both pack and index have internal buffers that accumulate data, in order to reduce the number of system calls, and both decide arbitrarily when to flush those buffers to disk.

We introduce a flush_callback argument in index, which registers a callback for whenever the index decides to flush. irmin-pack uses this callback to flush its pack file, resolving the issue of the dangling address.

Relevant PRs: index #189, index #216, irmin #1051.

Faster serialisation for Irmin.Type

Irmin uses a library of generic operations: functions that take a runtime representation of a type and derive some operation on that type. These are used in many places to automatically derive encoders and decoders for our types, which are then used to move data to and from disk. For instance:

val decode : 'a t -> string -> 'a
(** [decode t] is the binary decoder of values represented by [t]. *)

(** Read an integer from a binary-encoded file. *)
let int_of_file ~path = open_in_bin path |> input_line |> decode Irmin.Type.int32

The generic decode takes a representation of the type int32 and uses this to select the right binary decoder. Unfortunately, we pay the cost of this runtime specialisation every time we call int_of_file. If we're invoking the decoder for a particular type very often – such as when serialising store values – it's more efficient to specialise decode once:

(** Specialised binary decoder for integers. *)
let decode_int32 = decode Irmin.Type.int32

let int_of_file_fast ~path = open_in_bin path |> input_line |> decode_int32 contents

The question then becomes: how can we change decode to encourage it to be used in this more-efficient way? We can add a type wrapper – called staged – to prevent the user from passing two arguments to decode at once:

module Staged : sig
  type +'a t
  val   stage : 'a   -> 'a t
  val unstage : 'a t -> 'a
end

val decode : 'a t -> (string -> 'a) Staged.t
(** [decode t] needs to be explicitly unstaged before being used. *)

By forcing the user to add a Staged.unstage type coercion when using this function, we're encouraging them to hoist such operations out of their hot-loops:

(** The slow implementation no longer type-checks: *)

let int_of_file ~path = open_in_bin path |> input_line |> decode Irmin.Type.int32
(* Error: This expression has type (string -> 'a) Staged.t
 *        but an expression was expected of type string -> 'a *)

(* Instead, we know to pull [Staged.t] values out of hot-loops: *)

let decode_int32 = Staged.unstage (decode Irmin.Type.int32)

let int_of_file_fast ~path = open_in_bin path |> input_line |> decode_int32 contents

We made similar changes to the performance-critical generic functions in Irmin.Type, and observed significant performance improvements. We also added benchmarks for serialising various types.

Relevant PRs: irmin #1030 and irmin #1028.

There are other interesting factors at play, such as altering decode to increase the efficiency of the specialised decoders; we leave this for a future blog post.

More control over Index.merge

index regularly does a maintenance operation, called merge, to ensure fast look-ups while having a small memory imprint. This operation is concurrent with the most of other functions, it is however not concurrent with itself: a second merge needs to wait for a previous one to finish. When writing big chunks of data very often, merge operations become blocking. To help measuring and detecting a blocking merge, we added in the index API calls to check whether a merge is ongoing, and to time it.

We mentioned that merge is concurrent with most of the other function in index. One notable exception was close, which had to wait for any ongoing merge to finish, before closing the index. Now close interrupts an ongoing merge, but still leaves the index in a clean state.

Relevant PRs: index #185, irmin #1049 and index #215.

Clearing stores

Another feature we recently added is the possibility to clear the store. It is implemented by removing the old files on disk and opening fresh ones. However in irmin-pack, the read-only instance has to detect that a clear occurred. To do this, we add a generation in the header of the files used by an irmin-pack store, which is increased by the clear operation. A generation change signals to the read-only instance that it needs to close the file and open it again, to be able to read the latest values.

As the header of the files on disk changed with the addition of the clear operation, the irmin-pack stores created previous to this change are no longer supported. We added a migration function for stores created with the previous version (version 1) to the new version (version 2) of the store. You can call this migration function as follows:

 let open_store () =
    Store.Repo.v config
  in
  Lwt.catch open_store (function
      | Irmin_pack.Unsupported_version `V1 ->
          Logs.app (fun l -> l "migrating store to version 2") ;
          Store.migrate config ;
          Logs.app (fun l -> l "migration ended, opening store") ;
          open_store ()
      | exn ->
          Lwt.fail exn)

Relevant PRs: index #211, irmin #1047, irmin #1070 and irmin #1071.

Conclusion

We hope you've enjoyed this discussion of our recent work. Stay tuned for our next Tezos / MirageOS development update! Thanks to our commercial customers, users and open-source contributors for making this work possible.