ocaml-git 2.0

by Romain Calascibetta on Oct 19th, 2018

I'm very happy to announce a new major release of ocaml-git (2.0). This release is a 2-year effort to get a revamped streaming API offering a full control over memory allocation. This new version also adds production-ready implementations of the wire protocol: git push and git pull now work very reliably using the raw Git and smart HTTP protocol (SSH support will come soon). git gc is also implemented, and all of the basic bricks are now available to create Git servers. MirageOS support is available out-of-the-box.

Two years ago, we decided to rewrite ocaml-git and split it into standalone libraries. More details about these new libraries are also given below.

But first, let's focus on ocaml-git's new design. The primary goal was to fix memory consumption issues that our users noticed with the previous version, and to make git push work reliably. We also took care about not breaking the API too much, to ease the transition for current users.

Controlled allocations

There is a big difference in the way ocaml-git and git are designed: git is a short-lived command-line tool which does not care that much about allocation policies, whereas we wanted to build a library that can be linked with long-lived Git client and/or server applications. We had to make some (performance) compromises to support that use-case, at the benefit of tighter allocation policies — and hence more predictable memory consumption patterns. Other Git libraries such as libgit2 also have to deal with similar concerns.

In order to keep a tight control on the allocated memory, we decided to use decompress instead of camlzip. decompress allows the users to provide their own buffer instead of allocating dynamically. This allowed us to keep a better control on memory consumption. See below for more details on decompress.

We also used angstrom and encore to provide a streaming interface to encode and decode Git objects. The streaming API is currently hidden to the end-user, but it helped us a lot to build abstraction and, again, on managing the allocation policy of the library.

Complete PACK file support (including GC)

In order to find the right abstraction for manipulating pack files in a long-lived application, we experimented with various prototypes. We haven't found the right abstractions just yet, but we believe the PACK format could be useful to store any kind of data in the future (and not especially Git objects).

We implemented git gc by following the same heuristics as Git to compress pack files and we produce something similar in size — decompress has a good ratio about compression — and we are using duff, our own implementation of xdiff, the binary diff algorithm used by Git (more details on duff below). We also had to re-implement the streaming algorithm to reconstruct idx files on the fly, when receiving pack file on the network.

One notable feature of our compression algorithms is they work without the assumption that the underlying system implements POSIX: hence, they can work fully in-memory, in a browser using web storage or inside a MirageOS unikernel with wodan.

Production-ready push and pull

We re-implemented and abstracted the Git Smart protocol, and used that abstraction to make git push and git pull work over HTTP. By default we provide a cohttp implementation but users can use their own — for instance based on httpaf. As proof-of-concept, the initial pull-request of ocaml-git 2.0 was created using this new implementation; moreover, we wrote a prototype of a Git client compiled with js_of_ocaml, which was able to run git pull over HTTP inside a browser!

Finally, that implementation will allow MirageOS unikernels to synchronize their internal state with external Git stores (hosted for instance on GitHub) using push/pull mechanisms. We also expect to release a server-side implementation of the smart HTTP protocol, so that the state of any unikernel can be inspected via git pull. Stay tuned for more updates on that topic!

Standalone Dependencies

Below you can find the details of the new stable releases of libraries that are used by ocaml-git 2.0.

optint and checkseum

In some parts of ocaml-git, we need to compute a Circular Redundancy Check value. It is 32-bit integer value. optint provides an abstraction of it but structurally uses an unboxed integer or a boxed int32 value depending on target (32 bit or 64 bit architecture).

checkseum relies on optint and provides 3 implementations of CRC:

  • Adler32 (used by zlib format)
  • CRC32 (used by gzip format and git)
  • CRC32-C (used by wodan)

checkseum uses the linking trick: this means that users of the library program against an abstract API (only the cmi is provided); at link-time, users have to select which implementation to use: checkseum.c (the C implementation) or checkseum.ocaml (the OCaml implementation). The process is currently a bit cumbersome but upcoming dune release will make that process much more transparent to the users.

encore (/angkor/)

In git, we work with Git objects (tree, blob or commit). These objects are encoded in a specific format. Then, the hash of these objects are computed from the encoded result to get a unique identifier. For example, the hash of your last commit is: sha1(encode(commit)).

A common operation in git is to decode Git objects from an encoded representation of them (especially in .git/objects/* as a loose file) and restore them in another part of your Git repository (like in a PACK file or on the command-line).

Hence, we need to ensure that encoding is always deterministic, and that decoding an encoded Git object is always the identity, e.g. there is an isomorphism between the decoder and the encoder.

let decoder <.> encoder : value -> value = id
let encoder <.> decoder : string -> string = id

encore is a library in which you can describe a format (like Git format) and from it, we can derive a streaming decoder and encoder that are isomorphic by construction.

duff

duff is a pure implementation in OCaml of the xdiff algorithm. Git has an optimized representation of your Git repository. It's a PACK file. This format uses a binary diff algorithm called xdiff to compress binary data. xdiff takes a source A and a target B and try to find common sub-strings between A and B.

This is done by a Rabin's fingerprint of the source A applied to the target B. The fingerprint can then be used to produce a lightweight representation of B in terms of sub-strings of A.

duff implements this algorithm (with additional Git's constraints, regarding the size of the sliding windows) in OCaml. It provides a small binary xduff that complies with the format of Git without the zlib layer.

$ xduff diff source target > target.xduff
$ xduff patch source < target.xduff > target.new
$ diff target target.new
$ echo $?
0

decompress

decompress is a pure implementation in OCaml of zlib and rfc1951. You can compress and decompress data flows and, obviously, Git does this compression in loose files and PACK files.

It provides a non-blocking interface and is easily usable in a server context. Indeed, the implementation never allocates and only relies on what the user provides (window, input and output buffer). Then, the distribution provides an easy example of how to use decompress:

val inflate: ?level:int -> string -> string
val deflate: string -> string

digestif

digestif is a toolbox providing many implementations of hash algorithms such as:

  • MD5
  • SHA1
  • SHA224
  • SHA256
  • SHA384
  • SHA512
  • BLAKE2B
  • BLAKE2S
  • RIPEMD160

Like checkseum, digestif uses the linking trick too: from a shared interface, it provides 2 implementations, in C (digestif.c) and OCaml (digestif.ocaml).

Regarding Git, we use the SHA1 implementation and we are ready to migrate ocaml-git to BLAKE2{B,S} as the Git core team expects - and, in the OCaml world, it is just a functor application with another implementation.

eqaf

Some applications require that secret values are compared in constant time. Functions like String.equal do not have this property, so we have decided to provide a small package — eqaf — providing a constant-time equal function. digestif uses it to check equality of hashes — it also exposes unsafe_compare if you don't care about timing attacks in your application.

Of course, the biggest work on this package is not about the implementation of the equal function but a way to check the constant-time assumption on this function. Using this, we did a benchmark on Linux, Windows and Mac to check it.

An interesting fact is that after various experiments, we replaced the initial implementation in C (extracted from OpenBSD's timingsafe_memcmp) with an OCaml implementation behaving in a much more predictable way on all the tested platforms.

Conclusion

The upcoming version 2.0 of Irmin is using ocaml-git to create small applications that push and pull their state to GitHub. We think that Git offers a very nice model to persist data for distributed applications and we hope that more people will use ocaml-git to experiment and manipulate application data in Git. Please send us your feedback!