ocaml-git 2.0
Software Engineer
I'm very happy to announce a new major release of ocaml-git
(2.0).
This release is a 2-year effort to get a revamped
streaming API offering a full control over memory
allocation. This new version also adds production-ready implementations of
the wire protocol: git push
and git pull
now work very reliably
using the raw Git and smart HTTP protocol (SSH support will come
soon). git gc
is also implemented, and all of the basic bricks are
now available to create Git servers. MirageOS support is available
out-of-the-box.
Two years ago, we decided to rewrite ocaml-git
and split it into
standalone libraries. More details about these new libraries are also
given below.
But first, let's focus on ocaml-git
's new design. The primary goal was
to fix memory consumption issues that our users noticed with the previous version,
and to make git push
work reliably. We also took care about
not breaking the API too much, to ease the transition for current users.
Controlled allocations
There is a big difference in the way ocaml-git
and git
are designed: git
is a short-lived command-line tool which does not
care that much about allocation policies, whereas we wanted to build a
library that can be linked with long-lived Git client and/or server
applications. We had to make some (performance) compromises to support
that use-case, at the benefit of tighter allocation policies — and hence
more predictable memory consumption patterns.
Other Git libraries such as libgit2
also have to deal with similar concerns.
In order to keep a tight control on the allocated memory, we decided to
use decompress instead of
camlzip
. decompress
allows the users to provide their own buffer
instead of allocating dynamically. This allowed us to keep a better
control on memory consumption. See below for more details on decompress
.
We also used angstrom and encore to provide a streaming interface to encode and decode Git objects. The streaming API is currently hidden to the end-user, but it helped us a lot to build abstraction and, again, on managing the allocation policy of the library.
Complete PACK file support (including GC)
In order to find the right abstraction for manipulating pack files in a long-lived application, we experimented with various prototypes. We haven't found the right abstractions just yet, but we believe the PACK format could be useful to store any kind of data in the future (and not especially Git objects).
We implemented git gc
by following the same heuristics as
Git
to compress pack files and
we produce something similar in size — decompress
has a good ratio about
compression — and we are using duff
, our own implementation of xdiff
, the
binary diff algorithm used by Git (more details on duff
below).
We also had to re-implement the streaming algorithm to reconstruct idx
files on
the fly, when receiving pack file on the network.
One notable feature of our compression algorithms is they work without the assumption that the underlying system implements POSIX: hence, they can work fully in-memory, in a browser using web storage or inside a MirageOS unikernel with wodan.
Production-ready push and pull
We re-implemented and abstracted the Git Smart protocol, and used that
abstraction to make git push
and git pull
work over HTTP. By
default we provide a cohttp
implementation but users can use their own — for instance based on
httpaf.
As proof-of-concept, the initial
pull-request of ocaml-git
2.0 was
created using this new implementation; moreover, we wrote a
prototype of a Git client compiled with js_of_ocaml
, which was able
to run git pull
over HTTP inside a browser!
Finally, that implementation will allow MirageOS unikernels to synchronize their
internal state with external Git stores (hosted for instance on GitHub)
using push/pull mechanisms. We also expect to release a server-side implementation
of the smart HTTP protocol, so that the state of any unikernel can be inspected
via git pull
. Stay tuned for more updates on that topic!
Standalone Dependencies
Below you can find the details of the new stable releases of libraries that are
used by ocaml-git
2.0.
optint
and checkseum
In some parts of ocaml-git
, we need to compute a Circular
Redundancy Check value. It is 32-bit integer value. optint
provides
an abstraction of it but structurally uses an unboxed integer or a
boxed int32
value depending on target (32 bit or 64 bit architecture).
checkseum
relies on optint
and provides 3 implementations of CRC:
- Adler32 (used by
zlib
format) - CRC32 (used by
gzip
format andgit
) - CRC32-C (used by
wodan
)
checkseum
uses the linking trick: this means that users of the
library program against an abstract API (only the cmi
is provided);
at link-time, users have to select which implementation to use:
checkseum.c
(the C implementation) or checkseum.ocaml
(the OCaml
implementation). The process is currently a bit cumbersome but upcoming
dune
release will make that process much more transparent to the users.
encore
(/angkor/)
In git
, we work with Git objects (tree, blob or
commit). These objects are encoded in a specific format. Then,
the hash of these objects are computed from the encoded
result to get a unique identifier. For example, the hash of your last commit is:
sha1(encode(commit))
.
A common operation in git
is to decode Git objects from an encoded
representation of them (especially in .git/objects/*
as a loose
file) and restore them in another part of your Git repository (like in a
PACK file or on the command-line).
Hence, we need to ensure that encoding is always deterministic, and that decoding an encoded Git object is always the identity, e.g. there is an isomorphism between the decoder and the encoder.
let decoder <.> encoder : value -> value = id
let encoder <.> decoder : string -> string = id
encore is a library in which you can describe a format (like Git format) and from it, we can derive a streaming decoder and encoder that are isomorphic by construction.
duff
duff is a pure implementation in
OCaml of the xdiff
algorithm.
Git has an optimized representation of your Git repository. It's a
PACK file. This format uses a binary diff algorithm called xdiff
to compress binary data. xdiff
takes a source A and a target B and try
to find common sub-strings between A and B.
This is done by a Rabin's fingerprint of the source A applied to the target B. The fingerprint can then be used to produce a lightweight representation of B in terms of sub-strings of A.
duff
implements this algorithm (with additional Git's constraints,
regarding the size of the sliding windows) in OCaml. It provides a
small binary xduff
that complies with the format of Git without the zlib
layer.
$ xduff diff source target > target.xduff
$ xduff patch source < target.xduff > target.new
$ diff target target.new
$ echo $?
0
decompress
decompress
is a pure implementation in OCaml of zlib
and
rfc1951
. You can compress and decompress data flows and, obviously,
Git does this compression in loose files and PACK files.
It provides a non-blocking interface and is easily usable in a server
context. Indeed, the implementation never allocates and only relies on
what the user provides (window
, input and output buffer). Then, the
distribution provides an easy example of how to use decompress
:
val inflate: ?level:int -> string -> string
val deflate: string -> string
digestif
digestif is a toolbox providing many implementations of hash algorithms such as:
- MD5
- SHA1
- SHA224
- SHA256
- SHA384
- SHA512
- BLAKE2B
- BLAKE2S
- RIPEMD160
Like checkseum
, digestif
uses the linking trick too: from a
shared interface, it provides 2 implementations, in C (digestif.c
)
and OCaml (digestif.ocaml
).
Regarding Git, we use the SHA1 implementation and we are ready to
migrate ocaml-git
to BLAKE2{B,S} as the Git core team expects - and,
in the OCaml world, it is just a functor application with
another implementation.
eqaf
Some applications require that secret values are compared in constant
time. Functions like String.equal
do not have this property, so we
have decided to provide a small package — eqaf —
providing a constant-time equal
function.
digestif
uses it to check equality of hashes — it also exposes
unsafe_compare
if you don't care about timing attacks in your application.
Of course, the biggest work on this package is not about the
implementation of the equal
function but a way to check the
constant-time assumption on this function. Using this, we did a
benchmark on Linux,
Windows and Mac to check it.
An interesting fact is that after various experiments, we replaced the initial implementation in C (extracted from OpenBSD's timingsafe_memcmp) with an OCaml implementation behaving in a much more predictable way on all the tested platforms.
Conclusion
The upcoming version 2.0 of Irmin is using ocaml-git to create small applications that push and pull their state to GitHub. We think that Git offers a very nice model to persist data for distributed applications and we hope that more people will use ocaml-git to experiment and manipulate application data in Git. Please send us your feedback!
Open-Source Development
Tarides champions open-source development. We create and maintain key features of the OCaml language in collaboration with the OCaml community. To learn more about how you can support our open-source work, discover our page on GitHub.
Stay Updated on OCaml and MirageOS!
Subscribe to our mailing list to receive the latest news from Tarides.
By signing up, you agree to receive emails from Tarides. You can unsubscribe at any time.