Decompress: The New Decompress API

by Romain Calascibetta on Aug 26th, 2019

RFC 1951 is one of the most used standards. Indeed, when you launch your Linux kernel, it inflates itself according zlib standard, a superset of RFC 1951. Being a widely-used standard, we decided to produce an OCaml implementation. In the process, we learned many lessons about developing OCaml where we would normally use C. So, we are glad to present decompress.

One of the many users of RFC 1951 is Git, which uses it to pack data objects into a PACK file. At the request of @samoht, decompress appeared some years ago as a Mirage-compatible replacement for zlib to be used for compiling a MirageOS unikernel with ocaml-git. Today, this little project passes a major release with substantial improvements in several domains.

decompress provides an API for inflating and deflating flows[1]. The main goal is to provide a platform-agnostic library: one which may be compiled on any platform, including JavaScript. We surely cannot be faster than C implementations like zstd or lz4, but we can play some optimisation tricks to help bridge the gap. Additionally, OCaml can protect the user against lot of bugs via the type-system and the runtime too (e.g. using array bounds checking). ocaml-tls was implemented partly in response to the famous failure of openssl; a vulnerability which could not exist in OCaml.

[1]: A flow, in MirageOS land, is an abstraction which wants to receive and/or transmit something under a standard. So it's usual to say a TLS-flow for example.

API design

The API should be the most difficult part of designing a library - it reveals what we can do and how we should do it. In this way, an API should:

  1. constrain the user to avoid security issues; too much freedom can be a bad thing. As an example, consider the Hashtbl.create function, which allows the user to pass ~random:false to select a fixed hash function. The resulting hashtable suffers deterministic key collisions, which can be exploited by an attacker.

    An example of good security-awareness in API design can be seen in digestif, which provided an unsafe_compare instead of the common compare function (before eqaf.0.5). In this way, it enforced the user to create an alias of it if they want to use a hash in a Map – however, by this action, they should know that they are not protected against a timing-attack.
  2. allow some degrees of freedom to fit within many environments; a constrained API cannot support a hostile context. For example, when compiling to an ESP32 target, even small details such as the length of a stream input buffer must be user-definable. When deploying to a server, memory consumption should be deterministic.

    Of course, this is made difficult when too much freedom will enable misuse of the API – an example is dune which wants consciously to limit the user about what they can do with it.
  3. imply an optimal design of how to use it. Possibilities should serve the user, but these can make the API harder to understand; this is why documentation is important. Your API should tell your users how it wants to be treated.

A dbuenzli API

From our experiences with protocol/format, one design stands out: the dbuenzli API. If you look into some famous libraries in the OCaml eco-system, you probably know uutf, jsonm or xmlm. All of these libraries provide the same API for computing a Unicode/JSON/XML flow – of course, the details are not the same.

From a MirageOS perspective, even if they use the in_channel/out_channel abstraction rather than a Mirage flow, these libraries are system-agnostic since they let the user to choose input and output buffers. Most importantly, they don't use the standard OCaml Unix module, which cannot be used in a unikernel.

The APIs are pretty consistent and try to do their best-effort[2] of decoding. The design has a type state which represents the current system status; the user passes this to decode/encode to carry out the processing. Of course, these functions have a side-effect on the state internally, but this is hidden from the user. One advantage of including states in a design is that the underlying implementation is very amenable to compiler optimisations (e.g. tail-call optimisation). Internally, of course, we have a porcelain[3] implementation where any details can have an rational explanation.

In the beginning, decompress wanted to follow the same interface without the mutability (a choice about performances) and it did. Then, the hard test was to use it in a bigger project; in this case, ocaml-git. An iterative process was used to determine what was really needed, what we should not provide (like special cases) and what we should provide to reach an uniform API that is not too difficult to understand.

From this experience, we finalised the initial decompress API and it did not change significantly for 4 versions (2 years).

[2]: best-effort means an user control on the error branch where we don't leak exception (or more generally, any interrupts)

[3]: porcelain means implicit invariants held in the mind of the programmer (or the assertions/comments).

The new decompress API

The new decompress keeps the same inflation logic, but drastically changes the deflator to make the flush operation clearer. For many purposes, people don't want to hand-craft their compressed flows – they just want of_string/to_string functions. However, in some contexts (like a PNG encoder/decoder), the user should be able to play with decompress in detail (OpenPGP needs this too in RFC 4880).

The Zlib format

Both decompress and zlib use Huffman coding, an algorithm for building a dictionary of variable-length codewords for a given set of symbols (in this case, bytes). The most common byte is assigned the shortest bit sequence; less common bytes are assigned longer codewords. Using this dictionary, we just translate each byte into its codeword and we should achieve a good compression ratio. Of course, there are other details, such as the fact that all Huffman codes are prefix-free. The compression can be taken further with the LZ77 algorithm.

The zlib format, a superset of the RFC 1951 format, is easy to understand. We will only consider the RFC 1951 format, since zlib adds only minor details (such as checksums). It consists of several blocks: DEFLATE blocks, each with a little header, and the contents. There are 3 kinds of DEFLATE blocks:

  • a FLAT block; no compression, just a blit from inputs to the current block.
  • a FIXED block; compressed using a pre-computed Huffman code.
  • a DYNAMIC block; compressed using a user-specified Huffman code.

The FIXED block uses a Huffman dictionary that is computed when the OCaml runtime is initialised. DYNAMIC blocks use dictionaries specified by the user, and so these must be transmitted alongside the data (after being compressed with another Huffman code!). The inflator decompresses this DYNAMIC dictionary and uses it to do the reverse translation from bit sequences to bytes.

Inflator

The design of the inflator did not change a lot from the last version of decompress. Indeed, it's about to take an input, compute it and return an output like a flow. Of course, the error case can be reached.

So the API is pretty-easy:

val decode : decoder -> [ `Await | `Flush | `End | `Malformed of string ]

As you can see, we have 4 cases: one which expects more inputs (Await), the second which asks to the user to flush internal buffer (Flush), the End case when we reach the end of the flow and the Malformed case when we encounter an error.

For each case, the user can do several operations. Of course, about the Await case, they can refill the contents with an other inputs buffer with:

val src : decoder -> bigstring -> off:int -> len:int -> unit

This function provides the decoder a new input with len bytes to read starting at off in the given bigstring.

In the Flush case, the user wants some information like how many bytes are available in the current output buffer. Then, we should provide an action to flush this output buffer. In the end, this output buffer should be given by the user (how many bytes they want to allocate to store outputs flow).

type src = [ `Channel of in_channel | `Manual | `String of string ]

val dst_rem : decoder -> int
val flush : decoder -> unit
val decoder : src -> o:bigstring -> w:bigstring -> decoder

The last function, decoder, is the most interesting. It lets the user, at the beginning, choose the context in which they want to inflate inputs. So they choose:

  • src, where come from inputs flow
  • o, output buffer
  • w, window buffer

o will be used to store inflated outputs, dst_rem will give to us how many bytes inflator stored in o and flush will just set decoder to be able to recompute the flow.

w is needed for lz77 compression. However, as we said, we let the user give us this intermediate buffer. The idea behind that is to let the user prepare an inflation. For example, in ocaml-git, instead of allocating w systematically when we want to decompress a Git object, we allocate w one time per threads and all are able to use it and re-use it. In this way, we avoid systematic allocations (and allocate only once time) which can have a serious impact about performances.

The design is pretty close to one idea, a description step by the decoder function and a real computation loop with the decode function. The idea is to prepare the inflation with some information (like w and o) before the main (and the most expensive) computation. Internally we do that too (but it's mostly about a macro-optimization).

It's the purpose of OCaml in general, be able to have a powerful way to describe something (with constraints). In our case, we are very limited to what we need to describe. But, in others libraries like angstrom, the description step is huge (describe the parser according to the BNF) and then, we use it to the main computation, in the case of angstrom, the parsing (another example is [cmdliner][cmdliner]).

This is why decoder can be considered as the main function where decode can be wrapped under a stream.

Deflator

The deflator is a new (complex) deal. Indeed, behind it we have two concepts:

  • the encoder (according to RFC 1951)
  • the compressor

For this new version of decompress, we decide to separate these concepts where one question leads all: how to put my compression algorithm? (instead to use LZ77).

In fact, if you are interested in compression, several algorithms exist and, in some context, it's preferable to use lzwa for example or rabin's fingerprint (with duff), etc.

Functor

The first idea was to provide a functor which expects an implementation of the compression algorithm. However, the indirection of a functor comes with (big) performance cost. Consider the following functor example:

module type S = sig
  type t
  val add : t -> t -> t
  val one : t
end

module Make (S : S) = struct let succ x = S.add x S.one end

include Make (struct
  type t = int
  let add a b = a + b
  let one = 1
end)

let f x = succ x

Currently, with OCaml 4.07.1, the f function will be a caml_apply2. We might wish for a simple inlining optimisation, allowing f to become an addq instruction (indeed, flambda does this), but optimizing functors is hard. As we learned from Pierre Chambart, it is possible for the OCaml compiler to optimize functors directly, but this requires respecting several constraints that are difficult to respect in practice.

Split encoder and compressor

So, the choice was done to made the encoder which respects RFC 1951 and the compressor under some constraints. However, this is not what zlib did and, by this way, we decided to provide a new design/API which did not follow, in first instance, zlib (or some others implementations like miniz).

To be fair, the choice from zlib and miniz comes from the first point about API and the context where they are used. The main problem is the shared queue between the encoder and the compressor. In C code, it can be hard for the user to deal with it (where they are liable for buffer overflows).

In OCaml and for decompress, the shared queue can be well-abstracted and API can ensure assumptions (like bounds checking).

Even if this design is much more complex than before, coverage tests are better where we can separately test the encoder and the compressor. It breaks down the initial black-box where compression was intrinsec with encoding – which was error-prone. Indeed, decompress had a bug about generation of Huffman codes but we never reached it because the (bad) compressor was not able to produce something (a specific lengh with a specific distance) to get it.

NOTE: you have just read the main reason for the new version of decompress!

The compressor

The compressor is the most easy part. The goal is to produce from an inputs flow, an outputs flow which is an other (more compacted) representation. This representation consists to:

  • A literal, the byte as is
  • A copy code with an offset and a length

The last one say to copy length byte(s) from offset. For example, aaaa can be compressed as [ Literal 'a'; Copy (offset:1, len:3) ]. By this way, instead to have 4 bytes, we have only 2 elements which will be compressed then by an Huffman coding. This is the main idea of the lz77 compression.

However, the compressor should need to deal with the encoder. An easy interface, à la uutf should be:

val compress : state -> [ `Literal of char | `Copy of (int * int) | `End | `Await ]

But as I said, we need to feed a queue instead.


At this point, the purpose of the queue is not clear and not really explained. The signature above still is a valid and understandable design. Then, we can imagine passing Literal and Copy directly to the encoder. However, we should (for performance purpose) use a delaying tactic between the compressor and the deflator4.

Behind this idea, it's to be able to implement an hot-loop on the encoder which will iter inside the shared queue and transmit/encode contents directly to the outputs buffer.


So, when we make a new state, we let the user supply their queue:

val state : src -> w:bistring -> q:queue -> state
val compress : state -> [ `Flush | `Await | `End ]

The Flush case appears when the queue is full. Then, we refind the w window buffer which is needed to produce the Copy code. A copy code is limited according RFC 1951 where offset can not be upper than the length of the window (commonly 32ko). length is limited too to 258 (an arbitrary choice).

Of course, about the Await case, the compressor comes with a src function as the inflator. Then, we added some accessors, literals and distances. The compressor does not build the Huffman coding which needs frequencies, so we need firstly to keep counters about that inside the state and a way to get them (and pass them to the encoder).

[4]: About that, you should be interesting by the reason of why GNU yes is so fast where the secret is just about buffering.

The encoder

Finally, we can talk about the encoder which will take the shared queue filled by the compressor and provide an RFC 1951 compliant output flow.

However, we need to consider a special detail. When we want to make a DYNAMIC block from frequencies and then encode the inputs flow, we can reach a case where the shared queue contains an opcode (a literal or a copy) which does not appear in our dictionary.

In fact, if we want to encode [ Literal 'a'; Literal 'b' ], we will not try to make a dictionary which will contains the 256 possibilities of a byte but we will only make a dictionary from frequencies which contains only 'a' and 'b'. By this way, we can reach a case where the queue contains an opcode (like Literal 'c') which can not be encoded by the pre-determined Huffman coding – remember, the DYNAMIC block starts with the dictionary.

Another point is about inputs. The encoder expects, of course, contents from the shared queue but it wants from the user the way to encode contents: which block we want to emit. So it has two entries:

  • the shared queue
  • an user-entry

So for many real tests, we decided to provide this kind of API:

type dst = [ `Channel of out_channel | `Buffer of Buffer.t | `Manual ]

val encoder : dst -> q:queue -> encoder
val encode : encoder -> [ `Block of block | `Flush | `Await ] -> [ `Ok | `Partial | `Block ]
val dst : encoder -> bigstring -> off:int -> len:int -> unit

As expected, we take the shared queue to make a new encoder. Then, we let the user to specify which kind of block they want to encode by the Block operation.

The Flush operation tries to encode all elements present inside the shared queue according to the current block and feed the outputs buffer. From it, the encoder can returns some values:

  • Ok and the encoder encoded all opcode from the shared queue
  • Partial, the outputs buffer is not enough to encode all opcode, the user should flush it and give to us a new empty buffer with dst. Then, they must continue with the Await operation.
  • Block, the encoder reachs an opcode which can not be encoded with the current block (the current dictionary). Then, the user must continue with a new Block operation.

The hard part is about the ping-pong game between the user and the encoder where a Block expects a Block response from the user and a Partial expects an Await response. But this design reveals something higher about zlib this time: the flush mode.

The flush mode

Firstly, we talk about mode because zlib does not allow the user to decide what they want to do when we reach a Block or a Ok case. So, it defines some under-specified modes to apply a policy of what to do in this case.

In decompress, we followed the same design and see that it may be not a good idea where the logic is not very clear and the user wants may be an another behavior. It was like a black-box with a black-magic.

Because we decided to split encoder and compressor, the idea of the flush mode does not exists anymore where the user explicitly needs to give to the encoder what they want (make a new block? which block? keep frequencies?). So we broke the black-box. But, as we said, it was possible mostly because we can abstract safely the shared queue between the compressor and the encoder.

OCaml is an expressive language and we can really talk about a queue where, in C, it will be just an other array. As we said, the deal is about performance, but now, we allow the user the possibility to write their code in this corner-case which is when they reachs Block. Behaviors depends only on them.

APIs in general

The biggest challenge of building a library is defining the API - you must strike a compromise between allowing the user the flexibility to express their use-case and constraining the user to avoid API misuse. If at all possible, provide an intuitive API: force the user not to need to think about security issues, memory consumption or performance.

Avoid making your API so expressive that it becomes unusable, but beware that this sets hard limits on your users: the current decompress API can be used to build of_string / to_string functions, but the opposite is not true - you definitely cannot build a stream API from of_string / to_string.

The best advice when designing a library is to keep in mind what you really want and let the other details fall into place gradually. It is very important to work in an iterative loop of repeatedly trying to use your library; only this can highlight bad design, corner-cases and details.

Finally, use and re-use it on your tests (important!) and inside higher-level projects to give you interesting questions about your design. The last version of decompress was not used in ocaml-git mostly because the flush mode was unclear.