Decompress: The New Decompress API
RFC 1951 is one of the most widely used standards. Indeed, when you launch your Linux kernel, it inflates itself according to the zlib standard, a superset of RFC 1951. Since it is such a widely-used standard, we decided to produce an OCaml implementation. In the process, we learned many lessons about developing in OCaml where we would normally use C. So, we are glad to present decompress.
One of the many users of RFC 1951 is Git, which uses it to pack data objects into a PACK file. At the request of @samoht, decompress appeared some years ago as a MirageOS-compatible replacement for zlib, to be used when compiling a MirageOS unikernel with ocaml-git. Today, this little project passes a major release with substantial improvements in several domains.
decompress provides an API for inflating and deflating flows[1]. The main goal is to provide a platform-agnostic library: one which may be compiled on any platform, including JavaScript. We surely cannot be faster than C implementations like zstd or lz4, but we can play some optimisation tricks to help bridge the gap. Additionally, OCaml can protect the user against many bugs via the type system and the runtime (e.g. array bounds checking). ocaml-tls was implemented partly in response to the famous failure of openssl; a vulnerability of that kind could not exist in OCaml.
[1]: A flow, in MirageOS land, is an abstraction which receives and/or transmits something under a standard. So it is usual to speak of a TLS-flow, for example.
API design
The API is probably the most difficult part of designing a library – it reveals what we can do and how we should do it. In this way, an API should:
- constrain the user to avoid security issues; too much freedom can be a bad thing. As an example, consider the `Hashtbl.create` function, which allows the user to pass `~random:false` to select a fixed hash function. The resulting hashtable suffers deterministic key collisions, which can be exploited by an attacker. An example of good security-awareness in API design can be seen in digestif, which provided an `unsafe_compare` instead of the common `compare` function (before `eqaf.0.5`). In this way, it forced the user to create an alias of it if they wanted to use a hash in a `Map` – however, by this action, they should know that they are not protected against a timing attack.
- allow some degrees of freedom to fit within many environments; a constrained API cannot support a hostile context. For example, when compiling to an ESP32 target, even small details such as the length of a stream input buffer must be user-definable. When deploying to a server, memory consumption should be deterministic. Of course, this is made difficult because too much freedom enables misuse of the API – an example is dune, which consciously limits what the user can do with it.
- imply an optimal design of how to use it. Possibilities should serve the user, but they can make the API harder to understand; this is why documentation is important. Your API should tell your users how it wants to be treated.
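The timing-attack point can be illustrated with a toy comparison. This is only a sketch, not eqaf's actual implementation; `leaky_equal` and `ct_equal` are hypothetical names:

```ocaml
(* A naive comparison returns as soon as it finds a mismatch, so its
   running time leaks the index of the first differing byte. *)
let leaky_equal a b =
  String.length a = String.length b
  && (try
        String.iteri (fun i c -> if b.[i] <> c then raise Exit) a;
        true
      with Exit -> false)

(* A constant-time comparison always scans the whole string, folding
   all byte differences together with [lxor]/[lor]. *)
let ct_equal a b =
  String.length a = String.length b
  && begin
       let diff = ref 0 in
       String.iteri
         (fun i c -> diff := !diff lor (Char.code c lxor Char.code b.[i]))
         a;
       !diff = 0
     end
```

Both functions agree on the result; only their timing behaviour differs, which is exactly what an API exposing only the safe variant protects against.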
A dbuenzli API
From our experience with protocols/formats, one design stands out: the dbuenzli API. If you look into some famous libraries in the OCaml ecosystem, you probably know uutf, jsonm or xmlm. All of these libraries provide the same API for computing a Unicode/JSON/XML flow – of course, the details are not the same.
From a MirageOS perspective, even if they use the `in_channel`/`out_channel` abstraction rather than a Mirage flow, these libraries are system-agnostic since they let the user choose the input and output buffers. Most importantly, they don't use the standard OCaml `Unix` module, which cannot be used in a unikernel.
The APIs are pretty consistent and try to do a best-effort[2] decoding. The design has a type `state` which represents the current system status; the user passes this to `decode`/`encode` to carry out the processing. Of course, these functions mutate the state internally, but this is hidden from the user. One advantage of including states in a design is that the underlying implementation is very amenable to compiler optimisations (e.g. tail-call optimisation). Internally, of course, we have a porcelain[3] implementation where every detail has a rational explanation.
In the beginning, decompress wanted to follow the same interface without the mutability (a choice made for performance reasons), and it did. Then, the hard test was to use it in a bigger project; in this case, ocaml-git. An iterative process was used to determine what was really needed, what we should not provide (like special cases), and what we should provide to reach a uniform API that is not too difficult to understand.
From this experience, we finalised the initial decompress API, and it did not change significantly for 4 versions (2 years).
[2]: best-effort means the user keeps control of the error branch and we don't leak exceptions (or, more generally, any interrupts).
[3]: porcelain means implicit invariants held in the mind of the programmer (or in assertions/comments).
The new decompress API
The new decompress keeps the same inflation logic, but drastically changes the deflator to make the flush operation clearer. For many purposes, people don't want to hand-craft their compressed flows – they just want `of_string`/`to_string` functions. However, in some contexts (like a PNG encoder/decoder), the user should be able to play with decompress in detail (OpenPGP, in RFC 4880, needs this too).
The Zlib format
Both decompress and zlib use Huffman coding, an algorithm
for building a dictionary of variable-length codewords for a given set of
symbols (in this case, bytes). The most common byte is assigned the shortest bit
sequence; less common bytes are assigned longer codewords. Using this
dictionary, we just translate each byte into its codeword and we should achieve
a good compression ratio. Of course, there are other details, such as the fact
that all Huffman codes are prefix-free. The compression can be
taken further with the LZ77 algorithm.
The zlib format, a superset of the RFC 1951 format, is easy to understand. We will only consider the RFC 1951 format, since zlib adds only minor details (such as checksums). It consists of several blocks: DEFLATE blocks, each with a little header, and the contents. There are 3 kinds of DEFLATE blocks:
- a FLAT block; no compression, just a blit from inputs to the current block.
- a FIXED block; compressed using a pre-computed Huffman code.
- a DYNAMIC block; compressed using a user-specified Huffman code.
The FIXED block uses a Huffman dictionary that is computed when the OCaml runtime is initialised. DYNAMIC blocks use dictionaries specified by the user, and so these must be transmitted alongside the data (after being compressed with another Huffman code!). The inflator decompresses this DYNAMIC dictionary and uses it to do the reverse translation from bit sequences to bytes.
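To make the Huffman idea concrete, here is a toy construction – a sketch of the textbook algorithm only, not decompress's implementation (which works with canonical code lengths, not string codewords): repeatedly merge the two least-frequent subtrees, then read codewords off the root-to-leaf paths.

```ocaml
(* Toy Huffman construction on a (frequency, tree) work-list. *)
type tree = Leaf of char | Node of tree * tree

let huffman (freqs : (char * int) list) : tree =
  let rec build = function
    | [] -> invalid_arg "huffman: empty alphabet"
    | [ (_, t) ] -> t
    | trees ->
      (* Merge the two least-frequent subtrees into one node. *)
      (match List.sort (fun (a, _) (b, _) -> compare a b) trees with
       | (f1, t1) :: (f2, t2) :: rest ->
         build ((f1 + f2, Node (t1, t2)) :: rest)
       | _ -> assert false)
  in
  build (List.map (fun (c, f) -> (f, Leaf c)) freqs)

(* Read the codewords: going left appends a '0', going right a '1'. *)
let rec codewords prefix = function
  | Leaf c -> [ (c, prefix) ]
  | Node (l, r) -> codewords (prefix ^ "0") l @ codewords (prefix ^ "1") r
```

On `[ ('a', 5); ('b', 2); ('c', 1) ]`, the most frequent byte `'a'` gets a 1-bit codeword while `'b'` and `'c'` get 2 bits each, and – because codewords are leaf paths – no codeword is a prefix of another.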
Inflator
The design of the inflator did not change much from the last version of decompress. Indeed, it takes an input, computes it and returns an output, like a flow. Of course, the error case can be reached. So the API is pretty easy:
val decode : decoder -> [ `Await | `Flush | `End | `Malformed of string ]
As you can see, we have 4 cases: one which expects more input (`Await`), a second which asks the user to flush the internal buffer (`Flush`), the `End` case when we reach the end of the flow, and the `Malformed` case when we encounter an error.
For each case, the user can do several operations. In the `Await` case, they can refill the decoder with another input buffer using:
val src : decoder -> bigstring -> off:int -> len:int -> unit
This function provides the decoder a new input of `len` bytes to read, starting at `off` in the given `bigstring`.
In the `Flush` case, the user wants some information, like how many bytes are available in the current output buffer. Then, we should provide an action to flush this output buffer. Finally, this output buffer should be given by the user (so they decide how many bytes they want to allocate to store the output flow).
type src = [ `Channel of in_channel | `Manual | `String of string ]
val dst_rem : decoder -> int
val flush : decoder -> unit
val decoder : src -> o:bigstring -> w:bigstring -> decoder
The last function, `decoder`, is the most interesting. It lets the user, at the beginning, choose the context in which they want to inflate inputs. They choose:

- `src`, where the input flow comes from
- `o`, the output buffer
- `w`, the window buffer

`o` will be used to store the inflated output, `dst_rem` tells us how much space remains unused in `o` (and so how many bytes the inflator stored), and `flush` just resets the decoder so that it can continue to compute the flow.
`w` is needed for the LZ77 window. However, as we said, we let the user give us this intermediate buffer. The idea behind that is to let the user prepare an inflation. For example, in ocaml-git, instead of allocating `w` systematically whenever we want to decompress a Git object, we allocate `w` once per thread, and every decompression in that thread is able to use and re-use it. In this way, we avoid systematic allocations (we allocate only once), which can have a serious impact on performance.
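Wrapping `decode` in a stream then looks like the loop below. This is a sketch against the signatures quoted above, functorised over a hypothetical `INF` module (decompress's real module names differ), and it assumes `dst_rem` returns the unused space left in `o`:

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* The inflator signatures quoted in the article, as an abstract module. *)
module type INF = sig
  type decoder
  val decode : decoder -> [ `Await | `Flush | `End | `Malformed of string ]
  val src : decoder -> bigstring -> off:int -> len:int -> unit
  val dst_rem : decoder -> int
  val flush : decoder -> unit
end

module Inflate (Inf : INF) = struct
  (* Drive [decode] to completion: refill on [`Await], drain on [`Flush]. *)
  let run ~refill ~write decoder (o : bigstring) =
    let rec go () =
      match Inf.decode decoder with
      | `Await ->
        let buf, len = refill () in
        Inf.src decoder buf ~off:0 ~len;
        go ()
      | `Flush ->
        (* [o] currently holds [dim o - dst_rem] inflated bytes. *)
        write o (Bigarray.Array1.dim o - Inf.dst_rem decoder);
        Inf.flush decoder;
        go ()
      | `End ->
        let len = Bigarray.Array1.dim o - Inf.dst_rem decoder in
        if len > 0 then write o len;
        Ok ()
      | `Malformed err -> Error err
    in
    go ()
end
```

The caller supplies `refill` (producing the next input buffer) and `write` (consuming flushed output), which is what makes the same loop work over channels, strings or sockets.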
The design is built around one idea: a description step via the `decoder` function, and the real computation loop via the `decode` function. The idea is to prepare the inflation with some information (like `w` and `o`) before the main (and most expensive) computation. Internally we do that too (but it's mostly a macro-optimisation).
This is a purpose of OCaml in general: to have a powerful way to describe something (with constraints). In our case, what we need to describe is very limited. But in other libraries, like angstrom, the description step is huge (describing the parser according to the BNF) and then we use it for the main computation – in the case of angstrom, the parsing (another example is cmdliner).
This is why `decoder` can be considered the main function, while `decode` can be wrapped in a stream.
Deflator
The deflator is a new (and complex) deal. Indeed, behind it we have two concepts:
- the encoder (according to RFC 1951)
- the compressor
For this new version of decompress, we decided to separate these concepts, led by one question: how can the user plug in their own compression algorithm (instead of using LZ77)?
In fact, if you are interested in compression, several algorithms exist and, in some contexts, it is preferable to use lzwa, for example, or Rabin's fingerprint (with duff), etc.
Functor
The first idea was to provide a functor which expects an implementation of the compression algorithm. However, the indirection of a functor comes with a (big) performance cost. Consider the following functor example:
module type S = sig
type t
val add : t -> t -> t
val one : t
end
module Make (S : S) = struct let succ x = S.add x S.one end
include Make (struct
type t = int
let add a b = a + b
let one = 1
end)
let f x = succ x
Currently, with OCaml 4.07.1, the `f` function goes through `caml_apply2`. We might wish for a simple inlining optimisation, allowing `f` to become an `addq` instruction (indeed, flambda does this), but optimising functors is hard. As we learned from Pierre Chambart, it is possible for the OCaml compiler to optimise functors directly, but this requires respecting several constraints that are difficult to respect in practice.
Split encoder and compressor
So, the choice was made to split the encoder, which respects RFC 1951, from the compressor, which operates under some constraints. However, this is not what zlib did, and so we decided to provide a new design/API which does not follow, in the first instance, zlib (or other implementations like miniz).
To be fair, the choices made by zlib and miniz come from the first point about APIs and the context in which they are used. The main problem is the shared queue between the encoder and the compressor. In C code, it can be hard for the user to deal with it (they become liable for buffer overflows).
In OCaml, and for decompress, the shared queue can be well abstracted, and the API can enforce assumptions (like bounds checking).
Even if this design is much more complex than before, test coverage is better, since we can test the encoder and the compressor separately. It breaks down the initial black-box, where compression was intrinsic to encoding – which was error-prone. Indeed, decompress had a bug in its generation of Huffman codes, but we never reached it because the (bad) compressor was not able to produce the input (a specific length with a specific distance) that triggered it.
NOTE: you have just read the main reason for the new version of decompress!
The compressor
The compressor is the easiest part. The goal is to produce, from an input flow, an output flow which is another (more compact) representation. This representation consists of:
- A literal, the byte as is
- A copy code with an offset and a length
The latter says to copy length byte(s) from offset bytes back. For example, aaaa can be compressed as [ Literal 'a'; Copy (offset:1, len:3) ]. This way, instead of having 4 bytes, we have only 2 elements, which will then be compressed by Huffman coding. This is the main idea of LZ77 compression.
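A tiny sketch (a hypothetical `op` type, not decompress's) shows how these two opcodes replay into the original bytes; note how an overlapping copy, where the offset is smaller than the length, reproduces a run of repeated bytes:

```ocaml
(* Replay literal/copy opcodes into a string. [Copy (off, len)] re-reads
   [len] bytes starting [off] bytes back from the current end of output;
   when [off < len] the copy overlaps itself, repeating recent bytes. *)
type op = Literal of char | Copy of int * int (* offset, length *)

let apply (ops : op list) : string =
  let out = Buffer.create 16 in
  List.iter
    (function
      | Literal c -> Buffer.add_char out c
      | Copy (off, len) ->
        for _ = 1 to len do
          Buffer.add_char out (Buffer.nth out (Buffer.length out - off))
        done)
    ops;
  Buffer.contents out
```

Here `apply [ Literal 'a'; Copy (1, 3) ]` returns `"aaaa"`: each of the 3 copied bytes re-reads the byte emitted just before it.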
However, the compressor needs to deal with the encoder. An easy interface, à la uutf, would be:
val compress : state -> [ `Literal of char | `Copy of (int * int) | `End | `Await ]
But as we said, we need to feed a queue instead.
At this point, the purpose of the queue may not seem clear. The signature above is still a valid and understandable design; we could imagine passing `Literal` and `Copy` directly to the encoder. However, we should (for performance purposes) use a delaying tactic between the compressor and the deflator[4].
The idea is to be able to implement a hot-loop in the encoder which iterates over the shared queue and transmits/encodes its contents directly into the output buffer.
So, when we make a new `state`, we let the user supply their queue:
val state : src -> w:bigstring -> q:queue -> state
val compress : state -> [ `Flush | `Await | `End ]
The `Flush` case appears when the queue is full. Here we meet again the `w` window buffer, which is needed to produce the `Copy` code. A copy code is limited according to RFC 1951: the offset cannot be larger than the length of the window (commonly 32k), and the length is limited to 258 (an arbitrary choice).
Of course, for the `Await` case, the compressor comes with a `src` function, as in the inflator. Then, we added some accessors, `literals` and `distances`. The compressor does not build the Huffman coding, which needs frequencies, so we first need to keep counters for them inside the state, and a way to get them (and pass them to the encoder).
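As a sketch of this bookkeeping (with hypothetical types; decompress exposes it through the `literals` and `distances` accessors on its state), counting frequencies as opcodes are emitted is all the encoder needs to later build a DYNAMIC dictionary:

```ocaml
(* Count how often each literal byte and each copy distance occurs in a
   flow of opcodes: these frequencies are exactly the input Huffman
   coding needs. *)
let frequencies ops =
  let literals = Array.make 256 0 in
  let distances = Hashtbl.create 16 in
  List.iter
    (function
      | `Literal c ->
        literals.(Char.code c) <- literals.(Char.code c) + 1
      | `Copy (off, _len) ->
        let n = try Hashtbl.find distances off with Not_found -> 0 in
        Hashtbl.replace distances off (n + 1))
    ops;
  (literals, distances)
```

Keeping the counters inside the compressor state, as decompress does, means the user never recomputes them: they are ready whenever the encoder starts a new block.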
[4]: On that subject, you may be interested in the reason why GNU yes is so fast, where the secret is just buffering.
The encoder
Finally, we can talk about the encoder, which takes the shared queue filled by the compressor and produces an RFC 1951 compliant output flow.
However, we need to consider a special detail. When we want to make a DYNAMIC block from frequencies and then encode the inputs flow, we can reach a case where the shared queue contains an opcode (a literal or a copy) which does not appear in our dictionary.
In fact, if we want to encode [ Literal 'a'; Literal 'b' ], we will not try to make a dictionary containing all 256 possible bytes; we will only make a dictionary from the frequencies, containing only 'a' and 'b'. This way, we can reach a case where the queue contains an opcode (like Literal 'c') which cannot be encoded by the pre-determined Huffman coding – remember, the DYNAMIC block starts with the dictionary.
Another point is about inputs. The encoder expects, of course, contents from the shared queue, but it also wants the user to say how to encode those contents: which block we want to emit. So it has two entries:
- the shared queue
- a user entry
So, after many real tests, we decided to provide this kind of API:
type dst = [ `Channel of out_channel | `Buffer of Buffer.t | `Manual ]
val encoder : dst -> q:queue -> encoder
val encode : encoder -> [ `Block of block | `Flush | `Await ] -> [ `Ok | `Partial | `Block ]
val dst : encoder -> bigstring -> off:int -> len:int -> unit
As expected, we take the shared queue to make a new encoder. Then, we let the user specify which kind of block they want to encode via the `Block` operation. The `Flush` operation tries to encode all elements present inside the shared queue according to the current block and feeds the output buffer. From there, the encoder can return three values:
- `Ok`: the encoder encoded all opcodes from the shared queue.
- `Partial`: the output buffer is not large enough to encode all opcodes; the user should flush it and give us a new empty buffer with `dst`. Then, they must continue with the `Await` operation.
- `Block`: the encoder reached an opcode which cannot be encoded with the current block (the current dictionary). Then, the user must continue with a new `Block` operation.
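This protocol can be sketched as a small driver loop. All the names here are hypothetical: `encode` stands for the article's `encode` partially applied to an encoder, `next_block` chooses the next block, and `flush_output` empties the output buffer (calling `dst`):

```ocaml
(* Drive the encoder protocol: a [`Partial] answer is followed by an
   [`Await] operation (after flushing the output buffer), a [`Block]
   answer by a new [`Block] operation, and [`Ok] terminates the loop. *)
let rec drive ~encode ~next_block ~flush_output op =
  match encode op with
  | `Ok -> ()
  | `Partial ->
    flush_output ();
    drive ~encode ~next_block ~flush_output `Await
  | `Block ->
    drive ~encode ~next_block ~flush_output (`Block (next_block ()))
```

The loop makes the ping-pong explicit: every answer from the encoder determines exactly one legal next operation from the user.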
The hard part is the ping-pong game between the user and the encoder, where a `Block` return expects a `Block` response from the user, and a `Partial` return expects an `Await` response. But this design reveals something deeper about zlib this time: the flush mode.
The flush mode
Firstly, we talk about a mode because zlib does not allow the user to decide what they want to do when we reach a `Block` or an `Ok` case. Instead, it defines some under-specified modes to apply a policy of what to do in these cases.
In decompress, we followed the same design and saw that it may not be a good idea: the logic is not very clear, and the user may want another behaviour. It was like a black-box driven by black magic.
Because we decided to split the encoder and the compressor, the idea of the flush mode no longer exists: the user explicitly tells the encoder what they want (make a new block? which block? keep frequencies?). So we broke the black-box. But, as we said, this was possible mostly because we can safely abstract the shared queue between the compressor and the encoder.
OCaml is an expressive language and we can really talk about a queue where, in C, it would just be another array. As we said, the deal is about performance, but now we give the user the possibility of writing their own code for this corner-case, which is when they reach `Block`. The behaviour depends only on them.
APIs in general
The biggest challenge of building a library is defining the API – you must strike a compromise between allowing the user the flexibility to express their use-case and constraining the user to avoid API misuse. If at all possible, provide an intuitive API: the user should not need to think about security issues, memory consumption or performance.
Avoid making your API so expressive that it becomes unusable, and beware that these choices set hard limits for your users: the current decompress API can be used to build `of_string`/`to_string` functions, but the opposite is not true – you definitely cannot build a stream API on top of `of_string`/`to_string`.
The best advice when designing a library is to keep in mind what you really want and let the other details fall into place gradually. It is very important to work in an iterative loop of repeatedly trying to use your library; only this can highlight bad design, corner-cases and details.
Finally, use and re-use it in your tests (important!) and inside higher-level projects; this will raise interesting questions about your design. The last version of decompress was not used in ocaml-git mostly because the flush mode was unclear.