Decompress: Experiences with OCaml optimization

by Romain Calascibetta on Sep 13th, 2019

In our first article we mostly discussed the API design of decompress and did not talk much about performance optimization. In this second article, we relate our experiences of optimizing decompress.

As you might suspect, decompress needs to be optimized a lot. It is used by several projects as an underlying layer for some formats (like Git), so it can be a real bottleneck in those projects. Of course, we start at a disadvantage by using a garbage-collected language; comparing the performance of decompress with a C implementation (like zlib or miniz) is obviously not very fair.

However, using something like decompress instead of a C implementation can be very interesting for many purposes, especially when thinking about unikernels. As we said in the previous article, we can take advantage of the runtime and the type system to provide something safer (of course, safer is relative here, since zlib has received several security audits).

The main idea in this article is not to give snippets to copy/paste into your codebase but to explain some behaviors of the compiler / runtime and hopefully give you some ideas about how to optimize your own code. We'll discuss the following optimizations:

  • specialization
  • inlining
  • untagged integers
  • exceptions
  • unrolling
  • hot-loop
  • caml_modify
  • representation sizes

Cautionary advice

Before we begin discussing optimization, keep this rule in mind:

Only perform optimization at the end of the development process.

An optimization pass can change your code significantly, so you need to keep a state of your project that can be trusted. This state will provide a comparison point for both benchmarks and behaviors. In other words, your stable implementation will be the oracle for your benchmarks. If you start with nothing, you'll achieve arbitrarily-good performance at the cost of arbitrary behavior!

We optimized decompress because we have been using it in bigger projects for a long time (2 years), so we have an oracle (even if zlib can also act as an oracle in this special case).

Specialization

One of the biggest specializations in decompress concerns the min function. In case you didn't know, min in OCaml is polymorphic; you can compare anything. So you should probably have some concerns about how min is implemented.

You are right to be concerned: if you look at the details, min calls the C function do_compare_val, which traverses your structure and compares according to its run-time representation. Of course, for integers, it should be only a cmpq assembly instruction. However, some simple code like:

let x = min 0 1

will produce this CMM and assembly code:

(let x/1002 (app{main.ml:1,8-15} "camlStdlib__min_1028" 1 3 val)
   ...)
.L101:
        movq    $3, %rbx
        movq    $1, %rax
        call    camlStdlib__min_1028@PLT

Note that beta-reduction, inlining and specialization were not done in this code. OCaml does not optimize your code very much; the good point is the predictability of the produced assembly output.

If you help the compiler a little bit with:

external ( <= ) : int -> int -> bool = "%lessequal"
let min a b = if a <= b then a else b [@@inline]

let x = min 0 1

We have:

(function{main.ml:2,8-43} camlMain__min_1003 (a/1004: val b/1005: val)
 (if (<= a/1004 b/1005) a/1004 b/1005))

(function camlMain__entry ()
 (let x/1006 1 (store val(root-init) (+a "camlMain" 8) 1)) 1a)
.L101:
        cmpq    %rbx, %rax
        jg      .L100
        ret

So we get all the optimizations: in the produced code, x was evaluated to 0 at compile time, as (let x/... (store ... 1)) shows (1 is the tagged representation of 0, the result of beta-reduction and inlining), and min was specialized to accept only integers, so we are able to emit cmpq.

Results

With specialization, we gained 10 Mb/s on decompression; min is used in several places. We completely avoid an indirection and a call to the slow do_compare_val function.

This kind of specialization is already done by flambda; however, we currently use OCaml 4.07.1, so we decided to do this kind of optimization ourselves.

Inlining

In the first example, we showed code with the [@@inline] attribute, which is useful to force the compiler to inline a small function. We will now step outside the OCaml world and study some C code (gcc 5.4.0) to really understand inlining.

In fact, inlining is not necessarily the best optimization. Consider the following (nonsensical) C program:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>

#ifdef HIDE_ALIGNEMENT
__attribute__((noinline, noclone))
#endif
void *
hide(void * p) { return p; }

int main(int ac, const char *av[])
{
  char *s = calloc(1 << 20, 1);
  s = hide(s);

  memset(s, 'B', 100000);

  clock_t start = clock();

  for (int i = 0; i < 1280000; ++i)
    s[strlen(s)] = 'A';

  clock_t end = clock();

  printf("%lld\n", (long long) (end-start));

  return 0;
}

We will compile this code with -O2 (the second level of optimization in C), once with -DHIDE_ALIGNEMENT and once without. The emitted assembly differs. With -DHIDE_ALIGNEMENT, strlen remains an out-of-line call:

.L3:
	movq	%rbp, %rdi
	call	strlen
	subl	$1, %ebx
	movb	$65, 0(%rbp,%rax)
	jne	.L3
and without -DHIDE_ALIGNEMENT, strlen is inlined and vectorized to scan 4 bytes at a time:

.L3:
	movl	(%rdx), %ecx
	addq	$4, %rdx
	leal	-16843009(%rcx), %eax
	notl	%ecx
	andl	%ecx, %eax
	andl	$-2139062144, %eax
	je	.L3

In the first output (with -DHIDE_ALIGNEMENT), the optimization pass decides to disable inlining of strlen; in the second output (without -DHIDE_ALIGNEMENT), it decides to inline strlen (and do some other clever optimizations). The reason behind this complex behavior of the compiler is clearly described here.

What we want to say is that inlining is not automatically an optimization; it can act as a pessimization. This is the goal of flambda: do the right optimization in the right context. If you are really curious about what gcc does and why, the reverse engineering of the optimization process, and of which information drives the choice to optimize or not, is deep, long and surely too complicated for this article, even if it is very interesting.

A counter-intuitive optimization is to annotate some parts of your code with [@@inline never], explicitly telling the compiler not to inline the function. This constraint helps the compiler generate smaller code, which has a better chance of fitting in the processor cache.
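In OCaml, the annotation reads like this; the cold-path helper below is hypothetical, not a function from decompress:

(* A cold path: keeping it out-of-line keeps its callers small. *)
let invalid_distance d =
  failwith (Printf.sprintf "invalid distance: %d" d)
[@@inline never]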

For all of these reasons, [@@inline] should be used sparingly, and an oracle to compare performance with and without inlining a given function is necessary to avoid a pessimization.

In decompress

Inlining in decompress was done on small functions which need to allocate to return a value. If we inline them, the returned value can be kept in registers (of course, it depends on how many registers are free).

As we said, the goal of the inflator is to translate a bit sequence into a byte. The largest possible bit sequence according to RFC 1951 has length 15. So, when we process the input flow, we read it 15 bits at a time. For each 15-bit packet, we want to recognize an associated bit sequence; the values bound to it are then the real length of the bit sequence and the decoded byte:

type decode = { len : int; byte : int }
val find : bits:int -> decode

So for each call to this function, we need to allocate a record/tuple. That's why we chose to inline this function. min was inlined too, along with some other small functions. But as we said, the situation is complex: where we think that inlining can help us, it is not systematically true.

NOTE: we can recognize a bit sequence with, at most, 15 bits because Huffman codes are prefix-free.
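To illustrate, here is a minimal sketch of such an inlined lookup. The decoding logic is hypothetical (the real find reads a Huffman table); the point is the [@@inline] annotation on a small function that returns a record:

(* After inlining, the compiler has the opportunity to keep [len]
   and [byte] in registers instead of allocating a [decode] record
   at every call site. *)
let find ~bits =
  let len = bits land 0xf in           (* hypothetical decoding *)
  let byte = (bits lsr 4) land 0xff in
  { len; byte }
[@@inline]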

Untagged integers

When reading the assembly above, the integer 0 is written as $1 (and 1 as $3). This is because of the GC bit needed to differentiate a pointer from an unboxed integer. This is why, in OCaml, we talk about 31-bit or 63-bit integers (depending on your architecture).

We will not try to start a debate about this arbitrary choice of integer representation in OCaml. However, we can talk about some operations on tagged integers which have an impact on performance.

The biggest example is the mod operation. Between OCaml and C, % and mod should behave the same:

let f a b = a mod b

The output assembly is:

.L105:
        movq    %rdi, %rcx
        sarq    $1, %rcx     // b >> 1
        movq    (%rsp), %rax
        sarq    $1, %rax     // a >> 1
        testq   %rcx, %rcx   // b != 0
        je      .L107
        cqto
        idivq   %rcx         // a % b
        jmp     .L106
.L107:
        movq    caml_backtrace_pos@GOTPCREL(%rip), %rax
        xorq    %rbx, %rbx
        movl    %ebx, (%rax)
        movq    caml_exn_Division_by_zero@GOTPCREL(%rip), %rax
        call    caml_raise_exn@PLT
.L106:
        salq    $1, %rdx     // x << 1
        incq    %rdx         // x + 1
        movq    %rbx, %rax

while the idiomatically equivalent C code produces:

.L2:
        movl    -12(%rbp), %eax
        cltd
        idivl   -8(%rbp)
        movl    %edx, -4(%rbp)

Of course, we first notice the exception in OCaml (Division_by_zero), which is pretty good because it protects us from a processor fault (and keeps the backtrace). Then, we need to untag a and b with the sarq instruction. We do idivq, as the C code does, and then we must retag the result x with salq and incq.

So in some parts, it can be more interesting to use Nativeint. However, by default, a nativeint is boxed: the value is allocated in the OCaml heap alongside a header.

Of course, this is not what we want. But if our nativeint ref (used for side effects, like x) stays inside a function, and we only return the real value with the dereference operator !, then OCaml, when the stars align, can directly use registers and untagged machine integers. So it is possible to avoid these conversions.
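For instance, here is a minimal sketch (the function is illustrative, not from decompress) where the nativeint ref never escapes, so the compiler can keep it unboxed in a register inside the loop:

let checksum buf =
  let acc = ref 0n in
  for i = 0 to Bytes.length buf - 1 do
    acc := Nativeint.add !acc (Nativeint.of_int (Char.code (Bytes.get buf i)))
  done ;
  (* Only the dereferenced value escapes, not the reference itself. *)
  Nativeint.to_int !acc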

Readability versus performance

We use this optimization in only a few parts of the code. In fact, switching between int and nativeint is a little noisy:

hold := Nativeint.logor !hold Nativeint.(shift_left (of_int (unsafe_get_uint8 d.i !i_pos)) !bits)

In the end, we only gained 0.5 Mb/s of inflation rate, so it is not worthwhile to apply this optimization systematically, especially since the gain is not very big. But this case shows a more troubling problem: loss of readability.

In fact, we can optimize code (OCaml or C) more and more, but step by step we lose readability. You should be scared by the implementation of strlen, for example. In the end, the loss of readability makes it harder to understand the purpose of the code, leading to errors whenever some other person (or you, in 10 years' time) tries to make a change.

And we think that this kind of optimization is not the OCaml way in general, where we prefer to produce understandable and abstracted code rather than cryptic and super-fast code.

Again, flambda wants to fix this problem by letting the compiler do this optimization. The goal is to be able to write fast code without any pain.

Exceptions

If you remember our article about the release of base64, we talked a bit about exceptions and used them as jumps. In fact, it is pretty common for an OCaml developer to break the control flow with an exception. Behind this common design/optimization lies the calling convention.

Indeed, choosing the word jump to describe OCaml exceptions is not quite right, since OCaml does not use setjmp/longjmp.

In detail, when you enter a try .. with, OCaml saves a trap on the stack which contains information about the with clause, the handler. Then, when you raise, you jump directly to this trap and can simply discard several stack frames at once (and, this way, you avoid checking a return code at each level).

We use this pattern in several places, mostly in the hot-loop. However, it completely breaks the control flow and can be error-prone.

To limit errors, and because this pattern is common, we prefer to use a local exception which is used only inside the function. This way, we enforce that the exception cannot be caught by anything other than the handler inside the function.

    let exception Break in

    ( try while !max >= 1 do
          if bl_count.(!max) != 0 then raise_notrace Break
        ; decr max done with Break -> () ) ;

The code above produces this assembly:

.L105:
        pushq   %r14
        movq    %rsp, %r14
.L103:
        cmpq    $3, %rdi              // while !max >= 1
        jl      .L102
        movq    -4(%rbx,%rdi,4), %rsi // bl_count.(!max)
        cmpq    $1, %rsi              // bl_count.(!max) != 0
        je      .L104
        movq    %r14, %rsp
        popq    %r14
        ret                           // raise_notrace Break
.L104:
        addq    $-2, %rdi             // decr max
        movq    %rdi, 16(%rsp)
        jmp     .L103

Here the ret is the raise_notrace Break. raise_notrace is needed; otherwise, you would see:

        movq    caml_backtrace_pos@GOTPCREL(%rip), %rbx
        xorq    %rdi, %rdi
        movl    %edi, (%rbx)
        call    caml_raise_exn@PLT

instead of the ret instruction. Indeed, in that case, the runtime needs to record where the exception was raised.

Unrolling

When we showed the optimization done by gcc when the string is aligned, gcc did another optimization: instead of setting the string byte by byte, it updates it 4 bytes at a time.

This kind of optimization is an unroll, and we did it in decompress. Indeed, when we reach a copy opcode emitted by the lz77 compressor, we want to blit len byte(s) from a source to the output flow. This memcpy can be optimized to copy 4 bytes at a time; 4 bytes is generally a good choice, since it is the size of an int32 and fits on any architecture.

let blit src src_off dst dst_off len =
  if dst_off - src_off < 4
  then slow_blit src src_off dst dst_off len
  else
    let len0 = len land 3 in
    let len1 = len asr 2 in

    for i = 0 to len1 - 1
    do
      let i = i * 4 in
      let v = unsafe_get_uint32 src (src_off + i) in
      unsafe_set_uint32 dst (dst_off + i) v ;
    done ;

    for i = 0 to len0 - 1
    do
      let i = len1 * 4 + i in
      let v = unsafe_get_uint8 src (src_off + i) in
      unsafe_set_uint8 dst (dst_off + i) v ;
    done

In this code, we first copy 4 bytes at a time and then, if len is not a multiple of 4, a trailing loop copies the remaining bytes one by one. In this context, OCaml can unbox the int32 values and use registers, so this function never touches the heap and, therefore, never involves the garbage collector.

Results

In the end, we gained an extra 10 Mb/s of inflation rate. The blit function is the most important function when it comes to blitting from the window to the output flow. Like the specialization of min, this is one of the biggest optimizations in decompress.

hot-loop

A common design in decompression (but we can find it in hash implementations too) is the hot-loop. A hot-loop is mainly a loop over the most common operation in your process. In the context of decompress, the hot-loop is the repeated translation of bit sequences into byte(s), from the input flow to the output flow and the window.

The main idea behind the hot-loop is to initialize all the information needed for the translation before entering the loop. Then, it is mostly an imperative loop with a pattern match on the current state of the global computation.

In OCaml, we can take this opportunity to use int ref (or nativeint ref) values, which can then be translated into registers (the fastest place to store anything).

Another rule inside the hot-loop is to avoid any allocation; this is why we talk about int or nativeint. Indeed, allocating a more complex structure like an option will regularly wake up the garbage collector (a call to caml_call_gc).
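A minimal illustration of the difference, with hypothetical helpers rather than decompress code:

(* Each [Some _] allocates a block in the minor heap: a hot-loop
   producing options will regularly hit caml_call_gc. *)
let next_opt pos limit = if pos < limit then Some pos else None

(* Allocation-free alternative: encode the absent case as -1. *)
let next_int pos limit = if pos < limit then pos else -1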

Of course, this kind of design is completely wrong if we think in a functional way. However, this is the (biggest?) advantage of OCaml: hiding this ugly/hacky part behind a functional interface.

In the API, we talked about a state which represents the inflation (or the deflation). At the beginning, we store essential values into some references: the position in the input flow, the available bits, the dictionary, etc. Then we launch the hot-loop, and only at the end do we update the state.

So we keep the optimal design for inflation inside the hot-loop, and the functional way outside of it.
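Here is a minimal sketch of this pattern; the field names (i_pos, hold, bits) and the loop body are illustrative, not decompress's actual state:

type state = { mutable i_pos : int; mutable hold : int; mutable bits : int }

let inflate d =
  (* Load the state into local references: good candidates for registers. *)
  let i_pos = ref d.i_pos in
  let hold = ref d.hold in
  let bits = ref d.bits in
  while !bits > 0 do
    (* ... translate bit sequences into bytes here, mutating only the
       local references and allocating nothing ... *)
    incr i_pos ;
    hold := !hold lsr 1 ;
    decr bits
  done ;
  (* Write back into the state only once, at the end. *)
  d.i_pos <- !i_pos ;
  d.hold <- !hold ;
  d.bits <- !bits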

caml_modify

One issue that we need to consider is the call to caml_modify. In fact, complex data structures like an int array or an int option (that is, anything other than an integer, a boolean or another immediate value) can move to the major heap.

In this context, caml_modify is used to assign a new value to your mutable block. It is a bit slower than a simple assignment, but it is needed to keep the pointers between the minor heap and the major heap consistent.

With this OCaml code for example:

type t = { mutable v : int option }

let f t v = t.v <- v

We produce this assembly:

camlExample__f_1004:
        subq    $8, %rsp
        movq    %rax, %rdi
        movq    %rbx, %rsi
        call    caml_modify@PLT
        movq    $1, %rax
        addq    $8, %rsp
        ret

Here we see the call to caml_modify, which takes care of the assignment of v into t.v. This call is needed because the type of t.v is not an immediate value like an integer. So, for many values in the inflator and the deflator, we mostly use integers.
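By contrast, here is what the immediate case looks like, a sketch mirroring the example above with an int field:

type t' = { mutable v' : int }

(* An int is an immediate value: this assignment compiles to a
   single store, with no call to caml_modify. *)
let f' t v = t.v' <- v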

Of course, at some points we use int array values and set them at specific places in the inflator, where we inflate the dictionary. However, the impact of caml_modify is not very clear, since it is commonly pretty fast.

Sometimes, however, it can be a real bottleneck in your computation, depending on how long your values live in the heap. A little program (which is not very reproducible) can show this:

let t = Array.init (int_of_string Sys.argv.(1)) (fun _ -> Random.int 256)

let pr fmt = Format.printf fmt

type t0 = { mutable v : int option }
type t1 = { v : int option }

let f0 (t0 : t0) =
  for i = 0 to Array.length t - 1
  do let v = match t0.v, t.(i) with
             | Some _ as v, _ -> v
             | None, 5 -> Some i
             | None, _ -> None in
     t0.v <- v
  done; t0

let f1 (t1 : t1) =
  let t1 = ref t1 in
  for i = 0 to Array.length t - 1
  do let v = match !t1.v, t.(i) with
             | Some _ as v, _ -> v
             | None, 5 -> Some i
             | None, _ -> None in
     t1 := { v }
  done; !t1

let () =
  let t0 : t0 = { v= None } in
  let t1 : t1 = { v= None } in
  let time0 = Unix.gettimeofday () in
  ignore (f0 t0) ;
  let time1 = Unix.gettimeofday () in
  ignore (f1 t1) ;
  let time2 = Unix.gettimeofday () in

  pr "f0: %f ns\n%!" (time1 -. time0) ;
  pr "f1: %f ns\n%!" (time2 -. time1) ;

  ()

On our bare-metal server, if you launch the program with 1000, the f0 computation is the fastest, even though it calls caml_modify. However, if you launch the program with 1000000000, f1 is the fastest.

$ ./a.out 1000
f0: 0.000006 s
f1: 0.000015 s
$ ./a.out 1000000000
f0: 7.931782 s
f1: 5.719370 s

About decompress

At the beginning, we chose, as @dbuenzli does, a mutable structure to represent the state. Then, @yallop contributed a big patch to move it to an immutable state, and we gained 9 Mb/s on inflation.

However, the new version is more focused on the hot-loop, and it is 3 times faster than before.

As we said, the trade-off around caml_modify is not clear-cut and depends a lot on how long your data lives in the heap and how many times you update it. If we confine caml_modify to only a few places, it should be fine. But it is still one of the most complex questions about (macro?) optimization.

Smaller representation

We've discussed the impact that integer types can have on the use of immediate values. More generally, the choice of type to represent your values can have significant performance implications.

For example, a dictionary which associates a bit sequence (an integer) with both its length AND the byte it encodes can be represented by an (int * int) array, or more idiomatically a { len : int; byte : int; } array (which is structurally the same).

However, that means one allocation per entry. Extracting an entry will also need an allocation if find is not inlined, as we said. And memory-wise, such an array can be really heavy in your heap.

At this point, we used spacetime to see how many blocks we allocated during a common inflation, and we saw that we allocate a lot. So we chose a smaller representation. Since len cannot be greater than 15 according to RFC 1951, and byte has only 256 possible values (so it fits in one byte), we can merge them into one integer (which has at least 31 bits):

let static_literal_tree = [| (8, 12); (8, 140); (8, 76); ... |]
let static_literal_tree = Array.map (fun (len, byte) -> (len lsl 8) lor byte) static_literal_tree

In the code above, we just translate the static dictionary (for a STATIC DEFLATE block) into the smaller representation, where len is the high part of the integer and byte the low part. Of course, it depends on what you want to store.
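Reading the two fields back is then just a shift and a mask, with no allocation; a minimal sketch:

(* Unpack the merged integer: [len] lives in the high bits,
   [byte] in the low 8 bits. *)
let len v = v lsr 8
let byte v = v land 0xff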

Another point, again, is readability. cstruct-ppx and bitstring can help you here, but decompress wants to depend only on OCaml.

Conclusion

We conclude with some closing advice about optimizing your OCaml programs:

  • Optimization is specific to your task. The points highlighted in this article may not fit your particular problem, but they are intended to give you ideas. Our optimizations were only possible because we had completely assimilated the ideas of zlib and had a clear vision of what we really needed to optimize (like blit).

    If this is your first project, this article cannot help you much with optimizing your code, since it is mostly about micro-optimization in a specific context (the hot-loop). But it does help you understand what the compiler really does, which is still really interesting.

  • Optimize only with respect to an oracle. All optimizations were done against a comparison point: the old implementation of decompress, and zlib, as oracles. Optimizations can change the semantics of your code, and you should systematically check the expected behaviors at every step. So it's a long run.

  • Use the predictability of the OCaml compiler to your advantage. For sure, the compiler does not optimize your code much, but it still produces programs with realistic performance. In many cases, you don't need to optimize your OCaml code at all. And the good point is that you get the expected behavior.

    A mental link between OCaml and the assembly really exists (much more than between C and assembly, sometimes, where we let the C compiler optimize the code). The cool part is that you can easily keep a mental model of what is going on in your code without being afraid of what the compiler might produce. And in some critical parts, like eqaf, this is really needed.

We have not discussed benchmarking, which is another hard issue: what should you compare against? where? how? For example, a global comparison between zlib and decompress is not very relevant in many ways, especially because of the garbage collector. This could be another article!

Finally, all of these optimizations should eventually be done by flambda; indeed, the difference between compiling decompress with or without flambda is not very big. We optimized decompress by hand mostly to keep compatibility with stock OCaml (since flambda needs another opam switch) and, in this way, to gain an understanding of flambda's optimizations so that we can use it effectively!