Decompress: Experiences with OCaml optimization
In our first article we mostly discussed the API design of decompress and did not talk much about the issue of optimizing performance. In this second article, we will relate our experiences of optimizing decompress.
As you might suspect, decompress needs to be optimized a lot. It is used by several projects as an underlying layer of some formats (like Git), so it can be a real bottleneck in those projects. Of course, we start with a footgun by using a garbage-collected language; comparing the performance of decompress with a C implementation (like zlib or miniz) is obviously not very fair.
However, using something like decompress instead of a C implementation can be very interesting for many purposes, especially when thinking about unikernels. As we said in the previous article, we can take advantage of the runtime and the type system to provide something safer (of course, it's not entirely true, since zlib has received several security audits).
The main idea in this article is not to give snippets to copy/paste into your codebase but to explain some behaviors of the compiler / runtime and hopefully give you some ideas about how to optimize your own code. We'll discuss the following optimizations:
- specialization
- inlining
- untagged integers
- exceptions
- unrolling
- hot-loop
- caml_modify
- representation sizes
Cautionary advice
Before we begin discussing optimization, keep this rule in mind:
Only perform optimization at the end of the development process.
An optimization pass can change your code significantly, so you need to keep a state of your project that can be trusted. This state will provide a comparison point for both benchmarks and behaviors. In other words, your stable implementation will be the oracle for your benchmarks. If you start with nothing, you'll achieve arbitrarily-good performance at the cost of arbitrary behavior!
We optimized decompress because we have been using it in bigger projects for a long time (2 years). So we have an oracle (even if zlib can act as an oracle in this special case).
Specialization
One of the biggest specializations in decompress concerns the min function. If you don't know, in OCaml min is polymorphic; you can compare anything. So you may have some concerns about how min is implemented. You are right to be concerned: if you examine the details, min calls the C function do_compare_val, which traverses your structure and does a comparison according to the run-time representation of your structure. Of course, for integers, it should be only a cmpq assembly instruction. However, some simple code like:
let x = min 0 1
will produce this CMM and assembly code:
(let x/1002 (app{main.ml:1,8-15} "camlStdlib__min_1028" 1 3 val)
...)
.L101:
movq $3, %rbx
movq $1, %rax
call camlStdlib__min_1028@PLT
Note that beta-reduction, inlining and specialization were not done in this code. OCaml does not optimize your code very much – the good point is the predictability of the produced assembly output.
If you help the compiler a little bit with:
external ( <= ) : int -> int -> bool = "%lessequal"
let min a b = if a <= b then a else b [@@inline]
let x = min 0 1
We have:
(function{main.ml:2,8-43} camlMain__min_1003 (a/1004: val b/1005: val)
(if (<= a/1004 b/1005) a/1004 b/1005))
(function camlMain__entry ()
(let x/1006 1 (store val(root-init) (+a "camlMain" 8) 1)) 1a)
.L101:
cmpq %rbx, %rax
jg .L100
ret
So we have all the optimizations: in the produced code, x was evaluated as 0 (let x/... (store ... 1), beta-reduction and inlining), and min was specialized to accept only integers – so we are able to emit cmpq.
Results
With specialization, we gained 10 Mb/s on decompression, where min is used in several places. We completely avoid an indirection and a call to the slow do_compare_val function.
This kind of specialization is already done by flambda; however, we currently use OCaml 4.07.1. So we decided to do this kind of optimization ourselves.
Inlining
In the first example, we showed code with the [@@inline] attribute, which is useful to force the compiler to inline a small function. We will go outside the OCaml world and study C code (gcc 5.4.0) to really understand inlining.
In fact, inlining is not necessarily the best optimization. Consider the following (nonsensical) C program:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>

#ifdef HIDE_ALIGNEMENT
__attribute__((noinline, noclone))
#endif
void *
hide(void * p) { return p; }

int main(int ac, const char *av[])
{
  char *s = calloc(1 << 20, 1);
  s = hide(s);

  memset(s, 'B', 100000);

  clock_t start = clock();

  for (int i = 0; i < 1280000; ++i)
    s[strlen(s)] = 'A';

  clock_t end = clock();

  printf("%lld\n", (long long) (end-start));

  return 0;
}
We will compile this code with -O2 (the second level of optimization in C), once with -DHIDE_ALIGNEMENT and once without. The emitted assembly differs:
.L3:
movq %rbp, %rdi
call strlen
subl $1, %ebx
movb $65, 0(%rbp,%rax)
jne .L3
.L3:
movl (%rdx), %ecx
addq $4, %rdx
leal -16843009(%rcx), %eax
notl %ecx
andl %ecx, %eax
andl $-2139062144, %eax
je .L3
In the first output (with -DHIDE_ALIGNEMENT), the optimization pass decides to disable inlining of strlen; in the second output (without -DHIDE_ALIGNEMENT), it decides to inline strlen (and do some other clever optimizations). The reason behind this complex behavior from the compiler is clearly described here.
But what we want to say is that inlining is not an automatic optimization; it might act as a pessimization. This is the goal of flambda: to do the right optimization in the right context. If you are really curious about what gcc does and why, even if it's very interesting, the reverse engineering of the optimization process – and of which information is relevant to the choice to optimize or not – is deep, long and surely too complicated.
A non-obvious optimization is to annotate some parts of your code with [@@inline never] – that is, to explicitly tell the compiler not to inline the function. This constraint helps the compiler to generate smaller code, which has a better chance of fitting into the processor cache.
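For instance, a minimal sketch (a hypothetical cold-path function, not from decompress):

(* the error path is rarely taken: keeping it out of line keeps the
   caller's hot path small enough to fit in the instruction cache *)
let report_error msg = prerr_endline msg ; exit 1 [@@inline never]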
For all of these reasons, [@@inline] should be used sparingly, and an oracle to compare performance with and without inlining a given function is necessary to avoid a pessimization.
In decompress
Inlining in decompress was done on small functions which need to allocate to return a value. If we inline them, we can take the opportunity to store the returned value in registers (of course, it depends on how many registers are free).
As we said, the goal of the inflator is to translate a bit sequence into a byte. The largest possible bit sequence according to RFC 1951 has length 15. So, when we process the input flow, we consume it 15 bits at a time. For each packet, we want to recognize an existing associated bit sequence; the bound values will then be the real length of the bit sequence and the byte:
val find : bits:int -> { len: int; byte: int; }
So for each call to this function, we need to allocate a record/tuple. That's why we chose to inline this function. min was inlined too, along with some other small functions. But as we said, the situation is complex; where we think inlining can help us, it's not systematically true.
NOTE: we can recognize bit sequences with, at most, 15 bits because a Huffman coding is prefix-free.
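To make the idea concrete, here is a minimal sketch of the shape of such a function – the lens and bytes tables are hypothetical, not decompress's real ones:

type decoded = { len : int; byte : int }

(* hypothetical lookup tables, indexed by the next (at most) 15 bits of input *)
let lens = Array.make (1 lsl 15) 0
let bytes = Array.make (1 lsl 15) 0

(* with [@@inline], the returned record can stay in registers at the
   call site instead of being allocated on the heap *)
let find ~bits = { len = lens.(bits); byte = bytes.(bits) } [@@inline]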
Untagged integers
When reading the assembly, the integer 0 is written as $1. This is because of the GC bit needed to differentiate a pointer from an unboxed integer. This is why, in OCaml, we talk about 31-bit integers or 63-bit integers (depending on your architecture).
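Concretely, an OCaml integer n is stored at the machine level as 2 * n + 1; the least significant bit marks it as an immediate value rather than a pointer. A tiny illustration:

(* the tagging scheme: 0 is represented as 1 ($1 in the assembly above)
   and 1 is represented as 3 ($3) *)
let repr n = (n lsl 1) lor 1
let () = assert (repr 0 = 1 && repr 1 = 3)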
We will not try to start a debate about this arbitrary choice for the representation of integers in OCaml. However, we can talk about some operations which can have an impact on performance.
The biggest example is the mod operation. Between OCaml and C, % or mod should be the same:
let f a b = a mod b
The output assembly is:
.L105:
movq %rdi, %rcx
sarq $1, %rcx // b >> 1
movq (%rsp), %rax
sarq $1, %rax // a >> 1
testq %rcx, %rcx // b != 0
je .L107
cqto
idivq %rcx // a % b
jmp .L106
.L107:
movq caml_backtrace_pos@GOTPCREL(%rip), %rax
xorq %rbx, %rbx
movl %ebx, (%rax)
movq caml_exn_Division_by_zero@GOTPCREL(%rip), %rax
call caml_raise_exn@PLT
.L106:
salq $1, %rdx // x << 1
incq %rdx // x + 1
movq %rbx, %rax
where idiomatically the same C code produces:
.L2:
movl -12(%rbp), %eax
cltd
idivl -8(%rbp)
movl %edx, -4(%rbp)
Of course, we first notice the exception in OCaml (Division_by_zero) – which is pretty good because it protects us against an interrupt from assembly (and keeps the trace). Then, we need to untag a and b with the sarq assembly operation. We do, as in the C code, an idiv, and then we must retag the returned value x with salq and incq.
So in some parts, it could be more interesting to use Nativeint. However, by default, a nativeint is boxed. Boxed means that the value is allocated in the OCaml heap alongside a header.
Of course, this is not what we want. So, if our nativeint ref (used to have a side effect, like x) stays inside a function, and then you return the real value with the dereference operator !, OCaml, by a good planet alignment, can directly use registers and real integers. So it should be possible to avoid these conversions.
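Here is a hedged sketch of the pattern (a hypothetical function, not decompress's real code): the nativeint ref never escapes the function, so the compiler can keep it unboxed in a register, and only the final dereference pays a conversion:

let sum_bytes buf len =
  let acc = ref 0n in              (* stays unboxed, in a register *)
  for i = 0 to len - 1 do
    acc := Nativeint.add !acc (Nativeint.of_int (Char.code (Bytes.get buf i)))
  done ;
  Nativeint.to_int !acc            (* a single conversion at the end *)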
Readability versus performance
We use this optimization in only a few parts of the code. In fact, switching between int and nativeint is a little bit noisy:
hold := Nativeint.logor !hold Nativeint.(shift_left (of_int (unsafe_get_uint8 d.i !i_pos)) !bits)
In the end, we only gained 0.5 Mb/s of inflation rate, so it's not worthwhile to do this optimization systematically, especially since the gain is not very big. But this case shows a more troubling problem: loss of readability.
In fact, we can optimize code (OCaml or C) more and more, but step by step we lose readability. You should be afraid of the implementation of strlen, for example. In the end, the loss of readability makes it harder to understand the purpose of the code, leading to errors whenever some other person (or you in 10 years' time) tries to make a change.
And we think that this kind of optimization is not the OCaml way in general, where we prefer to produce understandable and abstracted code rather than cryptic and super-fast code.
Again, flambda wants to fix this problem and let the compiler do this optimization. The goal is to be able to write fast code without any pain.
Exceptions
If you remember our article about the release of base64, we talked a bit about exceptions and used them as a jump. In fact, it's pretty common for an OCaml developer to break the control flow with an exception. Behind this common design/optimization lies the calling convention. Indeed, choosing the word jump to describe OCaml exceptions is not the best fit, since we don't use setjmp/longjmp.
In detail, when you enter a try .. with, OCaml saves a trap on the stack which contains information about the with, the catcher. Then, when you raise, you jump directly to this trap and can just discard several stack frames (and, in this way, you don't need to check each return code).
In several places and mostly in the hot-loop, we use this pattern. However, it completely breaks the control flow and can be error-prone.
To limit errors, and because this pattern is common, we prefer to use a local exception which can be used only inside the function. In this way, we enforce the fact that the exception cannot be caught by anything other than the code inside the function.
let exception Break in
(try
   while !max >= 1 do
     if bl_count.(!max) != 0 then raise_notrace Break ;
     decr max
   done
 with Break -> ()) ;
The code above produces this assembly:
.L105:
pushq %r14
movq %rsp, %r14
.L103:
cmpq $3, %rdi // while !max >= 1
jl .L102
movq -4(%rbx,%rdi,4), %rsi // bl_count,(!max)
cmpq $1, %rsi // bl_count.(!max) != 0
je .L104
movq %r14, %rsp
popq %r14
ret // raise_notrace Break
.L104:
addq $-2, %rdi // decr max
movq %rdi, 16(%rsp)
jmp .L103
Here, the ret is the raise_notrace Break. A raise_notrace is needed; otherwise, you will see:
movq caml_backtrace_pos@GOTPCREL(%rip), %rbx
xorq %rdi, %rdi
movl %edi, (%rbx)
call caml_raise_exn@PLT
instead of the ret instruction. Indeed, in that case, we need to store where we raised the exception.
Unrolling
When we showed the optimization done by gcc when the string is aligned, gcc did another optimization: instead of setting the string byte per byte, it decided to update it 4 bytes at a time. This kind of optimization is an unroll, and we did it in decompress.
Indeed, when we reach a copy opcode emitted by the lz77 compressor, we want to blit length byte(s) from a source to the output flow. It turns out that this memcpy can be optimized to copy 4 bytes at a time – 4 bytes is generally a good choice since it's the size of an int32 and should fit on any architecture.
let blit src src_off dst dst_off len =
  (* an overlapping copy (dst_off - src_off < 4) must stay byte per byte *)
  if dst_off - src_off < 4
  then slow_blit src src_off dst dst_off len
  else
    let len0 = len land 3 in (* number of trailing bytes *)
    let len1 = len asr 2 in  (* number of 4-byte words *)

    for i = 0 to len1 - 1 do
      let i = i * 4 in
      let v = unsafe_get_uint32 src (src_off + i) in
      unsafe_set_uint32 dst (dst_off + i) v
    done ;

    for i = 0 to len0 - 1 do
      let i = len1 * 4 + i in
      let v = unsafe_get_uint8 src (src_off + i) in
      unsafe_set_uint8 dst (dst_off + i) v
    done
In this code, we first copy 4 bytes at a time, and if len is not a multiple of 4, we then run the trailing loop to copy the remaining bytes one by one. In this context, OCaml can unbox the int32 and use registers. So this function does not deal with the heap and, in this way, with the garbage collector.
Results
In the end, we gained an extra 10 Mb/s of inflation rate. The blit function is the most important function when it comes to inflating the window to an output flow. Like the specialization of the min function, this is one of the biggest optimizations in decompress.
Hot-loop
A common design in decompression (but we can find it in hash implementations too) is the hot-loop. A hot-loop is mainly a loop over the most common operation in your process. In the context of decompress, the hot-loop is the repeated translation of bit sequences into byte(s), from the input flow to the output flow and the window.
The main idea behind the hot-loop is to initialize all the information needed for the translation before starting the hot-loop. Then, it's mostly an imperative loop with a pattern-matching which corresponds to the current state of the global computation.
In OCaml, we can take this opportunity to use int ref (or nativeint ref); they will then be translated into registers (which are the fastest place to store something).
Another goal inside the hot-loop is to avoid any allocation – and that's why we talk about int or nativeint. Indeed, a more complex structure like an option will add an allocation point for the garbage collector (a call to caml_call_gc).
Of course, this kind of design is completely wrong if we think in a functional way. However, this is the (biggest?) advantage of OCaml: hiding this ugly/hacky part behind a functional interface.
In the API, we talked about a state which represents the inflation (or the deflation). At the beginning, the goal is to store essential values, like the position in the input flow, the available bits, the dictionary, etc., into some references. Then, we launch the hot-loop, and only at the end do we update the state.
So we keep the optimal design for inflation inside the hot-loop and the functional way outside it.
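As a minimal sketch of this design (with a hypothetical state record, much simpler than decompress's real one):

type state = { mutable pos : int; mutable acc : int }

let hot_loop st (input : int array) =
  let pos = ref st.pos in          (* hoisted into locals: they can live *)
  let acc = ref st.acc in          (* in registers for the whole loop *)
  let len = Array.length input in
  while !pos < len do
    acc := !acc + input.(!pos) ;   (* the per-element work, simplified *)
    incr pos
  done ;
  st.pos <- !pos ;                 (* the state is updated only once, *)
  st.acc <- !acc                   (* at the very end *)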
caml_modify
One issue that we need to consider is the call to caml_modify. In fact, for a complex data structure like an int array or an int option (so, anything other than an integer, a boolean or another immediate value), values can move to the major heap.
In this context, caml_modify is used to assign a new value into your mutable block. It is a bit slower than a simple assignment, but it is needed to ensure pointer correspondence between the minor heap and the major heap.
With this OCaml code for example:
type t = { mutable v : int option }
let f t v = t.v <- v
We produce this assembly:
camlExample__f_1004:
subq $8, %rsp
movq %rax, %rdi
movq %rbx, %rsi
call caml_modify@PLT
movq $1, %rax
addq $8, %rsp
ret
Here we see the call to caml_modify, which takes care of the assignment of v into t.v. This call is needed mostly because the type of t.v is not an immediate value like an integer. So, for many values in the inflator and the deflator, we mostly use integers.
Of course, at some points, we use int array and set it at some specific points of the inflator – where we inflate the dictionary. However, the impact of caml_modify is not very clear, since it is commonly pretty fast.
Sometimes, however, it can be a real bottleneck in your computation, and this depends on how long your values live in the heap. A little program (which is not very reproducible) can show that:
let t = Array.init (int_of_string Sys.argv.(1)) (fun _ -> Random.int 256)
let pr fmt = Format.printf fmt

type t0 = { mutable v : int option }
type t1 = { v : int option }

let f0 (t0 : t0) =
  for i = 0 to Array.length t - 1 do
    let v = match t0.v, t.(i) with
      | Some _ as v, _ -> v
      | None, 5 -> Some i
      | None, _ -> None in
    t0.v <- v
  done ; t0

let f1 (t1 : t1) =
  let t1 = ref t1 in
  for i = 0 to Array.length t - 1 do
    let v = match !t1.v, t.(i) with
      | Some _ as v, _ -> v
      | None, 5 -> Some i
      | None, _ -> None in
    t1 := { v }
  done ; !t1

let () =
  let t0 : t0 = { v= None } in
  let t1 : t1 = { v= None } in
  let time0 = Unix.gettimeofday () in
  ignore (f0 t0) ;
  let time1 = Unix.gettimeofday () in
  ignore (f1 t1) ;
  let time2 = Unix.gettimeofday () in
  pr "f0: %f ns\n%!" (time1 -. time0) ;
  pr "f1: %f ns\n%!" (time2 -. time1) ;
  ()
On our bare-metal server, if you launch the program with 1000, the f0 computation, even though it involves caml_modify, will be the fastest. However, if you launch the program with 1000000000, f1 will be the fastest.
$ ./a.out 1000
f0: 0.000006 ns
f1: 0.000015 ns
$ ./a.out 1000000000
f0: 7.931782 ns
f1: 5.719370 ns
About decompress
At the beginning, our choice was, as @dbuenzli does, to have a mutable structure to represent the state. Then @yallop did a big patch to update it to an immutable state, and we gained 9 Mb/s on inflation.
However, the new version is more focused on the hot-loop and it is 3 times faster than before.
As we said, the deal with caml_modify is not clear; it depends a lot on how long your data lives in the heap and how many times you want to update it. If we localize caml_modify to only a few places, it should be fine. But it remains one of the most complex questions about (macro?) optimization.
Smaller representation
We've discussed the impact that integer types can have on the use of immediate values. More generally, the choice of type to represent your values can have significant performance implications.
For example, a dictionary which associates a bit sequence (an integer) with both its length AND the byte can be represented by an (int * int) array, or more idiomatically a { len: int; byte: int; } array (which is structurally the same).
However, that means one allocation per byte to represent every byte. Extracting an element will need an allocation if find : bits:int -> { len: int; byte: int; } is not inlined, as we said. And regarding memory, the array can be really heavy in your heap.
At this point, we used spacetime to see how many blocks we allocated for a common inflation, and we saw that we allocate a lot. The choice was made to use a smaller representation. Since len cannot be greater than 15 according to RFC 1951, and byte can represent only 256 possibilities (and so should fit in one byte), we can decide to merge them into one integer (which has, at least, 31 bits).
let static_literal_tree = [| (8, 12); (8, 140); (8, 76); ... |]

let static_literal_tree =
  Array.map (fun (len, byte) -> (len lsl 8) lor byte) static_literal_tree
In the code above, we just translate the static dictionary (for a STATIC DEFLATE block) into a smaller representation where len is the left part of the integer and byte is the right part. Of course, it depends on what you want to store.
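Reading the two fields back from the packed integer then costs only a shift and a mask, with no allocation (hypothetical helper names):

let len v = v lsr 8      (* the upper part: at most 15, so it fits *)
let byte v = v land 0xff (* the lower 8 bits *)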
Another point is readability. cstruct-ppx and bitstring can help you, but decompress wants to depend only on OCaml.
Conclusion
We conclude with some closing advice about optimizing your OCaml programs:

- Optimization is specific to your task. The points highlighted in this article may not fit your particular problem, but they are intended to give you ideas. Our optimizations were only possible because we completely assimilated the ideas of zlib and had a clear vision of what we really needed to optimize (like blit). As a first project, this article cannot help you much to optimize your code, since it's mostly about micro-optimization in a specific context (the hot-loop). But it helps you to understand what the compiler really does – which is still really interesting.
- Optimize only with respect to an oracle. All optimizations were done because we had a comparison point: the old implementation of decompress and zlib as oracles. Optimizations can change the semantics of your code, and you should systematically take care, at every step, about expected behaviors. So it's a long run.
- Use the predictability of the OCaml compiler to your advantage. For sure, the compiler does not optimize your code a lot – but it still produces realistic programs if we think about performance. In many cases, you don't need to optimize your OCaml code. And the good point is the expected behavior. The mind-link between OCaml and the assembly exists (much more than between C and the assembly sometimes, where we let the C compiler optimize the code). The cool fact is being able to easily keep a mental model of what is going on in your code, without being afraid of what the compiler can produce. And, in some critical parts like eqaf, it's really needed.
We have not discussed benchmarking, which is another hard issue: what should you compare against? Where? How? For example, a global comparison between zlib and decompress is not very relevant in many ways – especially because of the garbage collector. This could be another article!
Finally, all of these optimizations should be done by flambda; the difference between compiling decompress with or without flambda is not very big. We optimized decompress by hand mostly to keep compatibility with standard OCaml (since flambda needs another switch) and, in this way, to gain an understanding of flambda's optimizations so that we can use it effectively!