Feb 8, 2019 by Romain C.
Release of Base64
MirageOS is a library operating system written from the ground up in OCaml.
It has an impossible and incredibly huge goal to re-implement all of the
world! Looking back at the work accomplished by the MirageOS team, it appears that's
what happened for several years. Re-implementing the entire stack, in particular
the lower layers that we often take for granted, requires a great attention to
detail. While it may seem reasonably easy to implement a given RFC, a huge
amount of work is often hidden under the surface.
In this article, we will explain the development process we went through, as we
updated a small part of the MirageOS stack: the library
ocaml-base64. It's a
suitable example as the library is small (few hundreds lines of code), but it
needs ongoing development to ensure good quality and to be able to trust it for
higher level libraries (like mrmime).
Updating the library was instigated by a problem I ran into with the existing
base64 implementation while working on the e-mail stack. Indeed, we got some
errors when we tried to compute an encoded-word according to the RFC
2047. So after several years of not being touched, we decided to
The Critique of Pure Reason
The first problem
We started by attempting to use
ocaml-base64 on some examples extracted from
actual e-mails, and we quickly ran into cases where the library failed. This
highlighted that reality is much more complex than you can imagine from reading
an RFC. In this situation, what do you do: try to implement a best-effort
strategy and continue parsing? Or stick to the letter of the RFC and fail? In
the context of e-mails, which has accumulated a lot of baggage over time, you
cannot get around implementing a best-effort strategy.
The particular error we were seeing was a
Not_found exception when decoding an
encoded-word. This exception appeared because the implementation relied on
String.contains, and the input contained a character which was not part of the
This was the first reason why we thought it necessary to rewrite
Of course, we could just catch the exception and continue the initial
computation, but then another reason appeared.
The second problem
As @clecat and I reviewed RFC 2045, we noticed the following
The encoded output stream must be represented in lines of no more than 76
See RFC 2045, section 6.8
Pretty specific, but general to e-mails, we should never have more than 78
characters per line according to RFC 822, nor more than 998 characters
according to RFC 2822.
Having a decoder that abided RFC 2045 more closely, including the requirement
above, further spurred us to implement a new decoder.
As part of the new implementation, we decided to implement tests and fuzzers to
ensure correctness. This also had the benefit, that we could run the fuzzer on
the existing codebase. When fuzzing an encoder/decoder pair, an excellent check
is whether the following isomorphism holds:
let iso0 input = assert (decode (encode input) = input)
let iso1 input = assert (encode (decode input) = input)
However, at this point @hannesm ran into another error (see
The third problem
We started to review the `nocrypto` implementation of base64, which
respects our requirements. We had some concerns about the performance of the
implementation though, so we decided to see if we would get a performance
regression by switching to this implementation.
A quick benchmark based on random input revealed the opposite, however!
nocrypto's implementation was faster than
ocaml-base64's implementation on bytes (length: 5000): 466 272.34ns
nocrypto's implementation on bytes (length: 5000): 137 406.04ns
Based on all these observations, we thought there was sufficient reason to
ocaml-base64 implementation. It's also worth mentioning that
the last real release (excluding
topkg updates) is from Dec.
24 2014. So, it's pretty old code and the OCaml eco-system has improved a lot
Implementation & review
We started integrating the
nocrypto implementation. Of course, implementing
RFC 4648 is not as easy as just reading examples and trying to do
something which works. The devil is in the detail.
@hannesm and @cfcs decided to do a big review of expected behavior
according to the RFC, and another about implementation and security issues.
The biggest problem about RFC 4648 is regarding canonical inputs. Indeed, there
are cases where two different inputs are associated with the same value:
let a = Base64.decode "Zm9vCg==" ;;
- : string = "foo\n"
let b = Base64.decode "Zm9vCh==" ;;
- : string = "foo\n"
This is mostly because the base64 format encodes the input 6 bits at a time. The
result is that 4 base64 encoded bytes are equal to 3 decoded bytes (
6 * 4 = 8 *
3). Because of this, 2 base64 encoded bytes provide 1 byte plus 4 bits. What do
we need to do with these 4 bits? Nothing.
That's why the last character in our example can be something else than
is the canonical byte to indicate using the 2 bits afterward the 6 bits
C (and make a byte - 8 bits). But
h can be used where we just
need 2 bits at the end.
Due to this behavior, the check used for fuzzing changes: from a canonical
input, we should check isomorphism.
As mentioned above ("The first problem"), how should invalid characters be
handled? This happens when decoding a byte which is not a part of the base64
alphabet. In the old version,
ocaml-base64 would simply leak a
The MirageOS team has taken a stance on exceptions, which is
to "use exceptions for exceptional conditions" - invalid input is hardly one of
those. This is to avoid any exception leaks, as it can be really hard to track
the origin of an exception in a unikernel. Because of this, several packages
have been updated to return a
result type instead, and we wanted the new
implementation to follow suit.
On the other hand, exceptions can be useful when considered as a more
constrained form of assembly jump. Of course, they break the control flow, but
from a performance point of view, it's interesting to use this trick:
let contains str chr =
let idx = ref 0 in
let len = String.length str in
try while !idx < len
do if String.unsafe_get str !idx = chr then raise Found ; incr idx done ;
with Found -> Some !idx
This kind of code for example is ~20% faster than
As such, exceptions can be a useful tool for performance optimizations, but we
need to be extra careful not to expose them to the users of the library. This
code needs to be hidden behind a fancy functional interface. With this in mind,
we should assert that our
decode function never leaks an exception. We'll
describe how we've adressed this problem later.
RFC 4648 has some detailed cases and while we would sometimes like to work in a
perfect world where we will never need to deal with such errors, from our
experience, we cannot imagine what the end-user will do to formats, protocols
Even though the RFC has detailed examples, we have to read between lines to know
special cases and how to deal with them.
@hannesm noticed one of these cases, where padding (
= sign at the end of
input) is not mandatory:
The pad character "=" is typically percent-encoded when used in an URI ,
but if the data length is known implicitly, this can be avoided by skipping
the padding; see section 3.2.
See RFC 4648, section 5
That mostly means that the following kind of input can be valid:
let a = Base64.decode ~pad:false "Zm9vCg"
- : string = "foo\n"
It's only valid in a specific context though: when length is known implicitly.
Only the caller of
decode can determine whether the length is implicitly known
such that padding can be omitted. To that end, we've added a new optional
?pad to the function
Xavier Leroy has described the garbage collector in the following way:
You see, the Caml garbage collector is like a god from ancient mythology:
mighty, but very irritable. If you mess with it, it'll make you suffer in
That's probably why my experience with improving the allocation policy of
ocaml-git)ocaml-git was quite a nightmare. Allowing the user to control
allocation is important for efficiency, and we wanted to
ocaml-base64 to be a
At the beginning,
ocaml-base64 had a very simple API:
val decode : string -> string
val encode : string -> string
This API forces allocations in two ways.
Firstly, if the caller needs to encode a part of a string, this part needs to be
extracted, e.g. using
String.sub, which will allocate a new string. To avoid
this, two new optional arguments have been added to
which specifies the substring to encode. Here's an example:
(* We want to encode the part 'foo' without prefix or suffix *)
(* Old API -- forces allocation *)
Base64.encode (String.sub "prefix foo suffix" 7 3) ;;
- : string = "Zm9v"
(* New API -- avoids allocation *)
Base64.encode ~off:7 ~len:3 "prefix foo suffix" ;;
- : string = "Zm9v"
Secondly, a new string is allocated to hold the resulting string. We can
calculate a bound on the length of this string in the following manner:
let (//) x y =
if y < 1 then raise Division_by_zero ;
if x > 0 then 1 + ((x - 1) / y) else 0
let encode input =
let res = Bytes.create (String.length input // 3 * 4) in
let decode input =
let res = Bytes.create (String.length input // 4 * 3) in
Unfortunately we cannot know the exact length of the result prior to computing
it. This forces a call to
String.sub at the end of the computation to return a
string of the correct length. This means we have two allocations rather than
one. To avoid the additional allocation, [@avsm][avsm] proposed to provide a new
sub = string * int * int. This lets the user call
required (and allocate a new string), or use simply use the returned
_blit_ting to another buffer or similar.
There's a strong trend of fuzzing libraries for MirageOS, which is quite easy
thanks to the brilliant work by @yomimono and @stedolan!
The integrated fuzzing in OCaml builds on American fuzzy lop, which is
very smart about discarding paths of execution that have already been tested and
generating unseen inputs which break your assumptions. My first experience with
fuzzing was with the library `decompress`, and I was impressed by
precise error it found about a name clash.
Earlier in this article, I listed some properties we wanted to check for
- The functions
decode should be be isomorphic taking
canonicalization into account:
let iso0 input =
match Base64.decode ~pad:false input with
| Error _ -> fail ()
| Ok result0 ->
let result1 = Base64.encode_exn result0 in
match Base64.decode ~pad:true result1 with
| Error _ -> fail ()
| Ok result2 -> check_eq result0 result2
let iso1 input =
let result = Base64.encode_exn input in
match Base64.decode ~pad:true result0 with
| Error _ -> fail ()
| Ok result1 ->
let result2 = Base64.encode_exn result1 in
check_eq result0 result2
- The function
decode should never raise an exception, but rather return a
let no_exn input =
try ignore @@ Base64.decode input with _ -> fail ()
- And finally, we should randomize
?len arguments to ensure that we
don't get an
Out_of_bounds exception when accessing input.
Just because we've applied fuzzing to the new implementation for a long time, it
doesn't mean that the code is completely infallible. People can use our library
in an unimaginable way (and it's mostly what happens in the real world) and get
an unknowable error.
But, with the fuzzer, we've managed to test some properties across a very wide
range of input instead of unit testing with random (or not so random) inputs
from our brains. This development process allows fixing the semantics of
implementations (even if it's not a formal definition of semantics), but
it's better than nothing or outdated documentation.
Based on our recent update to
ocaml-base64, this blog post explains our
development process as go about rewriting the world to MirageOS, one bit at a
time. There's an important point to be made though:
ocaml-base64 is a small project. Currently, the implementation is about 250
lines of code. So it's a really small project. But as stated in the
introduction, we are fortunate enough to push the restart button of the computer
world - yes, we want to make a new operating system.
That's a massive task, and we shouldn't make it any harder on ourselves than
necessary. As such, we need to justify any step, any line of code, and why we
decided to spend our time on any change (why we decided to re-implement
for example). So before committing any time to projects, we try to do a deep
analysis of the problem, get feedback from others, and find a consensus between
what we already know, what we want and what we should have (in the case of
ocaml-base64, @hannesm did a look on the PHP implementation and the Go
Indeed, this is a hard question which nobody can answer perfectly in isolation.
So, the story of this update to
ocaml-base64 is an invitation for you to enter
the arcanas of the computer world through MirageOS :) ! Don't be afraid!