Release of Base64by Romain Calascibetta on Feb 8th, 2019
MirageOS is a library operating system written from the ground up in OCaml. It has an impossible and incredibly huge goal to re-implement all of the world! Looking back at the work accomplished by the MirageOS team, it appears that's what happened for several years. Re-implementing the entire stack, in particular the lower layers that we often take for granted, requires a great attention to detail. While it may seem reasonably easy to implement a given RFC, a huge amount of work is often hidden under the surface.
In this article, we will explain the development process we went through, as we
updated a small part of the MirageOS stack: the library
ocaml-base64. It's a
suitable example as the library is small (few hundreds lines of code), but it
needs ongoing development to ensure good quality and to be able to trust it for
higher level libraries (like mrmime).
Updating the library was instigated by a problem I ran into with the existing
base64 implementation while working on the e-mail stack. Indeed, we got some
errors when we tried to compute an encoded-word according to the RFC
2047. So after several years of not being touched, we decided to
We started by attempting to use
ocaml-base64 on some examples extracted from
actual e-mails, and we quickly ran into cases where the library failed. This
highlighted that reality is much more complex than you can imagine from reading
an RFC. In this situation, what do you do: try to implement a best-effort
strategy and continue parsing? Or stick to the letter of the RFC and fail? In
the context of e-mails, which has accumulated a lot of baggage over time, you
cannot get around implementing a best-effort strategy.
The particular error we were seeing was a
Not_found exception when decoding an
encoded-word. This exception appeared because the implementation relied on
String.contains, and the input contained a character which was not part of the
This was the first reason why we thought it necessary to rewrite
Of course, we could just catch the exception and continue the initial
computation, but then another reason appeared.
As @clecat and I reviewed RFC 2045, we noticed the following requirement:
The encoded output stream must be represented in lines of no more than 76 characters each.
See RFC 2045, section 6.8
Having a decoder that abided RFC 2045 more closely, including the requirement above, further spurred us to implement a new decoder.
As part of the new implementation, we decided to implement tests and fuzzers to ensure correctness. This also had the benefit, that we could run the fuzzer on the existing codebase. When fuzzing an encoder/decoder pair, an excellent check is whether the following isomorphism holds:
let iso0 input = assert (decode (encode input) = input) let iso1 input = assert (encode (decode input) = input)
We started to review the
nocrypto implementation of base64, which
respects our requirements. We had some concerns about the performance of the
implementation though, so we decided to see if we would get a performance
regression by switching to this implementation.
A quick benchmark based on random input revealed the opposite, however!
nocrypto's implementation was faster than
ocaml-base64's implementation on bytes (length: 5000): 466 272.34ns nocrypto's implementation on bytes (length: 5000): 137 406.04ns
Based on all these observations, we thought there was sufficient reason to
ocaml-base64 implementation. It's also worth mentioning that
the last real release (excluding
topkg updates) is from Dec.
24 2014. So, it's pretty old code and the OCaml eco-system has improved a lot
We started integrating the
nocrypto implementation. Of course, implementing
RFC 4648 is not as easy as just reading examples and trying to do
something which works. The devil is in the detail.
@hannesm and @cfcs decided to do a big review of expected behavior according to the RFC, and another about implementation and security issues.
The biggest problem about RFC 4648 is regarding canonical inputs. Indeed, there are cases where two different inputs are associated with the same value:
let a = Base64.decode "Zm9vCg==" ;; - : string = "foo\n" let b = Base64.decode "Zm9vCh==" ;; - : string = "foo\n"
This is mostly because the base64 format encodes the input 6 bits at a time. The
result is that 4 base64 encoded bytes are equal to 3 decoded bytes (
6 * 4 = 8 * 3). Because of this, 2 base64 encoded bytes provide 1 byte plus 4 bits. What do
we need to do with these 4 bits? Nothing.
That's why the last character in our example can be something else than
is the canonical byte to indicate using the 2 bits afterward the 6 bits
C (and make a byte - 8 bits). But
h can be used where we just
need 2 bits at the end.
Due to this behavior, the check used for fuzzing changes: from a canonical input, we should check isomorphism.
As mentioned above ("The first problem"), how should invalid characters be
handled? This happens when decoding a byte which is not a part of the base64
alphabet. In the old version,
ocaml-base64 would simply leak a
The MirageOS team has taken a stance on exceptions, which is
to "use exceptions for exceptional conditions" - invalid input is hardly one of
those. This is to avoid any exception leaks, as it can be really hard to track
the origin of an exception in a unikernel. Because of this, several packages
have been updated to return a
result type instead, and we wanted the new
implementation to follow suit.
On the other hand, exceptions can be useful when considered as a more constrained form of assembly jump. Of course, they break the control flow, but from a performance point of view, it's interesting to use this trick:
exception Found let contains str chr = let idx = ref 0 in let len = String.length str in try while !idx < len do if String.unsafe_get str !idx = chr then raise Found ; incr idx done ; None with Found -> Some !idx
This kind of code for example is ~20% faster than
As such, exceptions can be a useful tool for performance optimizations, but we
need to be extra careful not to expose them to the users of the library. This
code needs to be hidden behind a fancy functional interface. With this in mind,
we should assert that our
decode function never leaks an exception. We'll
describe how we've adressed this problem later.
RFC 4648 has some detailed cases and while we would sometimes like to work in a perfect world where we will never need to deal with such errors, from our experience, we cannot imagine what the end-user will do to formats, protocols and such.
Even though the RFC has detailed examples, we have to read between lines to know special cases and how to deal with them.
@hannesm noticed one of these cases, where padding (
= sign at the end of
input) is not mandatory:
The pad character "=" is typically percent-encoded when used in an URI , but if the data length is known implicitly, this can be avoided by skipping the padding; see section 3.2.
See RFC 4648, section 5
That mostly means that the following kind of input can be valid:
let a = Base64.decode ~pad:false "Zm9vCg" - : string = "foo\n"
It's only valid in a specific context though: when length is known implicitly.
Only the caller of
decode can determine whether the length is implicitly known
such that padding can be omitted. To that end, we've added a new optional
?pad to the function
Xavier Leroy has described the garbage collector in the following way:
You see, the Caml garbage collector is like a god from ancient mythology: mighty, but very irritable. If you mess with it, it'll make you suffer in surprising ways.
That's probably why my experience with improving the allocation policy of
ocaml-git)ocaml-git was quite a nightmare. Allowing the user to control
allocation is important for efficiency, and we wanted to
ocaml-base64 to be a
At the beginning,
ocaml-base64 had a very simple API:
val decode : string -> string val encode : string -> string
This API forces allocations in two ways.
Firstly, if the caller needs to encode a part of a string, this part needs to be
extracted, e.g. using
String.sub, which will allocate a new string. To avoid
this, two new optional arguments have been added to
which specifies the substring to encode. Here's an example:
(* We want to encode the part 'foo' without prefix or suffix *) (* Old API -- forces allocation *) Base64.encode (String.sub "prefix foo suffix" 7 3) ;; - : string = "Zm9v" (* New API -- avoids allocation *) Base64.encode ~off:7 ~len:3 "prefix foo suffix" ;; - : string = "Zm9v"
Secondly, a new string is allocated to hold the resulting string. We can calculate a bound on the length of this string in the following manner:
let (//) x y = if y < 1 then raise Division_by_zero ; if x > 0 then 1 + ((x - 1) / y) else 0 let encode input = let res = Bytes.create (String.length input // 3 * 4) in ... let decode input = let res = Bytes.create (String.length input // 4 * 3) in ...
Unfortunately we cannot know the exact length of the result prior to computing
it. This forces a call to
String.sub at the end of the computation to return a
string of the correct length. This means we have two allocations rather than
one. To avoid the additional allocation, [@avsm][avsm] proposed to provide a new
sub = string * int * int. This lets the user call
required (and allocate a new string), or use simply use the returned
blitting to another buffer or similar.
There's a strong trend of fuzzing libraries for MirageOS, which is quite easy
thanks to the brilliant work by @yomimono and @stedolan!
The integrated fuzzing in OCaml builds on American fuzzy lop, which is
very smart about discarding paths of execution that have already been tested and
generating unseen inputs which break your assumptions. My first experience with
fuzzing was with the library
decompress, and I was impressed by
precise error it found about a name clash.
Earlier in this article, I listed some properties we wanted to check for
- The functions
decodeshould be be isomorphic taking canonicalization into account:
let iso0 input = match Base64.decode ~pad:false input with | Error _ -> fail () | Ok result0 -> let result1 = Base64.encode_exn result0 in match Base64.decode ~pad:true result1 with | Error _ -> fail () | Ok result2 -> check_eq result0 result2 let iso1 input = let result = Base64.encode_exn input in match Base64.decode ~pad:true result0 with | Error _ -> fail () | Ok result1 -> let result2 = Base64.encode_exn result1 in check_eq result0 result2
- The function
decodeshould never raise an exception, but rather return a result type:
let no_exn input = try ignore @@ Base64.decode input with _ -> fail ()
- And finally, we should randomize
?lenarguments to ensure that we don't get an
Out_of_boundsexception when accessing input.
Just because we've applied fuzzing to the new implementation for a long time, it doesn't mean that the code is completely infallible. People can use our library in an unimaginable way (and it's mostly what happens in the real world) and get an unknowable error.
But, with the fuzzer, we've managed to test some properties across a very wide range of input instead of unit testing with random (or not so random) inputs from our brains. This development process allows fixing the semantics of implementations (even if it's not a formal definition of semantics), but it's better than nothing or outdated documentation.
Based on our recent update to
ocaml-base64, this blog post explains our
development process as go about rewriting the world to MirageOS, one bit at a
time. There's an important point to be made though:
ocaml-base64 is a small project. Currently, the implementation is about 250
lines of code. So it's a really small project. But as stated in the
introduction, we are fortunate enough to push the restart button of the computer
world - yes, we want to make a new operating system.
That's a massive task, and we shouldn't make it any harder on ourselves than
necessary. As such, we need to justify any step, any line of code, and why we
decided to spend our time on any change (why we decided to re-implement
for example). So before committing any time to projects, we try to do a deep
analysis of the problem, get feedback from others, and find a consensus between
what we already know, what we want and what we should have (in the case of
ocaml-base64, @hannesm did a look on the PHP implementation and the Go
Indeed, this is a hard question which nobody can answer perfectly in isolation.
So, the story of this update to
ocaml-base64 is an invitation for you to enter
the arcanas of the computer world through MirageOS :) ! Don't be afraid!