Release of Base64

by Romain Calascibetta on Feb 8th, 2019

MirageOS is a library operating system written from the ground up in OCaml. It has an impossible and incredibly huge goal to re-implement all of the world! Looking back at the work accomplished by the MirageOS team, it appears that's what happened for several years. Re-implementing the entire stack, in particular the lower layers that we often take for granted, requires a great attention to detail. While it may seem reasonably easy to implement a given RFC, a huge amount of work is often hidden under the surface.

In this article, we will explain the development process we went through, as we updated a small part of the MirageOS stack: the library ocaml-base64. It's a suitable example as the library is small (few hundreds lines of code), but it needs ongoing development to ensure good quality and to be able to trust it for higher level libraries (like mrmime).

Updating the library was instigated by a problem I ran into with the existing base64 implementation while working on the e-mail stack. Indeed, we got some errors when we tried to compute an encoded-word according to the RFC 2047. So after several years of not being touched, we decided to update ocaml-base64.

The Critique of Pure Reason

The first problem

We started by attempting to use ocaml-base64 on some examples extracted from actual e-mails, and we quickly ran into cases where the library failed. This highlighted that reality is much more complex than you can imagine from reading an RFC. In this situation, what do you do: try to implement a best-effort strategy and continue parsing? Or stick to the letter of the RFC and fail? In the context of e-mails, which has accumulated a lot of baggage over time, you cannot get around implementing a best-effort strategy.

The particular error we were seeing was a Not_found exception when decoding an encoded-word. This exception appeared because the implementation relied on String.contains, and the input contained a character which was not part of the base64 alphabet.

This was the first reason why we thought it necessary to rewrite ocaml-base64. Of course, we could just catch the exception and continue the initial computation, but then another reason appeared.

The second problem

As @clecat and I reviewed RFC 2045, we noticed the following requirement:

The encoded output stream must be represented in lines of no more than 76 characters each.

See RFC 2045, section 6.8

Pretty specific, but general to e-mails, we should never have more than 78 characters per line according to RFC 822, nor more than 998 characters according to RFC 2822.

Having a decoder that abided RFC 2045 more closely, including the requirement above, further spurred us to implement a new decoder.

As part of the new implementation, we decided to implement tests and fuzzers to ensure correctness. This also had the benefit, that we could run the fuzzer on the existing codebase. When fuzzing an encoder/decoder pair, an excellent check is whether the following isomorphism holds:

let iso0 input = assert (decode (encode input) = input)
let iso1 input = assert (encode (decode input) = input)

However, at this point @hannesm ran into another error (see #20).

The third problem

We started to review the nocrypto implementation of base64, which respects our requirements. We had some concerns about the performance of the implementation though, so we decided to see if we would get a performance regression by switching to this implementation.

A quick benchmark based on random input revealed the opposite, however! nocrypto's implementation was faster than ocaml-base64:

ocaml-base64's implementation on bytes (length: 5000): 466 272.34ns
nocrypto's implementation on bytes (length: 5000): 137 406.04ns

Based on all these observations, we thought there was sufficient reason to reconsider the ocaml-base64 implementation. It's also worth mentioning that the last real release (excluding dune/jbuilder/topkg updates) is from Dec. 24 2014. So, it's pretty old code and the OCaml eco-system has improved a lot since 2014.

Implementation & review

We started integrating the nocrypto implementation. Of course, implementing RFC 4648 is not as easy as just reading examples and trying to do something which works. The devil is in the detail.

@hannesm and @cfcs decided to do a big review of expected behavior according to the RFC, and another about implementation and security issues.

Canonicalization

The biggest problem about RFC 4648 is regarding canonical inputs. Indeed, there are cases where two different inputs are associated with the same value:

let a = Base64.decode "Zm9vCg==" ;;
- : string = "foo\n"
let b = Base64.decode "Zm9vCh==" ;;
- : string = "foo\n"

This is mostly because the base64 format encodes the input 6 bits at a time. The result is that 4 base64 encoded bytes are equal to 3 decoded bytes (6 * 4 = 8 * 3). Because of this, 2 base64 encoded bytes provide 1 byte plus 4 bits. What do we need to do with these 4 bits? Nothing.

That's why the last character in our example can be something else than g. g is the canonical byte to indicate using the 2 bits afterward the 6 bits delivered by C (and make a byte - 8 bits). But h can be used where we just need 2 bits at the end.

Due to this behavior, the check used for fuzzing changes: from a canonical input, we should check isomorphism.

Invalid character

As mentioned above ("The first problem"), how should invalid characters be handled? This happens when decoding a byte which is not a part of the base64 alphabet. In the old version, ocaml-base64 would simply leak a Not_found exception from String.contains.

The MirageOS team has taken a stance on exceptions, which is to "use exceptions for exceptional conditions" - invalid input is hardly one of those. This is to avoid any exception leaks, as it can be really hard to track the origin of an exception in a unikernel. Because of this, several packages have been updated to return a result type instead, and we wanted the new implementation to follow suit.

On the other hand, exceptions can be useful when considered as a more constrained form of assembly jump. Of course, they break the control flow, but from a performance point of view, it's interesting to use this trick:

exception Found

let contains str chr =
  let idx = ref 0 in
  let len = String.length str in
  try while !idx < len
      do if String.unsafe_get str !idx = chr then raise Found ; incr idx done ;
      None
  with Found -> Some !idx

This kind of code for example is ~20% faster than String.contains.

As such, exceptions can be a useful tool for performance optimizations, but we need to be extra careful not to expose them to the users of the library. This code needs to be hidden behind a fancy functional interface. With this in mind, we should assert that our decode function never leaks an exception. We'll describe how we've adressed this problem later.

Special cases

RFC 4648 has some detailed cases and while we would sometimes like to work in a perfect world where we will never need to deal with such errors, from our experience, we cannot imagine what the end-user will do to formats, protocols and such.

Even though the RFC has detailed examples, we have to read between lines to know special cases and how to deal with them.

@hannesm noticed one of these cases, where padding (= sign at the end of input) is not mandatory:

The pad character "=" is typically percent-encoded when used in an URI [9], but if the data length is known implicitly, this can be avoided by skipping the padding; see section 3.2.

See RFC 4648, section 5

That mostly means that the following kind of input can be valid:

let a = Base64.decode ~pad:false "Zm9vCg"
- : string = "foo\n"

It's only valid in a specific context though: when length is known implicitly. Only the caller of decode can determine whether the length is implicitly known such that padding can be omitted. To that end, we've added a new optional argument ?pad to the function Base64.decode.

Allocation, sub, ?off and ?len

Xavier Leroy has described the garbage collector in the following way:

You see, the Caml garbage collector is like a god from ancient mythology: mighty, but very irritable. If you mess with it, it'll make you suffer in surprising ways.

That's probably why my experience with improving the allocation policy of (ocaml-git)ocaml-git was quite a nightmare. Allowing the user to control allocation is important for efficiency, and we wanted to ocaml-base64 to be a good citizen.

At the beginning, ocaml-base64 had a very simple API:

val decode : string -> string
val encode : string -> string

This API forces allocations in two ways.

Firstly, if the caller needs to encode a part of a string, this part needs to be extracted, e.g. using String.sub, which will allocate a new string. To avoid this, two new optional arguments have been added to encode: ?off and ?len, which specifies the substring to encode. Here's an example:

(* We want to encode the part 'foo' without prefix or suffix *)

(* Old API -- forces allocation *)
Base64.encode (String.sub "prefix foo suffix" 7 3) ;;
- : string = "Zm9v"

(* New API -- avoids allocation *)
Base64.encode ~off:7 ~len:3 "prefix foo suffix" ;;
- : string = "Zm9v"

Secondly, a new string is allocated to hold the resulting string. We can calculate a bound on the length of this string in the following manner:

let (//) x y =
  if y < 1 then raise Division_by_zero ;
  if x > 0 then 1 + ((x - 1) / y) else 0

let encode input =
  let res = Bytes.create (String.length input // 3 * 4) in
  ...

let decode input =
  let res = Bytes.create (String.length input // 4 * 3) in
  ...

Unfortunately we cannot know the exact length of the result prior to computing it. This forces a call to String.sub at the end of the computation to return a string of the correct length. This means we have two allocations rather than one. To avoid the additional allocation, [@avsm][avsm] proposed to provide a new type sub = string * int * int. This lets the user call String.sub if required (and allocate a new string), or use simply use the returned sub for _blit_ting to another buffer or similar.

Fuzz everything!

There's a strong trend of fuzzing libraries for MirageOS, which is quite easy thanks to the brilliant work by @yomimono and @stedolan! The integrated fuzzing in OCaml builds on American fuzzy lop, which is very smart about discarding paths of execution that have already been tested and generating unseen inputs which break your assumptions. My first experience with fuzzing was with the library decompress, and I was impressed by precise error it found about a name clash.

Earlier in this article, I listed some properties we wanted to check for ocaml-base64:

  • The functions encode and decode should be be isomorphic taking canonicalization into account:
let iso0 input =
  match Base64.decode ~pad:false input with
  | Error _ -> fail ()
  | Ok result0 ->
    let result1 = Base64.encode_exn result0 in
    match Base64.decode ~pad:true result1 with
    | Error _ -> fail ()
    | Ok result2 -> check_eq result0 result2

let iso1 input =
  let result = Base64.encode_exn input in
  match Base64.decode ~pad:true result0 with
  | Error _ -> fail ()
  | Ok result1 ->
    let result2 = Base64.encode_exn result1 in
    check_eq result0 result2
  • The function decode should never raise an exception, but rather return a result type:
let no_exn input =
  try ignore @@ Base64.decode input with _ -> fail ()
  • And finally, we should randomize ?off and ?len arguments to ensure that we don't get an Out_of_bounds exception when accessing input.

Just because we've applied fuzzing to the new implementation for a long time, it doesn't mean that the code is completely infallible. People can use our library in an unimaginable way (and it's mostly what happens in the real world) and get an unknowable error.

But, with the fuzzer, we've managed to test some properties across a very wide range of input instead of unit testing with random (or not so random) inputs from our brains. This development process allows fixing the semantics of implementations (even if it's not a formal definition of semantics), but it's better than nothing or outdated documentation.

Conclusion

Based on our recent update to ocaml-base64, this blog post explains our development process as go about rewriting the world to MirageOS, one bit at a time. There's an important point to be made though:

ocaml-base64 is a small project. Currently, the implementation is about 250 lines of code. So it's a really small project. But as stated in the introduction, we are fortunate enough to push the restart button of the computer world - yes, we want to make a new operating system.

That's a massive task, and we shouldn't make it any harder on ourselves than necessary. As such, we need to justify any step, any line of code, and why we decided to spend our time on any change (why we decided to re-implement git for example). So before committing any time to projects, we try to do a deep analysis of the problem, get feedback from others, and find a consensus between what we already know, what we want and what we should have (in the case of ocaml-base64, @hannesm did a look on the PHP implementation and the Go implementation).

Indeed, this is a hard question which nobody can answer perfectly in isolation. So, the story of this update to ocaml-base64 is an invitation for you to enter the arcanas of the computer world through MirageOS :) ! Don't be afraid!