Release of Base64
Software Engineer
MirageOS is a library operating system written from the ground up in OCaml. It has an impossible and incredibly huge goal to re-implement all of the world! Looking back at the work accomplished by the MirageOS team, it appears that's what happened for several years. Re-implementing the entire stack, in particular the lower layers that we often take for granted, requires a great attention to detail. While it may seem reasonably easy to implement a given RFC, a huge amount of work is often hidden under the surface.
In this article, we will explain the development process we went through, as we
updated a small part of the MirageOS stack: the library ocaml-base64
. It's a
suitable example as the library is small (few hundreds lines of code), but it
needs ongoing development to ensure good quality and to be able to trust it for
higher level libraries (like mrmime).
Updating the library was instigated by a problem I ran into with the existing
base64 implementation while working on the e-mail stack. Indeed, we got some
errors when we tried to compute an encoded-word according to the RFC
2047. So after several years of not being touched, we decided to
update ocaml-base64
.
The Critique of Pure Reason
The first problem
We started by attempting to use ocaml-base64
on some examples extracted from
actual e-mails, and we quickly ran into cases where the library failed. This
highlighted that reality is much more complex than you can imagine from reading
an RFC. In this situation, what do you do: try to implement a best-effort
strategy and continue parsing? Or stick to the letter of the RFC and fail? In
the context of e-mails, which has accumulated a lot of baggage over time, you
cannot get around implementing a best-effort strategy.
The particular error we were seeing was a Not_found
exception when decoding an
encoded-word. This exception appeared because the implementation relied on
String.contains
, and the input contained a character which was not part of the
base64 alphabet.
This was the first reason why we thought it necessary to rewrite ocaml-base64
.
Of course, we could just catch the exception and continue the initial
computation, but then another reason appeared.
The second problem
As @clecat and I reviewed RFC 2045, we noticed the following requirement:
The encoded output stream must be represented in lines of no more than 76 characters each.
See RFC 2045, section 6.8
Pretty specific, but general to e-mails, we should never have more than 78 characters per line according to RFC 822, nor more than 998 characters according to RFC 2822.
Having a decoder that abided RFC 2045 more closely, including the requirement above, further spurred us to implement a new decoder.
As part of the new implementation, we decided to implement tests and fuzzers to ensure correctness. This also had the benefit, that we could run the fuzzer on the existing codebase. When fuzzing an encoder/decoder pair, an excellent check is whether the following isomorphism holds:
let iso0 input = assert (decode (encode input) = input)
let iso1 input = assert (encode (decode input) = input)
However, at this point @hannesm ran into another error (see #20).
The third problem
We started to review the nocrypto
implementation of base64, which
respects our requirements. We had some concerns about the performance of the
implementation though, so we decided to see if we would get a performance
regression by switching to this implementation.
A quick benchmark based on random input revealed the opposite, however!
nocrypto
's implementation was faster than ocaml-base64
:
ocaml-base64's implementation on bytes (length: 5000): 466 272.34ns
nocrypto's implementation on bytes (): 137 406.04ns
Based on all these observations, we thought there was sufficient reason to
reconsider the ocaml-base64
implementation. It's also worth mentioning that
the last real release (excluding dune
/jbuilder
/topkg
updates) is from Dec.
24 2014. So, it's pretty old code and the OCaml eco-system has improved a lot
since 2014.
Implementation & review
We started integrating the nocrypto
implementation. Of course, implementing
RFC 4648 is not as easy as just reading examples and trying to do
something which works. The devil is in the detail.
@hannesm and @cfcs decided to do a big review of expected behavior according to the RFC, and another about implementation and security issues.
Canonicalization
The biggest problem about RFC 4648 is regarding canonical inputs. Indeed, there are cases where two different inputs are associated with the same value:
let a = Base64.decode "Zm9vCg==" ;;
- : string = "foo\n"
let b = Base64.decode "Zm9vCh==" ;;
- : string = "foo\n"
This is mostly because the base64 format encodes the input 6 bits at a time. The
result is that 4 base64 encoded bytes are equal to 3 decoded bytes (6 * 4 = 8 * 3
). Because of this, 2 base64 encoded bytes provide 1 byte plus 4 bits. What do
we need to do with these 4 bits? Nothing.
That's why the last character in our example can be something else than g
. g
is the canonical byte to indicate using the 2 bits afterward the 6 bits
delivered by C
(and make a byte - 8 bits). But h
can be used where we just
need 2 bits at the end.
Due to this behavior, the check used for fuzzing changes: from a canonical input, we should check isomorphism.
Invalid character
As mentioned above ("The first problem"), how should invalid characters be
handled? This happens when decoding a byte which is not a part of the base64
alphabet. In the old version, ocaml-base64
would simply leak a Not_found
exception from String.contains
.
The MirageOS team has taken a stance on exceptions, which is
to "use exceptions for exceptional conditions" - invalid input is hardly one of
those. This is to avoid any exception leaks, as it can be really hard to track
the origin of an exception in a unikernel. Because of this, several packages
have been updated to return a result
type instead, and we wanted the new
implementation to follow suit.
On the other hand, exceptions can be useful when considered as a more constrained form of assembly jump. Of course, they break the control flow, but from a performance point of view, it's interesting to use this trick:
exception Found
let contains str chr =
let idx = ref 0 in
let len = String.length str in
try while !idx < len
do if String.unsafe_get str !idx = chr then raise Found ; incr idx done ;
None
with Found -> Some !idx
This kind of code for example is ~20% faster than String.contains
.
As such, exceptions can be a useful tool for performance optimizations, but we
need to be extra careful not to expose them to the users of the library. This
code needs to be hidden behind a fancy functional interface. With this in mind,
we should assert that our decode
function never leaks an exception. We'll
describe how we've adressed this problem later.
Special cases
RFC 4648 has some detailed cases and while we would sometimes like to work in a perfect world where we will never need to deal with such errors, from our experience, we cannot imagine what the end-user will do to formats, protocols and such.
Even though the RFC has detailed examples, we have to read between lines to know special cases and how to deal with them.
@hannesm noticed one of these cases, where padding (=
sign at the end of
input) is not mandatory:
The pad character "=" is typically percent-encoded when used in an URI [9], but if the data length is known implicitly, this can be avoided by skipping the padding; see section 3.2.
See RFC 4648, section 5
That mostly means that the following kind of input can be valid:
let a = Base64.decode ~pad:false "Zm9vCg"
- : string = "foo\n"
It's only valid in a specific context though: when length is known implicitly.
Only the caller of decode
can determine whether the length is implicitly known
such that padding can be omitted. To that end, we've added a new optional
argument ?pad
to the function Base64.decode
.
Allocation, sub
, ?off
and ?len
Xavier Leroy has described the garbage collector in the following way:
You see, the Caml garbage collector is like a god from ancient mythology: mighty, but very irritable. If you mess with it, it'll make you suffer in surprising ways.
That's probably why my experience with improving the allocation policy of
(ocaml-git
)ocaml-git was quite a nightmare. Allowing the user to control
allocation is important for efficiency, and we wanted to ocaml-base64
to be a
good citizen.
At the beginning, ocaml-base64
had a very simple API:
val decode : string -> string
val encode : string -> string
This API forces allocations in two ways.
Firstly, if the caller needs to encode a part of a string, this part needs to be
extracted, e.g. using String.sub
, which will allocate a new string. To avoid
this, two new optional arguments have been added to encode
: ?off
and ?len
,
which specifies the substring to encode. Here's an example:
(* We want to encode the part 'foo' without prefix or suffix *)
(* Old API -- forces allocation *)
Base64.encode (String.sub "prefix foo suffix" 7 3) ;;
- : string = "Zm9v"
(* New API -- avoids allocation *)
Base64.encode ~off:7 ~len:3 "prefix foo suffix" ;;
- : string = "Zm9v"
Secondly, a new string is allocated to hold the resulting string. We can calculate a bound on the length of this string in the following manner:
let (//) x y =
if y < 1 then raise Division_by_zero ;
if x > 0 then 1 + ((x - 1) / y) else 0
let encode input =
let res = Bytes.create (String.length input // 3 * 4) in
...
let decode input =
let res = Bytes.create (String.length input // 4 * 3) in
...
Unfortunately we cannot know the exact length of the result prior to computing
it. This forces a call to String.sub
at the end of the computation to return a
string of the correct length. This means we have two allocations rather than
one. To avoid the additional allocation, [@avsm][avsm] proposed to provide a new
type sub = string * int * int
. This lets the user call String.sub
if
required (and allocate a new string), or use simply use the returned sub
for
_blit_ting to another buffer or similar.
Fuzz everything!
There's a strong trend of fuzzing libraries for MirageOS, which is quite easy
thanks to the brilliant work by @yomimono and @stedolan!
The integrated fuzzing in OCaml builds on American fuzzy lop, which is
very smart about discarding paths of execution that have already been tested and
generating unseen inputs which break your assumptions. My first experience with
fuzzing was with the library decompress
, and I was impressed by
precise error it found about a name clash.
Earlier in this article, I listed some properties we wanted to check for
ocaml-base64
:
- The functions
encode
anddecode
should be be isomorphic taking canonicalization into account:
let iso0 input =
match Base64.decode ~pad:false input with
| Error _ -> fail ()
| Ok result0 ->
let result1 = Base64.encode_exn result0 in
match Base64.decode ~pad:true result1 with
| Error _ -> fail ()
| Ok result2 -> check_eq result0 result2
let iso1 input =
let result = Base64.encode_exn input in
match Base64.decode ~pad:true result0 with
| Error _ -> fail ()
| Ok result1 ->
let result2 = Base64.encode_exn result1 in
check_eq result0 result2
- The function
decode
should never raise an exception, but rather return a result type:
let no_exn input =
try ignore @@ Base64.decode input with _ -> fail ()
- And finally, we should randomize
?off
and?len
arguments to ensure that we don't get anOut_of_bounds
exception when accessing input.
Just because we've applied fuzzing to the new implementation for a long time, it doesn't mean that the code is completely infallible. People can use our library in an unimaginable way (and it's mostly what happens in the real world) and get an unknowable error.
But, with the fuzzer, we've managed to test some properties across a very wide range of input instead of unit testing with random (or not so random) inputs from our brains. This development process allows fixing the semantics of implementations (even if it's not a formal definition of semantics), but it's better than nothing or outdated documentation.
Conclusion
Based on our recent update to ocaml-base64
, this blog post explains our
development process as go about rewriting the world to MirageOS, one bit at a
time. There's an important point to be made though:
ocaml-base64
is a small project. Currently, the implementation is about 250
lines of code. So it's a really small project. But as stated in the
introduction, we are fortunate enough to push the restart button of the computer
world - yes, we want to make a new operating system.
That's a massive task, and we shouldn't make it any harder on ourselves than
necessary. As such, we need to justify any step, any line of code, and why we
decided to spend our time on any change (why we decided to re-implement git
for example). So before committing any time to projects, we try to do a deep
analysis of the problem, get feedback from others, and find a consensus between
what we already know, what we want and what we should have (in the case of
ocaml-base64
, @hannesm did a look on the PHP implementation and the Go
implementation).
Indeed, this is a hard question which nobody can answer perfectly in isolation.
So, the story of this update to ocaml-base64
is an invitation for you to enter
the arcanas of the computer world through MirageOS :) ! Don't be afraid!
Open-Source Development
Tarides champions open-source development. We create and maintain key features of the OCaml language in collaboration with the OCaml community. To learn more about how you can support our open-source work, discover our page on GitHub.
Stay Updated on OCaml and MirageOS!
Subscribe to our mailing list to receive the latest news from Tarides.
By signing up, you agree to receive emails from Tarides. You can unsubscribe at any time.