Fuzzing OCamlFormat with AFL and Crowbar

by Etienne Millon on Aug 3rd, 2020

AFL (and fuzzing in general) is often used to find bugs in low-level code like parsers, but it also works very well to find bugs in high level code, provided the right ingredients. We applied this technique to feed random programs to OCamlFormat and found many formatting bugs.

OCamlFormat is a tool to format source code. To do so, it parses the source code to an Abstract Syntax Tree (AST) and then applies formatting rules to the AST.

It can be tricky to correctly format the output. For example, say we want to format (a+b)*c. The corresponding AST will look like Apply("*", Apply ("+", Var "a", Var "b"), Var "c"). A naive formatter would look like this:

let rec format = function
  | Var s -> s
  | Apply (op, e1, e2) ->
      Printf.sprintf "%s %s %s" (format e1) op (format e2)

But this is not correct, as it will print (a+b)*c as a+b*c, which is a different program. In this particular case, the common solution would be to track the relative precedence of the expressions and to emit only necessary parentheses.

OCamlFormat has similar cases. To make sure we do not change a program when formatting it, there is an extra check at the end to parse the output and compare the output AST with the input AST. This ensures that, in case of bugs, OCamlFormat exits with an error rather than changing the meaning of the input program.

When we consider the whole OCaml language, the rules are complex and it is difficult to make sure that we are correctly handling all programs. There are two main failure modes: either we put too many parentheses, and the program does not look good, or we do not put enough, and the AST changes (and OCamlFormat exits with an error). We need a way to make sure that the latter does not happen. Tests work to some extent, but some edge cases happen only when a certain combination of language features is used. Because of this combinatorial explosion, it is impossible to get good coverage using tests only.

Fortunately there is a technique we can use to automatically explore the program space: fuzzing. For a primer on using this technique on OCaml programs, one can refer to this article.

To make this work we need two elements: a random program generator, and a property to check. Here, we are interested in programs that are valid (in the sense that they parse correctly) but do not format correctly. We can use the OCamlFormat internals to do the following:

  1. try to parse input: in case of a parse error, just reject this input as

invalid.

  1. otherwise, with have a valid program. try to format it. If this happens with

no error at all, reject this input as well.

  1. otherwise, it means that the AST changed, comments moved, or something

similar, in a valid program. This is what we are after.

Generating random programs is a bit more difficult. We can feed random strings to AFL, but even with a corpus of existing valid code it will generate many invalid programs. We are not interested in these for this project, we would rather start from valid programs.

A good way to do that is to use Crowbar to directly generate AST values. Thanks to ppx_deriving_crowbar and ppx_import it is possible to generate random values for an external type like Parsetree.structure (the contents of .ml files). Even more fortunately somebody already did the work. Thanks, Mindy!

This approach works really well: it generates 5k-10k programs per second, which is very good performance (AFL starts complaining below 100/s).

Quickly, AFL was able to find crashes related to attributes. These are "labels" attached to various nodes of the AST. For example the expression (x || y) [@a] (logical or between x and y, attach attribute a to the "or" expression) would get formatted as x || y [@a] (attribute a is attached to the y variable). Once again, there is a check in place in OCamlFormat to make sure that it does not save the file in this case, but it would exit with an error.

After the fuzzer has run for a bit longer, it found crashes where comments would jump around in expressions like f (*a*) (*bb*) x. Wait, what? We never told the program generator how to generate comments. Inspecting the intermediate AST, the part in the middle is actually an integer literal with value "(*a*) (*bb*)" (integer literals are represented as strings so that a third party library could add literals for arbitrary precision numbers for example).

AFL comes with a program called afl-tmin that is used to minimize a crash. It will try to find a smaller example of a program that crashes OCamlFormat. It works well even with Crowbar in between. For example it is able to turn (new aaaaaa & [0;0;0;0])[@aaaaaaaaaa] into (0&0)[@a] (neither AFL nor OCamlFormat knows about types, so they can operate on nonsensical programs. Finding a well-typed version of a crash is usually not very difficult, but it has to be done manually).

In total, letting AFL run overnight on a single core (that is relatively short in terms of fuzzing) caused 453 crashes. After minimization and deduplication, this corresponded to about 30 unique issues.

Most of them are related to attributes that OCamlFormat did not try to include in the output, or where it forgot to add parentheses. Fortunately, there are safeguards in OCamlFormat: since it checks that the formatting preserves the AST structure, it will exit with an error instead of outputting a different program.

Once again, fuzzing has proved itself as a powerful technique to find actual bugs (including high-level ones). A possible approach for a next iteration is to try to detect more problems during formatting, such as finding cases where lines are longer than allowed. It is also possible to extend the random program generator so that it tries to generate comments, and let OCamlFormat check that they are all laid out correctly in the output. We look forward to employing fuzzing more extensively for OCamlFormat development in future.