An introduction to OCaml PPX ecosystem
Senior Software Engineer
These last few months, I spent some time writing new OCaml PPX rewriters or contributing to existing ones. It's a really fun experience. Toying around with the AST taught me a lot about a language I thought I knew really well. Turns out I actually had no idea what I was doing all these years.
All jokes aside, I was surprised that the most helpful tricks I learned while writing PPX rewriters weren't properly documented. There already exist a few very good introduction articles on the subject, like that 2014's article from Whitequark, this more recent one from Rudi Grinberg or even this last one from Victor Darvariu I only discovered after I actually started writing my own. I still felt like they were slightly outdated or weren't answering all the questions I had when I started playing with PPX and writing my first rewriters.
I decided to share my PPX adventures in the hope that it can help others familiarize with this bit of the OCaml ecosystem and eventually write their first rewriters. The scope of this article is not to cover every single detail about the PPX internals but just to give a gentle introduction to beginners to help them get settled. That also means I might omit things that I don't think are worth mentioning or that might confuse the targetted audience but feel free to comment if you believe this article missed an important point.
It's worth mentioning that a lot of the nice tricks mentioned in these lines were given to me by a wonderful human being called Étienne Millon, thanks Étienne!
What is a PPX?
PPX rewriters or PPX-es are preprocessors that are applied to your code before passing it on to the compiler. They don't operate on your code directly but on the Abstract Syntax Tree or AST resulting from its parsing. That means that they can only be applied to syntactically correct OCaml code. You can think of them as functions that take an AST and return a new AST.
That means that in theory you can do a lot of things with a PPX, including pretty bad and cryptic
things. You could for example replace every instance of true
by false
, swap the branches of any
if-then-else
or randomize the order of every pattern-matching case.
Obviously that's not the kind of behaviour that we want as it would make it impossible to
understand the code since it would be so far from the actual AST the compiler would get.
In practice PPX-es have a well defined scope and only transform parts you explicitly annotated.
Understanding the OCaml AST
First things first, what is an AST. An AST is an abstract representation of your code. As the name
suggests it has a tree-like structure where the root describes your entire file. It has children for
each bits such as a function declaration or a type definition, each of them having their own
children, for example for the function name, its argument and its body and that goes on until you
reach a leaf such as a literal 1
, "abc"
or a variable for instance.
In the case of OCaml it's a set of recursive types allowing us to represent OCaml code as an OCaml
value. This value is what the parser passes to the compiler so it can type check and compile it to
native or byte code.
Those types are defined in OCaml's Parsetree
module. The entry points there are the structure
type which describes the content of an .ml
file and the signature
type which describes the
content of an .mli
file.
As mentionned above, a PPX can be seen as a function that transforms an AST. Writing a PPX thus requires you to understand the AST, both to interpret the one you'll get as input and to produce the right one as output. This is probably the trickiest part as unless you've already worked on the OCaml compiler or written a PPX rewriter, that will probably be the first time you two meet. Chances are also high that'll be a pretty bad first date and you will need some to time to get to know each other.
The Parsetree
module documentation,
is a good place to start. The above mentioned structure
and signature
types are at the root of
the AST but some other useful types to look at at first are:
expression
which describes anything in OCaml that evaluates to a value, the right hand side of alet
binding for instance.pattern
which is what you use to deconstruct an OCaml value, the left hand side of alet
binding or a pattern-matching case for example.core_type
which describes type expressions ie what you would find on the right hand side of a value description in a.mli
, ieval f : <what_goes_there>
.structure_item
andsignature_item
which describe the top level AST nodes you can find in astructure
orsignature
such as type definitions, value or module declarations.
Thing is, it's a bit a rough and there's no detailed explanation about how a specific bit of code is
represented, just type definitions. Most of the time, the type, field, and variant names are
self-explanatory but it can get harder with some of the more advanced language features.
It turns out there are plenty of comments that are really helpful in the actual parsetree.mli
file
and that aren't part of the generated documentation. You can find them on
github but I personally prefer to
have it opened in a VIM tab when I work on a PPX so I usually open
~/.opam/<current_working_switch>/lib/ocaml/compiler-libs/parsetree.mli
.
This works well while exploring but you might also want a more straightforward approach to
discovering what the AST representation is for some specific OCaml code. The
ppx_tools
opam package comes with a dumpast
binary
that pretty prints the AST for any given piece of valid OCaml code. You can install it using opam:
$opam install ppx_tools
and then run it using ocamlfind
:
$ocamlfind ppx_tools/dumpast some_file.ml
You can use it on .ml
and .mli
files or to quickly get the AST for an expression with the -e
option:
$ocamlfind ppx_tools/dumpast -e "1 + 1"
Similarly, you can use the -t
or -p
options to respectively pretty print ASTs from type
expressions or patterns.
Using dumpast
to get both the ASTs of a piece of code using your future PPX and the resulting
preprocessed code is a good way to start and will help you figure out what are the steps required to
get there.
Note that you can use the compiler or utop
have a similar feature with the -dparsetree
flag.
Running ocamlc/ocamlopt -dparsetree file.ml
will pretty print the AST of the given file while
running utop -dparsetree
will pretty print the AST of the evaluated code alongside it's
evaluation.
I tend to prefer the pretty printed AST from dumpast
but any of these tools will prove helpful
in understanding the AST representation of a given piece of OCaml code.
Language extensions interpreted by PPX-es
OCaml 4.02 introduced syntax extensions meant to be used by external tools such as PPX-es. Knowing their syntax and meaning is important to understand how most of the existing rewriters work because they usually look for those language extensions in the AST to know which part of it they need to modify.
The two language extensions we're interested in here are extension nodes and attributes. They are defined in detail in the OCaml manual (see the attributes and extension nodes sections) but I'll try to give a good summary here.
Extension nodes are used in place of expressions, module expressions, patterns, type expressions or
module type expressions. Their syntax is [%extension_name payload]
. We'll come back to the payload
part a little later.
You can also find extension nodes at the top level of modules or module signatures with the syntax
[%%extension_name payload]
.
Hopefully the following cheatsheet can help you remember the basics of how and where you can use
them:
type t =
{ a : int
; b : [%ext pl]
}
let x =
match 1 with
| 0 -> [%ext pl]
| [%ext pl] -> true
[%%ext pl]
Because extension nodes stand where regular AST nodes should, the compiler won't accept them and
will give you an Uninterpreted extension
error. Extension nodes have to be expanded by a PPX for
your code to compile.
Attributes are slightly different although their syntax is very close to extensions. Attributes
are attached to existing AST nodes instead of replacing them. That means that they don't necessarily
need to be transformed and the compiler will ignore unknown attributes by default.
They can come with a payload just like extensions and use @
instead of %
. The number of @
preceding the attribute name specifies which kind of node they are attached to:
let a = 12 [@attr pl]
let b = "some string" [@@attr pl]
[@@@attr pl]
In the first example, the attribute is attached to the expression 12
while in the second example
it is attached to the whole let b = "some string"
value binding. The third one is of a slightly
different nature as it is a floating attribute. It's not attached to anything per-se and just ends
up in the AST as a structure item.
Because there is a wide variety of nodes to which you can attach attributes, I won't go too far into
details here but a good rule of thumb is that you use @@
attributes when you want them attached to
structure or signature items, for anything deeper within the AST structure such as patterns,
expressions or core types, use the single @
syntax. Looking at the Parsetree
documentation can
help you figure out where you can find attributes.
Now let's talk about those payloads I mentioned earlier. You can think of them as "arguments" to the extension points and attributes. You can pass different kinds of arguments and the syntax varies for each of them:
let a = [%ext expr_or_str_item]
let b = [%ext: type_expr_or_sig_item]
let c = [%ext? pattern]
As suggested in the examples, you can pass expressions or structure items using a space character,
type expressions or signature items (anything you'd find at the top level of a module signature)
using a :
or a pattern using a ?
.
Attributes' payload use the same syntax:
let a = 'a' [@attr expr_or_str_item]
let b = 'b' [@attr: type_expr_or_sig_item]
let a = 'a' [@attr? pattern]
Some PPX-es rely on other language extensions such as the suffix character you can attach to int
and float
literals (10z
could be used by a PPX to turn it into Z.of_string "10"
for instance)
or quoted strings with a specific identifier ({ppx_name|some quoted string|ppx_name}
can be used
if you want your PPX to operate on arbitrary strings and not only syntactically correct OCaml) but
attributes and extensions are the most commonly used ones.
Attributes and extension points can be expressed using an infix syntax. The attribute version is barely used but some forms of the infix syntax for extension points are used by popular PPX-es and it is likely you will encounter some of the following:
let infix_let_extension =
let%ext x = 2 in
...
let infix_match_extension =
match%ext y with ...
let infix_try_extension =
try%ext f z with _ -> ...
which are syntactic sugar for:
let infix_let_extension =
[%ext let x = 2 in ...]
let infix_match_extension =
[%ext match y with ...]
let infix_try_extension =
[%ext try f z with _ -> ...]
A good example of a PPX making heavy use of these if
lwt_ppx
. The OCaml manual also contains more examples
of the infix syntax in the Attributes and Extension points sections mentioned above.
The two main kind of PPX-es
There is a wide variety of PPX rewriters but the ones you'll probably see the most are Extensions and Derivers.
Extensions
Extensions will rewrite tagged parts of the AST, usually extension nodes of the form
[%<extension_name> payload]
. They will replace them with a different AST node of the same nature ie
if the extension point was located where an expression should be, the rewriter will produce an
expression. Good examples of extensions are:
ppx_getenv2
which replaces[%getenv SOME_VAR]
with the value of the environment variableSOME_VAR
at compile time.ppx_yojson
which allows you to writeYojson
values using OCaml syntax to mimic actual json. For instance you'd use[%yojson {a = None; b = 1}]
to represent{"a": null, "b": 1}
instead of theYojson
's notation:Assoc [("a", Null); ("b", Int 1)]
.
Derivers
Derivers or deriving plugins will "insert" new nodes derived from type definitions annotated with a
[@@deriving <deriver_name>]
attribute. They have various applications but are particularly useful
to derive functions that are tedious and error prone to write by hand such as comparison functions,
pretty printers or serializers. It's really convenient as you don't have to update those functions
every time you update your type definitions. They were inspired by Haskell Type classes. Good
examples of derivers are:
ppx_deriving
itself comes with a bunch of deriving plugins such aseq
,ord
orshow
which respectively derives, as you might have guessed, equality, comparison and pretty-printing functions.ppx_deriving_yojson
which derives JSON serializers and deserializers.ppx_sexp_conv
which derives s-expressions converters.
Derivers often let you attach attributes to specify how some parts of the AST should be handled. For
example when using ppx_deriving_yojson
you can use [@default some_val]
to make a field of an
object optional:
type t =
{ a: int
; b: string [@default ""]
}
[@@deriving of_yojson]
will derive a deserializer that will convert the JSON value {"a": 1}
to the OCaml
{a = 1; b = ""}
How to write a PPX using ppxlib
Historically there was a few libraries used by PPX rewriter authors to write their PPX-es, including
ppx_tools
and ppx_deriving
but as the eco-system evolved, ppxlib
emerged and is now the most
up-to-date and maintained library to write and handle PPX-es. It wraps the features of those
libraries in a single one.
I encourage you to use ppxlib
to write new PPX-es as it is also easier to make various rewriters
work together if they are all registered through ppxlib
and the PPX ecosystem would gain from
being unified around a single PPX library and driver.
It is also a great library and has some really powerful features to help you write your extensions and derivers.
Writing an extension
The entry point of ppxlib
for extensions is Ppxlib.Extension.declare
. You have to use that
function to build an Extension.t
, from which you can then build a Context_free.Rule.t
before
registering your transformation so it's actually applied.
The typical my_ppx_extension.ml
will look like:
open Ppxlib
let extension =
Extension.declare
"my_extension"
some_context
some_pattern
expand_function
let rule = Context_free.Rule.extension extension
let () =
Driver.register_transformation ~rules:[rule] "my_transformation"
To compile it as PPX rewriter you'll need to put the following in your dune file:
(library
(public_name my_ppx)
(kind ppx_rewriter)
(libraries ppxlib))
Now let's go back a little and look at the important part:
let extension =
Extension.declare
"my_extension"
some_context
some_pattern
expand_function
Here "my_extension"
is the name of your extension and that define how you're going to invoke it
in your extension point. In other words, to use this extension in our code we'll use a
[%my_extension ...]
extension point.
some_context
is a Ppxlib.Extension.Context.t
and describes where this extension can be found in
the AST, ie can you use [%my_extension ...]
as an expression, a pattern, a core type. The
Ppxlib.Extension.Context
module defines a constant for each possible extension context which you
can pass as some_context
.
This obviously means that it also describes the type of AST node to which it must be converted and
this property is actually enforced by the some_pattern
argument. But we'll come back to that
later.
Finally expand_function
is our actual extension implementation, which basically takes the payload,
a loc
argument which contains the location of the expanded extension point, a path
argument
which is the fully qualified path to the expanded node (eg. "file.ml.A.B"
) and returns the
generated code to replace the extension with.
Ast_pattern
Now let's get back to that some_pattern
argument.
This is one of the trickiest parts of ppxlib
to understand but it's also one its most
powerful features. The type for Ast_pattern
is defined as ('a, 'b, 'c) t
where 'a
is
the type of AST nodes that are matched, 'b
is the type of the values you're extracting from the
node as a function type and 'c
is the return type of that last function. This sounded really
confusing to me at first and I'm guessing it might do to some of you too so let's give it a bit of
context.
Let's look at the type of Extension.declare
:
val declare :
string ->
'context Context.t ->
(payload, 'a, 'context) Ast_pattern.t ->
(loc:Location.t -> path:string -> 'a) ->
t
Here, the expected pattern first type parameter is payload
which means we want a pattern that
matches payload
AST nodes. That makes perfect sense since it is used to describe what your
extension's payload should look like and what to do with it.
The last type parameter is 'context
which again seems logical. As I mentioned earlier our
expand_function
should return the same kind of node as the one where the extension was found.
Now what about 'a
. As you can see, it describes what comes after the base loc
and path
parameters of our expand_function
. From the pattern point of view, 'a
describes the parts of the
matched AST node we wish to extract for later consumption, here by our expander.
Ast_pattern
contains a whole bunch of combinators to let you describe what your pattern should match
and a specific __
pattern that you must use to capture the various parts of the matched nodes.
__
has type ('a, 'a -> 'b, 'b) Ast_pattern.t
which means that whenever it's used it changes the
type of consumer function in the returned pattern.
Let's consider a few examples to try wrapping our heads around this. Say I want to write an extension that takes an expression as a payload and I want to pass this expression to my expander so I can generate code based on its value. I can declare the extension like this:
let extension =
Extension.declare
"my_extension"
Extension.Context.expression
Ast_pattern.(single_expr_payload __)
expand_function
In this example, Extension.Context.expression
has type expression Extension.Context.t
, the
pattern has type (payload, expression -> expression, expression) Ast_pattern.t
. The pattern says we
want to allow a single expression in the payload and capture it. If we decompose it a bit, we can
see that single_expr_payload
has type
(expression, 'a, 'b) Ast_pattern.t -> (payload, 'a, 'b) Ast_pattern.t
and is passed __
which
makes it a (expression, expression -> 'b, 'b) Ast_pattern.t
and that's exactly what we want here
as our expander will have type loc: Location.t -> path: string -> expression -> expression
!
It works similarly to Scanf.scanf
when you think about it. Changing the pattern changes the type of the
consumer function the same way changing the format string does for Scanf
functions.
This was a bit easy since we had a custom combinator just for that purpose so let's take a few more
complex examples. Now say we want to only allow pairs of integer and string constants expressions in
our payload. Instead of just capturing any expression and dealing with the error cases in the
expand_function
we can let Ast_pattern
deal with that and pass an int
and string
along to
our expander:
Ast_pattern.(single_expr_payload (pexp_tuple ((eint __)^::(estring __)^::nil)))
This one's a bit more elaborate but the idea is the same, we use __
to capture the int and string
from the expression and use combinators to specify that the payload should be made of a pair and
that gives us a: (payload, int -> string -> 'a, 'a) Ast_pattern.t
which should be used with a
loc: Location.t -> path: string -> int -> string -> expression
expander.
We can also specify that our extension should take something else than an expression as a payload,
say a pattern with no when
clause so that it's applied as [%my_ext? some_pattern_payload]
:
Ast_pattern.(ppat __ none)
or no payload at all and it should just be invoked as [%my_ext]
:
Ast_pattern.(pstr nil)
You should play with Ast_pattern
a bit if you need to express complex patterns as I think it's
the only way to get the hang of it.
Writing a deriver
Registering a deriver is slightly different from registering an extension but in the end it remains
relatively simple and you will still have to provide the actual implementation in the form of an
expand
function.
The typical my_ppx_deriver.ml
will look like:
open Ppxlib
let str_type_decl_generator =
Deriving.Generator.make_no_arg
~attributes
expand_str
let sig_type_decl_generator =
Deriving.Generator.make_no_arg
~attributes
expand_sig
let my_deriver =
Deriving.add
~str_type_decl:str_type_decl_generator
~sig_type_decl:sig_type_decl_generator
"my_deriver"
Which you'll need to compile with the following library
stanza:
(library
(public_name my_ppx)
(kind ppx_deriver)
(libraries ppxlib))
The Deriving.add
function is declared as:
val add
: ?str_type_decl:(structure, rec_flag * type_declaration list) Generator.t
-> ?str_type_ext :(structure, type_extension ) Generator.t
-> ?str_exception:(structure, extension_constructor ) Generator.t
-> ?sig_type_decl:(signature, rec_flag * type_declaration list) Generator.t
-> ?sig_type_ext :(signature, type_extension ) Generator.t
-> ?sig_exception:(signature, extension_constructor ) Generator.t
-> ?extension:(loc:Location.t -> path:string -> core_type -> expression)
-> string
-> t
It takes a mandatory string argument, here "my_deriver"
, which defines how
user are going to invoke your deriver. In this case we'd need to add a [@@deriving my_deriver]
to
a type declaration in a structure or a signature to use it.
Then there's just one optional argument per kind of node to which you can attach a [@@deriving ...]
attribute. type_decl
correspond to type = ...
, type_ext
to type += ...
and exception
to
exception My_exc of ...
.
You need to provide generators for the ones you wish your deriver to handle, ppxlib
will make sure users get a compile error if they try to use it elsewhere.
We can ignore the extension
as it's just here for compatibility with ppx_deriving
.
Now let's take a look at Generator
. Its type is defined as ('output_ast, 'input_ast) t
where
'input_ast
is the type of the node to which the [@@deriving ...]
is attached and 'output_ast
the type of the nodes it should produce, ie either a structure
or a signature
. The type of a
generator depends on the expand function it's built from when you use the smart constructor
make_no_arg
meaning the expand function should have type
loc: Location.t -> path: string -> 'input_ast -> 'output_ast
. This function is the actual
implementation of your deriver and will generate the list of structure_item
or signature_item
from the type declaration.
Compatibility with ppx_import
ppx_import
is a PPX rewriter that lets you import type
definitions and spares you the need to copy and update them every time they change upstream. The
main reason why you would want to do that is because you need to derive values from those types
using a deriver thus the importance of ensuring your deriving plugin is compatible.
Let's take an example to illustrate how ppx_import
is used. I'm using a library called blob
which exposes a type Blob.t
. For some reason I need to be able to serialize and deserialize
Blob.t
values to JSON. I'd like to use a deriver to do that as I don't want to maintain that code
myself. Imagine Blob.t
is defined as:
type t =
{ value : string
; length : int
; id : int
}
Without ppx_import
I would define somewhere a serializable_blob
type as follows:
type serializable_blob = Blob.t =
{ value : string
; length : int
; id : int
}
[@@deriving yojson]
That works well especially because the type definition is simple but I don't really care about
having it here, what I really want is just the to_yojson
and of_yojson
functions. Also now, if
the type definition changes, I have to update it here manually. Maintaining many such imports can be
tedious and duplicates a lot of code unnecessarily.
What I can do instead, thanks to ppx_import
is to write it like this:
type serializable_blob = [%import: Blob.t]
[@@deriving yojson]
which will ultimately be expanded into the above using Blob
's definition of the type t
.
Now ppx_import
works a bit differently from regular PPX rewriters as it needs a bit more information
than just the AST. We don't need to understand how it works but what it means is that if your
deriving plugin is used with ppx_import
, it will be called twice:
- A first time with
ocamldep
. This is required to determine the dependencies of a module in terms of other OCaml modules. PPX-es need to be applied here to find out about dependencies they may introduce. - A second time before actually compiling the code.
The issue here is that during the ocamldep
pass, ppx_import
doesn't have the information it
needs to import the type definition yet so it can't copy it and it expands:
type u = [%import A.t]
into:
type u = A.t
Only during the second pass will it actually expand it to the copied type definition.
This may be a concern if your deriving plugin can't apply to abstract types because you will probably raise an error when encountering one, meaning the first phase will fail and the whole compilation will fail without giving your rewriter a chance to derive anything from the copied type definition.
The right way to deal with this is to have different a behaviour in the context of ocamldep
.
In this case you can ignore such type declaration or eventually, if you know you are going to
inject new dependencies in your generated code, to create dummy values referencing them and just
behave normally in any other context.
ppxlib
versions 0.6.0
and higher allow you to do so through the Deriving.Generator.V2
API
which passes an abstract ctxt
value to your expand
function instead of a loc
and a path
.
You can tell whether it is the ocamldep
pass from within the expand
function like this:
open Ppxlib
let expand ~ctxt input_ast =
let omp_config = Expansion_context.Deriver.omp_config ctxt in
let is_ocamldep_pass = String.equal "ocamldep" omp_config.Migrate_parsetree.Driver.tool_name in
...
Deriver attributes
You'll have noted the attributes
parameter in the examples. It's an optional parameter that lets
you define which attributes your deriver allows the user to attach to various bits of the type,
type extension or exception declaration it is applied to.
ppxlib
comes with a Attribute
module that lets you to properly declare the attributes you want
to allow and make sure they are properly used: correctly spelled, placed and with the right
payload attached. This is especially useful since attributes are by default ignored by the compiler
meaning without ppxlib
's care, plugin users wouldn't get any errors if they misused an attribute
and it might take them a while to figure out they got it wrong and the generated code wasn't
impacted as they hoped.
The Attribute
module offers another great feature: Attribute.t
values can be used to extract the
attribute payload from an AST node if it is present. That will spare you the need for
inspecting attributes yourself which can prove quite tedious.
Ppxlib.Attribute.t
is defined as ('context, 'payload) t
where 'context
describes to which node
the attribute can be attached and 'payload
, the type of its payload.
To build such an attribute you must use Ppxlib.Attribute.declare
:
val declare
: string
-> 'a Context.t
-> (payload, 'b, 'c) Ast_pattern.t
-> 'b
-> ('a, 'c) t
Let's try to declare the default
argument from ppx_deriving_yojson
I mentioned earlier.
The first string
argument is the attribute name. ppxlib
support namespaces for the attributes so
that users can avoid conflicting attributes between various derivers applied to the same type
definitions. For instance here we could use "default"
. It can prove helpful to use more qualified
name such as "ppx_deriving_yojson.of_yojson.default"
. That means that our attribute can be used as
[@@default ...]
, [@@of_yojson.default ...]
or [@@ppx_deriving.of_yojson.default ...]
.
Now if another deriver uses a [@@default ...]
, users can apply both derivers and provide different
default
values to the different derivers by writing:
type t =
{ a : int
; b : string [@make.default "abc"] [@of_yojson.default ""]
}
[@@deriving make,of_yojson]
The context argument works very similarly to the one in Extension.declare
. Here we want the
attribute to be attached to record field declarations so we'll use
Attribute.Context.label_declaration
which has type label_declaration Attribute.Context.t
.
The pattern argument is an Ast_pattern.t
. Now that we know how to work with those this is pretty
easy. Here we need to accept any expression as a payload since we should be able to apply the
default
attribute to any field, regardless of its type and we want to extract that expression from
the payload so we can use it in our deserializer so let's use
Ast_pattern.(single_expr_payload __)
.
Finally the last 'b
argument has the same type as the pattern consumer function. We can use it to
transform what we extracted using the previous Ast_pattern
but in this case we just want to
keep the expression as we got it so we'll just use the identity function here.
We end up with the following:
let default_attribute =
Attribute.declare
"ppx_deriving_yojson.of_yojson.default"
Attribute.Context.label_declaration
Ast_pattern.(single_expr_payload __)
(fun expr -> expr)
and that gives us a (label_declaration, expression) Attribute.t
.
You can then use it to collect the attribute payload from a label_declaration:
Attribute.get default_attribute label_decl
which will return Some expr
if the attribute was attached to label_decl
or None
otherwise.
Because of their polymorphic nature, attributes need to be packed, ie to be wrapped with a variant
to hide the type parameter, so if you want to pass it to Generator.make_no_arg
you'll have to do
it like this:
let attributes = [Attribute.T default_attribute]
Writing your expand functions
In the two last sections I mentioned expand
functions that would contain the actual deriver
or
extension
implementation but didn't actually said anything about how to write those. It will
depend a lot on the purpose of your PPX rewriter and what you're trying to achieve.
Before writing your PPX you should clearly specify what it should be applied to and what code it
should produce. That will help you declaring the right deriving or extension rewriter and from there
you'll know the type of the expand
functions you have to write which should help.
A good way to proceed is to use the dumpast
tool to pretty print the AST fragments of both the
input of your expander and the output, ie the code it should generate. To take a concrete example,
say you want to write a deriving plugin that generates an equal
function from a type definition.
You can start by running dumpast
on the following file:
type some_record =
{ a : int64
; b : string
}
let equal_some_record r r' = Int64.equal r.a r'.a && String.equal r.b r'.b
That will give you the AST representation of a record type definition and the equal function you want to write so you can figure out how to deconstruct your expander's input to be able to generate the right output.
ppxlib
exposes smart constructors in Ppxlib.Ast_builder.Default
to help you build AST fragments
without having to care too much attributes and such fields as well as some convenience constructors
to keep your code concise and readable.
Another convenience tool ppxlib
exposes to help you build AST fragments is metaquot
. I recently
wrote a bit of documentation about it
here which you
should take a look at but to sum it up metaquot
is a PPX extension allowing you to write AST nodes
using the OCaml syntax they describe instead of the AST types.
Handling code locations in a PPX rewriter
When building AST fragments you should keep in mind that you have to set their location
. Locations
are part of the AST values that describes the position of the corresponding node in your source
file, including the file name and the line number and offset of both the beginning and the end the
code bit they represent.
Because your code was generated after the file was parsed, it doesn't have a location so you need to set it yourself. One could think that it doesn't matter and we could use a dummy location but locations are used by the compiler to properly report errors and that's why a PPX rewriter should care about how it locates the generated code as it will help the end user to understand whether the error comes from their code or generated code and how to eventually fix it.
Both Ast_builder
and metaquot
expect a location. The first explicitly takes it as a labelled
loc
argument while the second relies on a loc
value being available in the scope. It is
important to set those with care as errors in the generated code doesn't necessarily mean that your
rewriter is bugged. There are valid cases where your rewriter functioned as intended but the generated
code triggers an error. PPX-es often work on the assumption that some values are available in the
scope, if the user doesn't properly provide those it's their responsibility to fix the error. To
help them do so, it is important to properly locate the generated code to guide them as much as
possible.
When writing extensions, using the whole extension point location for the generated code makes
perfect sense as that's where the code will sit. That's fairly easy as this what ppxlib
passes
to the expand function through the loc
labelled argument. For deriving plugins it's a bit different
as the generated code doesn't replace an existing part of the parsed AST but generate a new one to insert.
Currently ppxlib
gives you the loc
of the whole type declaration, extension or exception
declaration your deriving plugin is applied to. Ideally it would be nice to be able to locate the
generated code on the plugin name in the deriving
attribute payload, ie here:
[@@deriving my_plugin,another_plugin]
^^^^^^^^^
I'm currently working on making that location available to the expand
function. In the meantime,
you should choose a convention. I personally locate all the generated code on the
type declaration. Some choose to locate the generated code on the part of the input AST they're
handling when generating it.
Reporting errors to your rewriter users
You won't always be able to handle all the AST nodes passed to your expand functions, either because the end user misused your rewriter or because there are some cases you simply can't deal with.
In those cases you can report the error to the user with Ppxlib.Location.raise_errorf
. It works
similarly to printf
and you can build your error message from a format string and extra
arguments. It will then raise an exception which will be caught and reported by the compiler.
A good practice is to prefix the error message with the name of your rewriter to help users understand
what's going on, especially with deriving plugin as they might use several of them on the same type
declaration.
Another point to take care of here is, again, locations. raise_errorf
takes a labelled loc
arguments. It is used so that your error is reported as any compiler error. Having good locations in
those error messages is just as important as sending clear error messages. Keep in mind that both
the errors you report yourself or errors coming from your generated code will be highlighted by
merlin so when properly set they make it much easier to work with your PPX rewriter.
Testing your PPX
Just as most pieces of code do, a PPX deserves to be tested and it has become easier over the years to test rewriters.
I personally tend to write as many unit test as possible for my PPX-es internal libraries. I try to
extract helper functions that can easily be unit-tested but I can't test it all that way.
Testing the ast -> ast
functions would be tedious as ppxlib
and ocaml-migrate-parsetree
don't provide comparison and pretty printing functions that you can use with alcotest
or oUnit
.
That means you'd have to import the AST types and derive them on your own. That would make a lot
of boiler plate and even if those functions were exposed, writing such tests would be really
tedious. There's a lot of things to take into account. How are you going to build the input AST values
for instance? If you use metaquot
, every node will share the same loc, making it hard to test
that your errors are properly located. If you don't, you will end up with insanely long and
unreadable test code or fixtures.
While that would allow extremely accurate testing for the generated code and errors, it will almost
certainly make your test code unmaintainable, at least given the current tooling.
Don't panic, there is a very good and simple alternative. ppxlib
makes it very easy to build a
binary that will parse OCaml code, preprocess the AST with your rewriter and spit it out, formatted as
code again.
You just have to write the following pp.ml
:
let () = Ppxlib.Driver.standalone ()
and build the binary with the following dune
stanza, assuming your rewriter is called
my_ppx_rewriter
:
(executable
(name pp)
(modules pp)
(libraries my_ppx_rewriter ppxlib))
Because we're humans and the OCaml syntax is meant for us to write and read, it makes for much better
test input/output. You can now write your test input in a regular .ml
file, use the pp.exe
binary to "apply" your preprocessor to it and compare the output with another .ml
file containing
the code you expect it to generate. This kind of test pattern is really well supported by dune
thanks to the diff
user action.
I usually have the following files in a rewriter
/deriver
folder within my test directory:
test/rewriter/
├── dune
├── test.expected.ml
├── pp.ml
└── test.ml
Where pp.ml
is used to produce the rewriter binary, test.ml
contains the input OCaml code and
test.expected.ml
the result of preprocessing test.ml
. The dune file content is generally similar
to this:
(executable
(name pp)
(modules pp)
(libraries my_ppx_rewriter ppxlib))
(rule
(targets test.actual.ml)
(deps (:pp pp.exe) (:input test.ml))
(action (run ./%{pp} -deriving-keep-w32 both --impl %{input} -o %{targets})))
(alias
(name runtest)
(action (diff test.expected.ml test.actual.ml)))
(test
(name test)
(modules test)
(preprocess (pps my_ppx_rewriter)))
The first stanza is the one I already introduced above and specifies how to build the rewriter binary.
The rule
stanza that comes after that indicates to dune
how to produce the actual test output by
applying the rewriter binary to test.ml
. You probably noticed the -deriving-keep-w32 both
CLI
option passed to pp.exe
. By default, ppxlib
will generate values or add attributes so that your
generated code doesn't trigger a "Unused value" warning. This is useful in real life situation but
here it will just pollute the test output and make it harder to read so we disable that feature.
The following alias
stanza is where all the magic happens. Running dune runtest
will now
generate test.actual.ml
and compare it to test.expected.ml
. It will not only do that but show
you how they differ from each other in a diff format. You can then automatically update
test.expected.ml
if you're happy with the results by running dune promote
.
Finally the last test
stanza is there to ensure that the generated code compiles without type
errors.
This makes a very convenient test setup to write your PPX-es TDD style. You can start by writing an
identity PPX, that will just return its input AST as it is. Then you add some OCaml code using your
soon to be PPX in test.ml
and run dune runtest --auto-promote
to prefill test.expected.ml
.
From there you can start implementing your rewriter and run dune runtest
to check on your progress
and update the expected result with dune promote
.
Going pure TDD by writing the test works but it's tricky cause you'd have to format your code the
same way pp.exe
will format the AST. It would be great to be able to specify how to format
the generated test.actual.ml
so that this approach would be more viable and the diff more
readable. Being able to use ocamlformat with a very diff friendly configuration would be great
there. pp.exe
seems to offer CLI options to change the code style such as -styler
but I haven't
had the chance to experiment with those yet.
Now you can test successful rewriting this way but what about errors? There's a lot of value ensuring you produce the right errors and on the right code location because that's the kind of things you can get wrong when refactoring your rewriter code or when people try to contribute. That isn't as likely to happen if your CI yells when you break the error reporting. So how do we do that?
Well pretty much the exact same way! We write a file with an erroneous invocation of our rewriter,
run pp.exe
on it and compare stderr with what we expect it to be.
There are two major differences here. First we want to collect the stderr output of the rewriter
binary instead of using it to generate a file. The second is that we cant write all of our test
cases in a single file since pp.exe
will stop at the first error. That means we need one .ml
file per error test case.
Luckily for us, dune offers ways to do both.
For every error test file we will want to add the following stanzas:
(rule
(targets test_error.actual)
(deps (:pp pp.exe) (:input test_error.ml))
(action
(with-stderr-to
%{targets}
(bash "./%{pp} -no-color --impl %{input} || true")
)
)
)
(alias
(name runtest)
(action (diff test_error.expected test_error.actual))
)
but obviously we don't want to do that by hand every time we add a new test case so we're gonna need
a script to generate those stanzas and then include them into our dune
file using
(include dune.inc)
.
To achieve that while keeping things as clean as possible I use the following directory structure:
test/rewriter/
├── errors
│ ├── dune
│ ├── dune.inc
│ ├── gen_dune_rules.ml
│ ├── pp.ml
│ ├── test_some_error.expected
│ ├── test_some_error.ml
│ ├── test_some_other_error.expected
│ └── test_some_other_error.ml
├── dune
├── test.expected.ml
├── pp.ml
└── test.ml
Compared to our previous setup, we only added the new errors
folder. To keep things simple it has
its own pp.ml
copy but in the future I'd like to improve it a bit and be able to use the same
pp.exe
binary.
The most important files here are gen_dune_rules.ml
and dune.inc
. The first is just a simple
OCaml script to generate the above stanzas for each test cases in the errors
directory. The second
is the file we'll include in the main dune
. It's also the file to which we'll write the generated
stanza.
I personally use the following gen_dune_rules.ml
:
let output_stanzas filename =
let base = Filename.remove_extension filename in
Printf.printf
{|
(library
(name %s)
(modules %s)
(preprocess (pps ppx_yojson))
)
(rule
(targets %s.actual)
(deps (:pp pp.exe) (:input %s.ml))
(action
(with-stderr-to
%%{targets}
(bash "./%%{pp} -no-color --impl %%{input} || true")
)
)
)
(alias
(name runtest)
(action (diff %s.expected %s.actual))
)
|}
base
base
base
base
base
base
let is_error_test = function
| "pp.ml" -> false
| "gen_dune_rules.ml" -> false
| filename -> Filename.check_suffix filename ".ml"
let () =
Sys.readdir "."
|> Array.to_list
|> List.sort String.compare
|> List.filter is_error_test
|> List.iter output_stanzas
Nothing spectacular here, we just build the list of all the .ml
files in the directory except
pp.ml
and gen_dune_rules.ml
itself and then generate the right stanzas for each of them. You'll
note the extra library
stanza which I add to get dune to generate the right .merlin
so that I
can see the error highlights when I edit the files by hand.
With that we're almost good, add the following to the dune
file and you're all set:
(executable
(name pp)
(modules pp)
(libraries
ppx_yojson
ppxlib
)
)
(include dune.inc)
(executable
(name gen_dune_rules)
(modules gen_dune_rules)
)
(rule
(targets dune.inc.gen)
(deps
(:gen gen_dune_rules.exe)
(source_tree .)
)
(action
(with-stdout-to
%{targets}
(run %{gen})
)
)
)
(alias
(name runtest)
(action (diff dune.inc dune.inc.gen))
)
The first stanza is here to specify how to build the rewriter binary, same as before, while the
second stanza just tells dune to include the content of dune.inc
within this dune
file.
The interesting part comes next. As you can guess the executable
stanza builds our little OCaml
script into a .exe
. The rule
that comes after that specifies how to generate the new stanzas
by running gen_dune_rules
and capturing its standard output into a dune.inc.gen
file.
The last rule allows you to review the changes to the generated stanza and use promotion to accept
them. Once this is done, the new stanzas will be included to the dune
file and the test will be
run for every test cases.
Adding a new test case is then pretty easy, you can simply run:
$ touch test/rewriter/errors/some_explicit_test_case_name.{ml,expected} && dune runtest --auto-promote
That will create the new empty test case and update the dune.inc
with the corresponding rules.
From there you can proceed the same way as with the successful rewriting tests, update the .ml
,
run dune runtest
to take a sneak peek at the output and dune promote
once you're satisfied with
the result.
I've been pretty happy with this setup so far although there's room for improvement. It would be
nice to avoid duplicating pp.ml
for errors testing. This also involves
quite a bit of boilerplate that I have to copy into all my PPX rewriters repositories every time.
Hopefully dune plugins should help with that and I
can't wait for a first version to be released so that I can write a plugin to make this test
pattern more accessible and easier to set up.
Open-Source Development
Tarides champions open-source development. We create and maintain key features of the OCaml language in collaboration with the OCaml community. To learn more about how you can support our open-source work, discover our page on GitHub.
Stay Updated on OCaml and MirageOS!
Subscribe to our mailing list to receive the latest news from Tarides.
By signing up, you agree to receive emails from Tarides. You can unsubscribe at any time.