Mr. MIME - Parse and generate emails
by Romain Calascibetta on Sep 25th, 2019
We're glad to announce the first release of
mrmime, a parser and a
generator of emails. This library provides an OCaml way to analyze and craft
an email. The eventual goal is to build an entire unikernel-compatible stack
for email (such as SMTP or IMAP).
In this article, we will show what is currently possible with mrmime and
present a few of the useful libraries that we developed along the way.
Some years ago, Romain gave a talk about what an email really is. Behind the human-comprehensible format (or rich-document as we said a long time ago), there are several details of emails which complicate the process of analyzing them (and can be prone to security lapses). These details are mostly described by three RFCs:
Even though they are cross-compatible, providing full legacy email parsing is an archaeological exercise: each RFC retains support for the older design decisions (which were not recognized as bad or ugly in the 1970s when they were first standardized).
The latest email-related RFC (RFC5322) tried to fix the issue and provide a better formal specification of the email format – but of course, it comes with plenty of obsolete rules which need to be implemented. In the standard, you find both the current grammar rule and its obsolete equivalent.
Even if the email format can be defined by "only" 3 RFCs, that leaves out email internationalization (RFC6532), the MIME format (RFC2045, RFC2046, RFC2047, RFC2049), and certain details needed to be interoperable with SMTP (RFC5321). Still more RFCs add extra features to the email format, such as S/MIME or the Content-Disposition field.
Given this complexity, we took the most general RFCs and tried to provide an easy way to deal with them. The main difficulty is the multipart parser, which deals with email attachments (anyone who has tried to make an HTTP 1.1 parser knows about this).
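To give an idea of what the multipart problem looks like, here is a minimal sketch that splits the lines of a body on a boundary delimiter. It is not mrmime's parser: the function name `split_multipart` is hypothetical, the body is assumed to be already cut into lines without their CRLF, and the headers of each part are left unparsed.

```ocaml
(* Hypothetical sketch, not mrmime's parser: split the lines of a multipart
   body into parts. A part starts after a "--boundary" line; the body ends
   at the "--boundary--" line. The preamble and the epilogue are ignored. *)
let split_multipart ~(boundary : string) (lines : string list) :
    string list list =
  let delimiter = "--" ^ boundary in
  let close = delimiter ^ "--" in
  let rec go current parts = function
    | [] ->
      (* No close delimiter was seen: keep only the finished parts. *)
      List.rev parts
    | l :: _ when l = close ->
      (match current with
       | Some p -> List.rev (List.rev p :: parts)
       | None -> List.rev parts)
    | l :: rest when l = delimiter ->
      let parts =
        match current with
        | Some p -> List.rev p :: parts
        | None -> parts
      in
      go (Some []) parts rest
    | l :: rest ->
      (match current with
       | Some p -> go (Some (l :: p)) parts rest
       | None -> go None parts rest (* still in the preamble *))
  in
  go None [] lines
```

Even this toy version has to decide what to do with a preamble, an epilogue and a missing close delimiter. A real parser must also work on a stream rather than on a list of lines and handle nested multipart bodies.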
Respecting the rules described by RFCs is not enough to be able to analyze any
email from the real world: existing email generators can, and do, produce
non-compliant email. We stress-tested
mrmime by feeding it a batch of 2
billion emails taken from the wild, to see if it could parse everything (even if
it does not produce the expected result). Whenever we noticed a recurring
formatting mistake, we updated the details of the ABNF to enable
mrmime to parse it anyway.
One demonstration of the usability of mrmime is
ocaml-dkim, which wants to
extract a specific field from your mail and then verify that the hash and signature
are as expected.
ocaml-dkim uses the latest implementation of
ocaml-dns to request the
public keys needed to verify an email.
The most important question about
ocaml-dkim is: is it able to
verify your email in one pass? Indeed, some current implementations of DKIM
need 2 passes to verify your email (one to extract the DKIM signature, the other
to digest some fields and bodies). We focused on verifying in a single pass in
order to provide a unikernel SMTP relay with no need to store your email between
passes.
OCaml is a good language for making little DSLs for specialized use-cases. In this case, we took advantage of OCaml to allow the user to easily craft an email from nothing.
The idea is to build an OCaml value describing the desired email header, and then let the Mr. MIME generator transform this into a stream of characters that can be consumed by, for example, an SMTP implementation. The description step is quite simple:
```ocaml
#require "mrmime" ;;
#require "ptime.clock.os" ;;

open Mrmime

(* The sender: local part "romain.calascibetta", domain "gmail.com". *)
let romain_calascibetta =
  let open Mailbox in
  Local.[ w "romain"; w "calascibetta" ]
  @ Domain.(domain, [ a "gmail"; a "com" ])

(* The recipient: "john" at "doe.org", with a display name. *)
let john_doe =
  let open Mailbox in
  Local.[ w "john" ] @ Domain.(domain, [ a "doe"; a "org" ])
  |> with_name Phrase.(v [ w "John"; w "D." ])

(* The current time, expressed in the GMT zone. *)
let now () =
  let open Date in
  of_ptime ~zone:Zone.GMT (Ptime_clock.now ())

let subject =
  Unstructured.[ v "A"; sp 1; v "Simple"; sp 1; v "Mail" ]

(* The header is a sequence of fields terminated by [empty]. *)
let header =
  let open Header in
  Field.(Subject $ subject)
  & Field.(Sender $ romain_calascibetta)
  & Field.(To $ Address.[ mailbox john_doe ])
  & Field.(Date $ now ())
  & empty

let stream = Header.to_stream header

(* Consume the stream chunk by chunk and print it. *)
let () =
  let rec go () =
    match stream () with
    | Some buf -> print_string buf; go ()
    | None -> ()
  in
  go ()
```
This code produces the following header:
```
Date: 2 Aug 2019 14:10:10 GMT
To: John "D." <john@doe.org>
Sender: romain.calascibetta@gmail.com
Subject: A Simple Mail
```
Email and SMTP carry historical rules about how messages must be generated. One of them limits the number of bytes per line: a generator of mail should emit at most 80 bytes per line (78 characters plus the terminating CRLF) and, of course, it should emit the email line by line.
mrmime has its own encoder which tries to wrap your mail within this limit.
It was mostly inspired by Faraday and Format, and uses
GADTs to easily describe how to encode/generate parts of an email.
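As an illustration of the line limit only, here is a minimal sketch of folding a header line at whitespace. It is not mrmime's encoder: the function name `fold` and its `limit` parameter are hypothetical, and the sketch collapses runs of spaces and cannot split a single word longer than the limit, two things a real encoder has to handle properly.

```ocaml
(* Hypothetical sketch, not mrmime's encoder: fold a header line so that
   each output line holds at most [limit] characters (CRLF excluded).
   The line is split on spaces; every continuation line starts with a
   space, which is how a folded header is recognized. *)
let fold ?(limit = 78) (line : string) : string list =
  let words =
    String.split_on_char ' ' line |> List.filter (fun w -> w <> "")
  in
  let rec go cur acc = function
    | [] -> List.rev (if cur = "" then acc else cur :: acc)
    | w :: rest ->
      if cur = "" then go w acc rest
      else if String.length cur + 1 + String.length w <= limit then
        (* The word still fits on the current line. *)
        go (cur ^ " " ^ w) acc rest
      else
        (* Emit the current line and start a continuation line. *)
        go (" " ^ w) (cur :: acc) rest
  in
  go "" [] words
```

With a deliberately small limit, `fold ~limit:10 "Subject: A Simple Mail"` returns `["Subject: A"; " Simple"; " Mail"]`. Joining the result with `String.concat "\r\n"` gives the folded header as it would appear on the wire.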
Of course, the main point about email is to be able to generate a multipart
email, if only to send file attachments. A lot of work
went into this: making parts, composing them with their specific
fields and merging them into one email.
In the end, you can easily make a stream from it which respects the rules (78 bytes per line, not counting the CRLF, streamed line by line) and feed it directly to an SMTP implementation.
This is what we did with the project
facteur. It's a little
command-line tool to send mails with file attachments in pure OCaml - but
for now it works only on UNIX operating systems.
Even if you are able to parse and generate an email, more work is needed to get the expected results.
Indeed, email is a unit of exchange between people, and the biggest challenge is to find a common way to ensure understandable communication between them. Here, encoding is probably the most important piece: a French person may want to communicate using a latin1 encoding while an American person can still use ASCII.
To address this problem, we chose to unify all contents to UTF-8, the
most general encoding in the world. We wrote some libraries which map an encoded flow
to Unicode code points, and we use
uutf (thanks to dbuenzli) to normalize it to UTF-8.
The main goal is to spare the user this headache: even if the contents of a mail are encoded with latin1, we make sure to translate them correctly (and according to the RFCs) to UTF-8.
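The simplest case of this normalization can be sketched with the standard library alone, because every latin1 (ISO-8859-1) byte maps to the Unicode code point of the same value. The helper `latin1_to_utf8` below is hypothetical and only illustrates the idea; it is not the API of the libraries mentioned here and it assumes OCaml 4.06 or later for `Buffer.add_utf_8_uchar`.

```ocaml
(* Hypothetical helper for illustration: transcode a latin1 (ISO-8859-1)
   string to UTF-8. Each input byte is read as the Unicode code point of
   the same value, then written out with its UTF-8 encoding. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s * 2) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf
```

For example, the latin1 string `"caf\xE9"` becomes `"caf\xC3\xA9"`, the UTF-8 encoding of "café". Other charsets need a real mapping table, which is the part handled by dedicated libraries.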
This project is
rosetta and it comes with:
Bodies can then be encoded in several ways, two to be precise (if we stick to the main standard):
- A base64 encoding, used to store your file
- A quoted-printable encoding
The
base64 package comes with a sub-package
which respects the special case of encoding a body according to RFC2045 and SMTP.
pecu was made to encode and decode quoted-printable contents. It was
tested and fuzzed, of course, like any other MirageOS library.
These libraries are needed for another historical reason: bytes used to store mail should use only 7 bits instead of 8. This is the purpose of the base64 and quoted-printable encodings, which use only 128 of the 256 possible values of a byte. Again, this limitation comes from the SMTP protocol.
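To make the 7-bit constraint concrete, here is a minimal sketch of the byte-escaping rule of quoted-printable, applied to a single line of text. It is not pecu's API: the function name `qp_escape` is hypothetical, and the sketch leaves out the soft line breaks and the trailing-whitespace rule that RFC2045 also requires.

```ocaml
(* Hypothetical sketch, not pecu's API: escape one line of text the
   quoted-printable way. Printable ASCII is kept as-is, except '=' which
   introduces an escape; every other byte becomes "=XX" in uppercase
   hexadecimal, so the output only contains 7-bit characters. *)
let qp_escape (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '=' -> Buffer.add_string buf "=3D"
      | ' ' .. '~' -> Buffer.add_char buf c
      | _ -> Buffer.add_string buf (Printf.sprintf "=%02X" (Char.code c)))
    s;
  Buffer.contents buf
```

For example, the latin1 line `"caf\xE9 = ok"` is escaped to `"caf=E9 =3D ok"`: the 8-bit byte and the `=` sign are rewritten, and the rest stays readable.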
mrmime tackles the difficult task of parsing and generating emails according to 50 years of usage, several RFCs and legacy rules.
It is still an experimental project. We have reached the first version because we
are currently able to parse many mails and then generate them correctly.
Of course, a bug (a malformed mail, a server which does not respect the standards, or a bad use of our API) can easily appear where we did not test everything. But we felt it was time to release it and let people use it.
The best feedback about
mrmime and the best improvements will be yours. So don't be
afraid to use it and start hacking your emails with it.