Mr. MIME - Parse and generate emails

by Romain Calascibetta on Sep 25th, 2019

We're glad to announce the first release of mrmime, a parser and a generator of emails. This library provides an OCaml way to analyze and craft an email. The eventual goal is to build an entire unikernel-compatible stack for email (such as SMTP or IMAP).

In this article, we will show what is currently possible with mrmime and present a few of the useful libraries that we developed along the way.

An email parser

Some years ago, Romain gave a talk about what an email really is. Behind the human-comprehensible format (or rich-document as we said a long time ago), there are several details of emails which complicate the process of analyzing them (and can be prone to security lapses). These details are mostly described by three RFCs:

RFC822
RFC2822
RFC5322

Even though they are cross-compatible, providing full legacy email parsing is an archaeological exercise: each RFC retains support for the older design decisions (which were not recognized as bad or ugly in 1970 when they were first standardized).

The latest email-related RFC (RFC5322) tried to fix the issue and provide a better formal specification of the email format – but of course, it comes with plenty of obsolete rules which need to be implemented. In the standard, you find both the current grammar rule and its obsolete equivalent.
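As one small illustration of a current rule living next to its obsolete equivalent, RFC5322 defines the year of a date as four or more digits, while the obsolete form also accepts two or three digits that must be reinterpreted. The sketch below uses only the OCaml standard library; `is_digits` and `normalise_year` are hypothetical names and are not part of mrmime.

```ocaml
(* Hypothetical helper: true when [s] is a non-empty run of ASCII digits. *)
let is_digits s =
  s <> "" &&
  (let ok = ref true in
   String.iter (fun c -> if c < '0' || c > '9' then ok := false) s;
   !ok)

(* Normalise the year of a Date field.
   Current rule: 4 or more digits, taken as-is.
   Obsolete rule: 2 digits in 00-49 mean 20xx, 2 digits in 50-99 and
   any 3-digit year mean 1900 + value. *)
let normalise_year (s : string) : int option =
  if not (is_digits s) then None
  else
    match String.length s, int_of_string_opt s with
    | _, None -> None
    | 1, _ -> None
    | 2, Some n when n <= 49 -> Some (2000 + n)
    | (2 | 3), Some n -> Some (1900 + n)
    | _, Some n -> Some n
```

A real parser has to carry this kind of double reading for many more rules (comments, folding whitespace, address lists), which is where most of the complexity comes from.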

An extended email parser

Even though the email format can be defined by "only" 3 RFCs, relying on them alone misses email internationalization (RFC6532), the MIME format (RFC2045, RFC2046, RFC2047, RFC2049), and certain details needed to be interoperable with SMTP (RFC5321). There are still more RFCs which add extra features to the email format, such as S/MIME or the Content-Disposition field.

Given this complexity, we took the most general RFCs and tried to provide an easy way to deal with them. The main difficulty is the multipart parser, which deals with email attachments (anyone who has tried to make an HTTP 1.1 parser knows about this).
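To give a rough idea of what the multipart parser has to do, here is a deliberately naive sketch that splits a body on its boundary lines. It uses only the standard library, and `split_parts` is a hypothetical name, not mrmime's API. It assumes the whole body is in memory, ignores nested multiparts and does not tolerate padding after the boundary, all of which a real parser must handle.

```ocaml
(* Split a multipart body into its raw parts.
   "--boundary" starts a part, "--boundary--" ends the last one.
   Text before the first delimiter (preamble) and after the closing
   delimiter (epilogue) is dropped. *)
let split_parts ~boundary (body : string) : string list =
  let delimiter = "--" ^ boundary in
  let close = delimiter ^ "--" in
  let strip_cr l =
    let n = String.length l in
    if n > 0 && l.[n - 1] = '\r' then String.sub l 0 (n - 1) else l in
  let lines = List.map strip_cr (String.split_on_char '\n' body) in
  (* [current] is None outside of a part, Some reversed_lines inside one. *)
  let flush current parts =
    match current with
    | None -> parts
    | Some rev -> String.concat "\n" (List.rev rev) :: parts in
  let rec go current parts = function
    | [] -> List.rev (flush current parts) (* lenient: no closing delimiter *)
    | l :: _ when l = close -> List.rev (flush current parts)
    | l :: rest when l = delimiter -> go (Some []) (flush current parts) rest
    | l :: rest ->
      (match current with
       | None -> go None parts rest
       | Some rev -> go (Some (l :: rev)) parts rest)
  in
  go None [] lines
```

Each returned part still contains its own header and body, so the parser has to be applied again on every part.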

A realistic email parser

Respecting the rules described by RFCs is not enough to be able to analyze any email from the real world: existing email generators can, and do, produce non-compliant email. We stress-tested mrmime by feeding it a batch of 2 billion emails taken from the wild, to see if it could parse everything (even if it does not produce the expected result). Whenever we noticed a recurring formatting mistake, we updated the details of the ABNF to enable mrmime to parse it anyway.

A parser usable by others

One demonstration of the usability of mrmime is ocaml-dkim, which wants to extract a specific field from your mail and then verify that the hash and signature are as expected.

ocaml-dkim uses the latest implementation of ocaml-dns to request the public keys needed to verify an email.

The most important question about ocaml-dkim is: is it able to verify your email in one pass? Indeed, currently some implementations of DKIM need 2 passes to verify your email (one to extract the DKIM signature, the other to digest some fields and bodies). We focused on verifying in a single pass in order to provide a unikernel SMTP relay with no need to store your email between verification passes.

An email generator

OCaml is a good language for making little DSLs for specialized use-cases. In this case, we took advantage of OCaml to allow the user to easily craft an email from nothing.

The idea is to build an OCaml value describing the desired email header, and then let the Mr. MIME generator transform this into a stream of characters that can be consumed by, for example, an SMTP implementation. The description step is quite simple:

#require "mrmime" ;;
#require "ptime.clock.os" ;;

open Mrmime

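(* romain.calascibetta@gmail.com: the local part is a list of words,
   the domain a list of atoms. *)
let romain_calascibetta =
  let open Mailbox in
  Local.[ w "romain"; w "calascibetta" ] @ Domain.(domain, [ a "gmail"; a "com" ])

(* john@doe.org, with a display name attached. *)
let john_doe =
  let open Mailbox in
  Local.[ w "john" ] @ Domain.(domain, [ a "doe"; a "org" ])
  |> with_name Phrase.(v [ w "John"; w "D." ])

(* The current time, expressed in the GMT zone. *)
let now () =
  let open Date in
  of_ptime ~zone:Zone.GMT (Ptime_clock.now ())

(* Words and explicit spaces of the Subject field. *)
let subject =
  Unstructured.[ v "A"; sp 1; v "Simple"; sp 1; v "Mail" ]

(* Fields are chained with [&] and the list is closed by [empty]. *)
let header =
  let open Header in
  Field.(Subject $ subject)
  & Field.(Sender $ romain_calascibetta)
  & Field.(To $ Address.[ mailbox john_doe ])
  & Field.(Date $ now ())
  & empty

(* [stream ()] returns the next chunk of the header, or None at the end. *)
let stream = Header.to_stream header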

let () =
  let rec go () =
    match stream () with
    | Some buf -> print_string buf; go ()
    | None -> ()
  in
  go ()

This code produces the following header:

Date: 2 Aug 2019 14:10:10 GMT
To: John "D." <john@doe.org>
Sender: romain.calascibetta@gmail.com
Subject: A Simple Mail

78-character rule

Email and SMTP carry some historical rules about how messages must be generated. One of them limits the number of bytes per line: a mail generator should emit at most 80 bytes per line (78 characters plus the terminating CRLF) and, of course, it should emit the email line by line.

So mrmime has its own encoder, which tries to wrap your mail within this limit. It was mostly inspired by Faraday and Format, and relies on GADTs to easily describe how to encode/generate parts of an email.
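The wrapping itself can be pictured with a much simpler, greedy sketch: words are appended to the current line, and when the next word would cross the limit, the line is broken with a CRLF followed by a space. This is not how mrmime's encoder is written; `fold` is a hypothetical name and the code only relies on the standard library.

```ocaml
(* Emit "name: value" folded so that no line exceeds [limit] characters
   (CRLF excluded). A continuation line starts with a single space.
   A word longer than the limit cannot be folded and is left as-is. *)
let fold ?(limit = 78) (name : string) (value : string) : string =
  let buf = Buffer.create 128 in
  Buffer.add_string buf name;
  Buffer.add_char buf ':';
  (* [col] is the number of characters already on the current line. *)
  let col = ref (String.length name + 1) in
  let words =
    List.filter (fun w -> w <> "") (String.split_on_char ' ' value) in
  List.iter (fun w ->
    let len = String.length w in
    if !col + 1 + len > limit then begin
      Buffer.add_string buf "\r\n ";
      col := 1
    end else begin
      Buffer.add_char buf ' ';
      incr col
    end;
    Buffer.add_string buf w;
    col := !col + len) words;
  Buffer.add_string buf "\r\n";
  Buffer.contents buf
```

For instance, `fold "Subject" "A Simple Mail"` fits on one line and yields `"Subject: A Simple Mail\r\n"`, while a longer value is spread over several continuation lines.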

A multipart email generator

Of course, the main point of an email generator is to be able to produce multipart emails, if only to send file attachments. A lot of work went into making parts, composing them with specific Content-Type fields and merging them into one email.

In the end, you can easily make a stream from it, which respects the rules (78 characters per line, emitted line by line), and feed it directly to an SMTP implementation.
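To make the shape of the result concrete, the following sketch assembles a multipart/mixed message by hand with the standard library. The `part` type and the `multipart` function are hypothetical and much cruder than what mrmime offers: the caller is assumed to pick a boundary that appears in no body, and the bodies are assumed to be already encoded and wrapped.

```ocaml
(* A part reduced to its Content-Type and an already-encoded body. *)
type part = { content_type : string; body : string }

(* Write the top-level Content-Type, then every part between
   "--boundary" delimiters, then the closing "--boundary--". *)
let multipart ~boundary (parts : part list) : string =
  let buf = Buffer.create 256 in
  Buffer.add_string buf
    (Printf.sprintf "Content-Type: multipart/mixed; boundary=\"%s\"\r\n\r\n"
       boundary);
  List.iter (fun p ->
    Buffer.add_string buf (Printf.sprintf "--%s\r\n" boundary);
    Buffer.add_string buf
      (Printf.sprintf "Content-Type: %s\r\n\r\n" p.content_type);
    Buffer.add_string buf p.body;
    Buffer.add_string buf "\r\n") parts;
  Buffer.add_string buf (Printf.sprintf "--%s--\r\n" boundary);
  Buffer.contents buf
```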

This is what we did with the project facteur. It's a little command-line tool, written in pure OCaml, to send mails with file attachments; for now it only works on UNIX operating systems.

Behind the forest

Even if you are able to parse and generate an email, more work is needed to get the expected results.

Indeed, email is a unit of exchange between people, and the biggest challenge is to find a common way to ensure that they can understand each other. Here, encoding is probably the most important piece: a French person may want to write in latin1 while an American person can still use ASCII.

Rosetta

To address this problem, we chose to unify all contents to UTF-8, the most general encoding in the world. So we wrote some libraries which map an encoded flow to Unicode code points, and we use uutf (thanks to dbuenzli) to normalize it to UTF-8.

The main goal is to spare the user this headache: even if the contents of a mail are encoded in latin1, we make sure to translate them correctly (and according to the RFCs) to UTF-8.

This project is rosetta and it comes with:

uuuu for ISO-8859 encoding
coin for KOI8-{R,U} encoding
yuscii for UTF-7 encoding
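The latin1 case is the simplest one to picture, because every latin1 byte maps to the Unicode code point of the same value. The sketch below does that translation with the standard library only; it does not use rosetta, uuuu or uutf, and `latin1_to_utf8` is a hypothetical name. The other ISO-8859 parts and KOI8 need real mapping tables, which is what the libraries above provide.

```ocaml
(* Translate an ISO-8859-1 (latin1) string to UTF-8: each byte is the
   code point of the same value, re-encoded as UTF-8. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s * 2) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf
```

ASCII text goes through unchanged, while a byte such as `\xE9` (é) becomes the two bytes `\xC3\xA9`.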

Pecu and Base64

Then, bodies can be encoded in several ways, precisely 2 if we stick to the main standard:

A base64 encoding, used to store your file
A quoted-printable encoding

The base64 package comes with a sub-package, base64.rfc2045, which handles the special case of encoding a body according to RFC2045 and the SMTP limitation.
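The special case is mostly about line length: RFC2045 asks for encoded lines of at most 76 characters. As a minimal sketch of that constraint, written against the standard library only and unrelated to the actual base64.rfc2045 API, the hypothetical `base64_mime` below encodes a string and inserts a CRLF after every 76 output characters.

```ocaml
let alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

(* Base64-encode [input], breaking the output with CRLF so that no
   line is longer than 76 characters. *)
let base64_mime (input : string) : string =
  let buf = Buffer.create (String.length input * 4 / 3 + 8) in
  let col = ref 0 in
  let emit c =
    if !col = 76 then (Buffer.add_string buf "\r\n"; col := 0);
    Buffer.add_char buf c;
    incr col in
  let len = String.length input in
  (* Bytes past the end count as 0; they only feed the '=' padding. *)
  let byte i = if i < len then Char.code input.[i] else 0 in
  let i = ref 0 in
  while !i < len do
    (* Pack 3 input bytes into 24 bits, then cut them into 4 groups of 6. *)
    let n = (byte !i lsl 16) lor (byte (!i + 1) lsl 8) lor byte (!i + 2) in
    emit alphabet.[(n lsr 18) land 63];
    emit alphabet.[(n lsr 12) land 63];
    emit (if !i + 1 < len then alphabet.[(n lsr 6) land 63] else '=');
    emit (if !i + 2 < len then alphabet.[n land 63] else '=');
    i := !i + 3
  done;
  Buffer.contents buf
```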

Then, pecu was made to encode and decode quoted-printable contents. It was tested and fuzzed, of course, like any other MirageOS library.
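For readers who have never looked at quoted-printable, here is a small sketch of the encoding side, using only the standard library and not pecu's API. The hypothetical `qp_encode_line` encodes one line of text given without its line terminator: printable ASCII is kept, everything else becomes `=XX`, and a soft line break (`=` followed by CRLF) keeps encoded lines within 76 characters.

```ocaml
(* Quoted-printable encoding of a single line (no CRLF inside).
   - printable ASCII except '=' is copied as-is;
   - space and tab are copied unless they end the line;
   - any other byte is written as "=XX" in upper-case hexadecimal;
   - "=\r\n" is a soft line break, removed again by the decoder. *)
let qp_encode_line (line : string) : string =
  let buf = Buffer.create (String.length line * 3) in
  let col = ref 0 in
  let emit (s : string) =
    (* Keep room for the '=' of the soft break: 75 + 1 = 76 characters. *)
    if !col + String.length s > 75 then begin
      Buffer.add_string buf "=\r\n";
      col := 0
    end;
    Buffer.add_string buf s;
    col := !col + String.length s in
  let last = String.length line - 1 in
  String.iteri (fun i c ->
    let code = Char.code c in
    let literal =
      (code >= 33 && code <= 126 && c <> '=')
      || ((c = ' ' || c = '\t') && i < last) in
    if literal then emit (String.make 1 c)
    else emit (Printf.sprintf "=%02X" code)) line;
  Buffer.contents buf
```

With this sketch, the latin1 string `"caf\xE9"` is encoded as `caf=E9`, which only contains 7-bit characters.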

These libraries are needed for another historical reason: the bytes used to store a mail should use only 7 bits instead of 8. This is the purpose of the base64 and quoted-printable encodings, which only produce byte values that fit in 7 bits (0 to 127). Again, this limitation comes from the SMTP protocol.

Conclusion

mrmime tackles the difficult task of parsing and generating emails according to 50 years of usage, several RFCs and legacy rules. So it is still an experimental project. We are releasing this first version because we are now able to parse many mails and then generate them correctly.

Of course, a bug (a malformed mail, a server which does not respect the standards or a misuse of our API) can easily appear wherever we did not test. But we feel that it is time to release it and let people use it.

The best feedback about mrmime, and the best improvements, will come from you. So don't be afraid to use it and start hacking your emails with it.