Mr. MIME - Parse and generate emails
Software Engineer
We're glad to announce the first release of mrmime, a parser and a
generator of emails. This library provides an OCaml way to analyze and craft
an email. The eventual goal is to build an entire unikernel-compatible stack
for email (such as SMTP or IMAP).
In this article, we will show what is currently possible with mrmime and
present a few of the useful libraries that we developed along the way.
An email parser
Some years ago, Romain gave a talk about what an email really is. Behind the human-comprehensible format (or rich-document as we said a long time ago), there are several details of emails which complicate the process of analyzing them (and can be prone to security lapses). These details are mostly described by three RFCs:
- RFC822
- RFC2822
- RFC5322
Even though they are cross-compatible, providing full legacy email parsing is an archaeological exercise: each RFC retains support for the older design decisions (which were not recognized as bad or ugly in the 1970s, when they were first standardized).
The latest email-related RFC (RFC5322) tried to fix the issue and provide a better formal specification of the email format – but of course, it comes with plenty of obsolete rules which need to be implemented. In the standard, you find both the current grammar rule and its obsolete equivalent.
An extended email parser
Even if the email format can be defined by "only" 3 RFCs, doing so leaves out email internationalization (RFC6532), the MIME format (RFC2045, RFC2046, RFC2047, RFC2049), and certain details needed to be interoperable with SMTP (RFC5321). There are still more RFCs which add extra features to the email format such as S/MIME or the Content-Disposition field.
Given this complexity, we took the most general RFCs and tried to provide an easy way to deal with them. The main difficulty is the multipart parser, which deals with email attachments (anyone who has tried to make an HTTP 1.1 parser knows about this).
A realistic email parser
Respecting the rules described by RFCs is not enough to be able to analyze any
email from the real world: existing email generators can, and do, produce
non-compliant email. We stress-tested mrmime by feeding it a batch of 2
billion emails taken from the wild, to see if it could parse everything (even if
it does not produce the expected result). Whenever we noticed a recurring
formatting mistake, we updated the details of the ABNF to enable mrmime
to parse it anyway.
A parser usable by others
One demonstration of the usability of mrmime is ocaml-dkim, which
extracts a specific field from your mail and then verifies that the hash and
signature are as expected.
ocaml-dkim uses the latest implementation of ocaml-dns to request
public keys in order to verify email.
The most important question about ocaml-dkim is: is it able to
verify your email in one pass? Indeed, currently some implementations of DKIM
need 2 passes to verify your email (one to extract the DKIM signature, the other
to digest some fields and bodies). We focused on verifying in a single pass in
order to provide a unikernel SMTP relay with no need to store your email between
verification passes.
An email generator
OCaml is a good language for making little DSLs for specialized use-cases. In this case, we took advantage of OCaml to allow the user to easily craft an email from nothing.
The idea is to build an OCaml value describing the desired email header, and then let the Mr. MIME generator transform this into a stream of characters that can be consumed by, for example, an SMTP implementation. The description step is quite simple:
#require "mrmime" ;;
#require "ptime.clock.os" ;;

open Mrmime

(* A bare mailbox: local part and domain *)
let romain_calascibetta =
  let open Mailbox in
  Local.[ w "romain"; w "calascibetta" ] @ Domain.(domain, [ a "gmail"; a "com" ])

(* A mailbox with a display name *)
let john_doe =
  let open Mailbox in
  Local.[ w "john" ] @ Domain.(domain, [ a "doe"; a "org" ])
  |> with_name Phrase.(v [ w "John"; w "D." ])

(* The current time, expressed in the GMT zone *)
let now () =
  let open Date in
  of_ptime ~zone:Zone.GMT (Ptime_clock.now ())

let subject =
  Unstructured.[ v "A"; sp 1; v "Simple"; sp 1; v "Mail" ]

(* Describe the header as an OCaml value, field by field *)
let header =
  let open Header in
  Field.(Subject $ subject)
  & Field.(Sender $ romain_calascibetta)
  & Field.(To $ Address.[ mailbox john_doe ])
  & Field.(Date $ now ())
  & empty

(* Turn the description into a stream of strings *)
let stream = Header.to_stream header

(* Drain the stream and print each chunk *)
let () =
  let rec go () =
    match stream () with
    | Some buf -> print_string buf; go ()
    | None -> ()
  in
  go ()
This code produces the following header:
Date: 2 Aug 2019 14:10:10 GMT
To: John "D." <john@doe.org>
Sender: romain.calascibetta@gmail.com
Subject: A Simple Mail
78-character rule
Email and SMTP come with historical rules about how messages must be generated. One of them limits the number of bytes per line: a mail generator should emit at most 80 bytes per line (78 characters plus the terminating CRLF) and, of course, it should emit the email line by line.
So mrmime has its own encoder, which tries to wrap your mail within this limit.
It was mostly inspired by Faraday and Format, combined with GADTs
to easily describe how to encode/generate parts of an email.
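The wrapping idea can be illustrated with a minimal sketch that uses only the OCaml standard library. It is not mrmime's encoder: the function name fold and the greedy strategy are assumptions made for illustration. It breaks a line at spaces so that no output line exceeds the limit, and starts each continuation line with a space, as folded header lines do.

```ocaml
(* Hypothetical helper, not part of mrmime: greedily fold a line at spaces
   so that each output line holds at most [limit] characters (CRLF excluded).
   A single word longer than [limit] is left on a line of its own. *)
let fold ?(limit = 78) line =
  let words = String.split_on_char ' ' line in
  let buf = Buffer.create (String.length line + 16) in
  (* [col] is the number of characters already written on the current line. *)
  let col = ref 0 in
  List.iteri
    (fun i word ->
      let len = String.length word in
      if i = 0 then begin
        Buffer.add_string buf word;
        col := len
      end else if !col + 1 + len > limit then begin
        (* Fold: end the line and continue on the next one after a space. *)
        Buffer.add_string buf "\r\n ";
        Buffer.add_string buf word;
        col := 1 + len
      end else begin
        Buffer.add_char buf ' ';
        Buffer.add_string buf word;
        col := !col + 1 + len
      end)
    words;
  Buffer.contents buf
```

For example, fold ~limit:10 "Subject: hello world foo" yields three lines: "Subject:", " hello" and " world foo". Removing each CRLF gives back the original line, which is why folding at existing spaces is a safe choice for a sketch like this one.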
A multipart email generator
Of course, the main point of email is to be able to generate a multipart
email, if only to be able to send file attachments. A lot of work went into
this: making parts, composing them under specific Content-Type
fields and merging them into one email.
In the end, you can easily make a stream from it, which respects the rules (78 characters per line, emitted line by line), and feed it directly into an SMTP implementation.
This is what we did with the project facteur. It's a little
command-line tool, written in pure OCaml, to send mails with file attachments,
but for the moment it works only on UNIX operating systems.
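To make the composition step concrete, here is a minimal sketch of how parts are laid out in a multipart body. The part record and the multipart function are hypothetical and are not mrmime's API. The sketch also assumes that the boundary string never occurs inside a body and that bodies are already encoded.

```ocaml
(* Hypothetical representation of a part, not mrmime's types. *)
type part = { content_type : string; body : string }

(* Lay out the parts of a multipart body: each part is introduced by
   "--boundary", and the whole body is closed by "--boundary--".
   Assumption: [boundary] does not appear in any [body]. *)
let multipart ~boundary parts =
  let buf = Buffer.create 256 in
  List.iter
    (fun p ->
      Buffer.add_string buf ("--" ^ boundary ^ "\r\n");
      Buffer.add_string buf ("Content-Type: " ^ p.content_type ^ "\r\n");
      (* An empty line separates the part's header from its body. *)
      Buffer.add_string buf "\r\n";
      Buffer.add_string buf p.body;
      Buffer.add_string buf "\r\n")
    parts;
  Buffer.add_string buf ("--" ^ boundary ^ "--\r\n");
  Buffer.contents buf
```

A real generator also has to pick a boundary that cannot collide with the contents and to emit the enclosing Content-Type field carrying that boundary, which is part of what makes the multipart case harder than a single-body email.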
Behind the forest
Even if you are able to parse and generate an email, more work is needed to get the expected results.
Indeed, email is a unit of exchange between people, and the biggest challenge is to find a common way to ensure that they can understand each other. Here, encoding is probably the most important piece: a French person may want to communicate with a latin1 encoding, while an American person can still use ASCII.
Rosetta
To solve this problem, we chose to unify all contents to UTF-8, the
most general encoding in the world. So we wrote some libraries which map an encoded flow
to Unicode code points, and we use uutf
(thanks to dbuenzli) to normalize it to UTF-8.
The main goal is to spare the user this headache: even if the contents of the mail are encoded with latin1, we make sure to translate them correctly (and according to the RFCs) to UTF-8.
This project is rosetta and it comes with:
- uuuu for ISO-8859 encodings
- coin for KOI8-{R,U} encodings
- yuscii for the UTF-7 encoding
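As an illustration of the mapping from an encoding to Unicode code points, here is a minimal sketch for the latin1 case only. It uses the standard library (OCaml 4.06 or later) rather than the rosetta or uutf APIs, and the function name utf8_of_latin1 is an assumption. It relies on the fact that each ISO-8859-1 byte has the same value as its Unicode code point.

```ocaml
(* Stdlib-only illustration, not the rosetta or uutf API.
   Assumption: the input is ISO-8859-1 (latin1), where each byte maps to
   the Unicode code point of the same value. *)
let utf8_of_latin1 s =
  let buf = Buffer.create (String.length s * 2) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf
```

Other encodings need a real translation table between bytes and code points, which is the kind of work the libraries above take care of.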
Pecu and Base64
Bodies, in turn, can be encoded in several ways, two to be precise (if we stick to the main standard):
- A base64 encoding, used to store your file
- A quoted-printable encoding
As for the base64 package, it comes with a sub-package base64.rfc2045,
which handles the special case of encoding a body according to RFC2045 and the
SMTP limitations.
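The special case mostly concerns line length: RFC2045 limits base64 body lines to 76 characters. The sketch below shows that constraint with the standard library only. The names base64 and base64_rfc2045 are hypothetical and do not reflect the API of the base64 package.

```ocaml
let alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

(* Hypothetical helper: plain base64 encoding with '=' padding. *)
let base64 s =
  let n = String.length s in
  let buf = Buffer.create ((n + 2) / 3 * 4) in
  (* Bytes past the end of the input count as zero. *)
  let byte i = if i < n then Char.code s.[i] else 0 in
  let i = ref 0 in
  while !i < n do
    (* Pack up to three input bytes into a 24-bit group. *)
    let g = (byte !i lsl 16) lor (byte (!i + 1) lsl 8) lor byte (!i + 2) in
    let sextet k = alphabet.[(g lsr (18 - (6 * k))) land 63] in
    Buffer.add_char buf (sextet 0);
    Buffer.add_char buf (sextet 1);
    Buffer.add_char buf (if !i + 1 < n then sextet 2 else '=');
    Buffer.add_char buf (if !i + 2 < n then sextet 3 else '=');
    i := !i + 3
  done;
  Buffer.contents buf

(* Split the encoded text into CRLF-terminated lines of at most 76
   characters, the line length RFC2045 allows for base64 bodies. *)
let base64_rfc2045 s =
  let enc = base64 s in
  let n = String.length enc in
  let buf = Buffer.create (n + ((n / 76) + 1) * 2) in
  let pos = ref 0 in
  while !pos < n do
    let len = min 76 (n - !pos) in
    Buffer.add_string buf (String.sub enc !pos len);
    Buffer.add_string buf "\r\n";
    pos := !pos + len
  done;
  Buffer.contents buf
```

This version encodes the whole input at once for clarity. A streaming encoder, which is what an SMTP pipeline would want, has to carry leftover bytes and the current column from one chunk to the next.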
Then, pecu was made to encode and decode quoted-printable contents. It was
tested and fuzzed, of course, like any other MirageOS library.
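The core of quoted-printable is a byte-level escaping rule, sketched below with the standard library. This is a simplified illustration and not the pecu API: the name qp_escape is an assumption, and soft line breaks as well as the special handling of trailing whitespace are deliberately left out.

```ocaml
(* Simplified quoted-printable escaping (not the pecu API): printable ASCII
   other than '=' is kept, as are space and tab; every other byte becomes
   "=XX" with two uppercase hexadecimal digits. Soft line breaks and the
   handling of trailing whitespace are deliberately left out. *)
let qp_escape s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '=' -> Buffer.add_string buf "=3D"
      | ' ' | '\t' | '!' .. '~' -> Buffer.add_char buf c
      | _ -> Buffer.add_string buf (Printf.sprintf "=%02X" (Char.code c)))
    s;
  Buffer.contents buf
```

Text that is mostly ASCII stays readable after this escaping, which is why quoted-printable suits textual bodies while base64 suits binary attachments.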
These libraries are needed for another historical reason: bytes used to store mail should use only 7 bits instead of 8. This is the purpose of the base64 and quoted-printable encodings, which use only the 128 byte values that fit in 7 bits. Again, this limitation comes from the SMTP protocol.
Conclusion
mrmime tackles the difficult task of parsing and generating emails according to 50 years of usage, several RFCs and legacy rules.
So it is still an experimental project. We are releasing this first version
because we are now able to parse many mails and then generate them correctly.
Of course, a bug (a malformed mail, a server which does not respect standards or a bad use of our API) can easily appear wherever we did not test. But we have the feeling that it was time to release it and let people use it.
The best feedback about mrmime, and the best improvements, will come from you.
So don't be afraid to use it and start hacking your emails with it.