Making OCaml 5 Succeed for Developers and Organisations

OCaml recently won the ACM SIGPLAN PL Software Award. The award recognises a software system that has had a significant impact on programming language implementation, research, and tools. It is especially notable that 4 out of the 14 named OCaml compiler developers are affiliated with Tarides: Anil, David, Jérôme, and me. In this post, I discuss the wider effort afoot at Tarides in order to make OCaml 5, the latest release of the OCaml programming language, succeed for developers. I should note that I shall specifically focus on the new OCaml 5 features and omit important developments such as Tarides' work on the OCaml platform, which is discussed elsewhere.

I started hacking on OCaml when I joined Anil, Stephen, and Leo (who are also named in this award) at OCaml Labs in the University of Cambridge in 2014 to work on the Multicore OCaml project. The aim of the Multicore OCaml project was to add native support for concurrency and parallelism to the OCaml programming language. Multicore OCaml compiler was maintained as a fork of the OCaml compiler for many years before it merged with the mainline OCaml compiler in January 2022. After almost a year of work stabilising the features, OCaml 5.0 was finally released in December 2022, nearly 8 years after the first commit.

Has the Multicore OCaml project succeeded with the release of OCaml 5.0? The short answer is No. There is a long road to making OCaml 5 succeed for the developers. The goal of making OCaml 5 succeed for the developers is a two-step process:

Help developers transition existing programs to OCaml 5
Help developers take advantage of new concurrency and parallelism features in OCaml 5

Transitioning developers to OCaml 5

Even with the arrival of OCaml 5, most OCaml programs will remain sequential forever. It is important that developers can successfully transition their OCaml projects over to OCaml 5, even if they don't plan to use the new features. We have carefully designed OCaml 5 such that the breaking changes are minimized. In particular, we eschewed a potentially more scalable GC design since it broke the C FFI compatibility (see section 7 "Discussion" in the ICFP 2020 paper on the new GC design). The only breaking changes in OCaml 5 were the removal of the support for naked pointers and the unrelated removal of deprecated functions from the standard library. We released a dynamic detector for naked pointers to help developers find and remove naked pointers from their codebase.

Restoring Unimplemented Features

OCaml 5.0 was an experimental release with many features unimplemented. In particular, OCaml 5.0 only supported the x86 and ARM backends. With impressive efforts from the community, the OCaml maintainers have restored support for All Tier 1 platforms including RISC-V, s390x and Power. Tarides helped implement or review the support for all of these backends. Tarides engineers also restored other important features such as GC mark loop pre-fetching and frame-pointer support for x86 backend. These features will be in OCaml 5.1.

We have also been working on restoring other big-ticket items such as compaction and statmemprof which were not implemented for OCaml 5.0. In OCaml, compaction is the only time when the runtime releases memory allocated for the heap back to the operating system. Many long-running programs have an initialisation phase where they use a lot of memory followed by a steady state phase where they operate for a long time with less memory. It is a common practice to call Gc.compact() after the initialisation phase so that the steady-state memory usage of the program remains low. Without compaction, the steady state will also use as much memory as the peak memory usage. This problem was reported by the Infer team at Meta (who were otherwise able to switch to OCaml 5 easily, thanks to our focus on backwards compatibility).

Tarides engineers have opened a PR for restoring compaction. The compaction feature is slated to be restored in OCaml 5.2. We have also been working on restoring statmemprof, the statistical memory profiler for OCaml. We are hoping to have a PR ready for this in the coming weeks.

Fixing Performance Regressions

OCaml 5 is a major rewrite of the runtime system and comes with a completely new allocator and a garbage collector (GC). As a result, some large OCaml projects such as Frama-C, Pyre, EasyCrypt, and Infer have reported performance regressions. We have been steadily fixing these issues and have not encountered any serious challenges here. Many of the fixes have been incorporated into 5.1, and we expect more performance fixes to land in 5.2. The very fact that large open-source projects can build and test their code on OCaml 5 is itself a testament to our careful backwards-compatible implementation of OCaml 5.

Allocator Performance

A potential source of performance regressions is the allocator. OCaml 5 uses a new parallelism-aware allocator written from scratch and different from the well-performing best-fit allocator available in OCaml since 4.10. Major industrial users of OCaml have reported that best-fit performs better than the earlier first-fit and next-fit allocators. In our benchmarking efforts, we observed that OCaml 5 allocator performs as well as the best-fit allocator, as both allocators utilise size-segmented pages for the allocator. But our benchmarks are admittedly much smaller than industrial OCaml workloads.

In order to derisk the transition to OCaml 5, we have backported the OCaml 5 allocator to OCaml 4 compiler. The backported allocator helps industrial users run their workloads in OCaml 4 with only the allocator changed, which helps identify any regressions. We are working with one of our customers to test the backported allocator on their internal workloads. We hope to identify regressions that only show up at scale and fix them for everyone using OCaml 5.

Continuously Benchmarking Compiler Quality

One of the goals of OCaml 5 is that, for sequential programs, the performance of those programs running on OCaml 5 is no worse than running on OCaml 4. Not only can the developers compile and run their existing sequential code in OCaml 5, but the expectation is that the performance is also similar. To this end, we have been doing nightly benchmarking of compiled code using Sandmark, a benchmarking service consisting of real-world, open-source OCaml programs. Sandmark monitors a multitude of performance parameters related to running time, memory usage, and GC latency.

The benchmarks and the related repository of OCaml packages are constructed in such a way that they can build with both OCaml 4 and OCaml 5. This lets the compiler developers quickly identify any regressions that may be introduced in OCaml 5 with respect to the same code compiled under OCaml 4. Tarides is working to turn this into a GitHub bot that will make it easier for compiler developers to trigger benchmarking runs on development branches.

Better Observability

Another strong reason to move to OCaml 5 from OCaml 4, even if you plan to remain sequential, is the better observability tools that come with OCaml 5. Starting from OCaml 5, the compiler supports a new feature named runtime events, which brings deep introspection capabilities for OCaml programs running in production. Runtime events add a series of probes to the OCaml program that emits data at specific events. This lets the consumers of these events produce interesting insights into the running programs. For example, Olly is a consumer that reports GC statistics including latency distribution. Olly can also produce traces of OCaml program runs visualising the GC behaviours.

An important aspect of runtime events is that the cost of the probes in the fast path (when the probes are not emitting data) is so low that it is available for every OCaml 5 program. In particular, you do not need to recompile your programs with special options to enable event collection. Hence, every OCaml 5 program can be introspected at runtime for interesting events using Olly.

By default, the only probes available are to do with GC events. OCaml 5.1 also brings in support for Custom events, where the user can describe new probes. It unlocks exciting possibilities for application-specific introspection. For example, Meio is a command-line tool that lets the user monitor the status of their application built using Eio, a new concurrency library built using OCaml 5 features, at a per fiber (lightweight task) granularity.

Taking Advantage of OCaml 5 Features

We anticipate two kinds of developers to take advantage of OCaml 5:

Those who want to use the new features in their existing code.
Those who want to write new code using the new features.

There is an increase of positive noise around OCaml recently, which may attract new developers and organisations to OCaml. However, given the millions of lines of existing OCaml code, our aim is to tackle (1) first. We hope that the experience of helping (1) succeed will inform what we should focus on for (2).

Primitive Features

It is important at this point to note that OCaml 5 brings in distinct features for native concurrency and parallelism support in OCaml. For concurrency, OCaml 5 adds effect handlers, and for parallelism, it adds domains to the language. These features are spartan by design, and our aim is to build expressive libraries on top of these features, which will live outside the compiler distribution. The OCaml manual pages on effect handlers and parallelism give a good overview of these primitive features. I also discuss the approach we've taken in retrofitting concurrency to OCaml in the ICFP 2022 Keynote.

Concurrency Libraries

Eio -- I/O Library

For asynchronous, non-blocking I/O, OCaml 4 has two industrial-strength libraries such as Lwt and Async. These libraries simulate concurrency using a monad. They are both very successful, and OCaml code that does asynchronous I/O uses one of these libraries. These libraries do have some downsides in that, due to the use of a monad, they don't produce useful backtraces, and OCaml's built-in exceptions cannot be used. The separation of synchronous and asynchronous code (function colours) and the lack of easy-to-use, higher-kinded polymorphism in OCaml means that one ends up with two versions of useful functions: one for monadic code and another for non-monadic code. This leads to code duplication such as the need to have a separate Lwt's list module. These libraries can continue to be used in OCaml 5, but given that these libraries are not parallelism-safe, one cannot write parallel code that takes advantage of them out of the box.

Eio is a new direct-style I/O library built using effect handlers. It avoids function colouring by using native stacks provided by effect handlers, unlike Lwt and Async which simulate it using a monad. Thanks to this, Eio produces faster code, supports built-in exceptions, produces good backtraces, and avoids code duplication. Eio also is built to be parallelism-safe. Eio provides a generic cross-platform API that can utilise optimised backends on different platforms such as io_uring on Linux.

One particular aspect that I would like to highlight is that Eio provides bridges for Async and Lwt so that existing code can be incrementally translated to Eio. This aspect is crucial for adoption, as we believe that it is impractical to translate a large Lwt or Async codebase over to Eio in one go. Tarides is currently working towards the goal of Eio 1.0, which we expect to be released by Q3 2023. If you are interested in using Eio, Tarides engineers are running a hands-on tutorial on porting Lwt applications over to Eio at ICFP 2023.

Saturn -- Parallel Data Structures

An essential component in the parallel programming toolkit is a library of parallel data structures. A sequential stack or queue data structure is fairly uncontroversial, and it is common to have only a single stack or queue implementation in the language. Indeed, we have a single stack and a single queue data structure in the OCaml standard library. The addition of parallelism brings an explosion of possibilities and challenges:

Correctness -- the addition of concurrency makes it much harder to reason about the correctness of the data structures.
Specialisation -- the performance of the data structure varies widely based on the number of parallel threads accessing the data structure. Hence, it is common to have specialised data structures that are optimised for a limited number of threads and capacity, such as single or multiple producers and consumers to bounded or unbounded queues.
Progress -- Should a pop operation on an empty queue block the caller or should it return immediately with a None? Both options are useful in different circumstances, but supporting one or the other will mean very different tradeoffs and hence, different implementations. Moreover, the non-blocking options are further classified in literature based on the progress in the presence of concurrent operations.
Composability -- In a typical parallel data structure each of the individual operations such as a push or a pop is atomic. What if our application demands that multiple operations be performed atomically? Putting a lock around the entire thing does not often work since it affects performance non-trivially and introduces correctness issues such as deadlocks. There are other mechanisms for well-behaved composition such as software transactional memory.

In other languages, this explosion in the state space often leads to a multitude of concurrency libraries, with overlapping features and different trade-offs, often not clearly labelled. Developers frequently face a challenge choosing the right library with the right trade-off. The correctness of the implementations is also often unclear.

At Tarides, we have been working towards Saturn, a library that brings together all of our efforts at building parallelism-safe libraries. Saturn will consist of lock-free and lock-based, blocking and non-blocking, composable and non-composable parallel data structures under one roof. Each of the different data structures will have a default version that is good enough to be used for parallelism and will have well-documented variants with clearly labelled tradeoffs.

Our composable atomic data structures are built over the kcas library which provides a software transactional memory (STM) on top of lock-free multi-word compare-and-swap (MCAS) primitive. While the kcas library implements MCAS in software efficiently, with the arrival of Power backend in OCaml 5, we plan to explore the promise to utilise hardware transactions for MCAS.

To ensure correctness, Saturn data structures are model-checked using dscheck, an experimental model checker for OCaml that cleverly exploits effect handlers to mock and control parallel scheduling. We also plan to continuously benchmark the data structure to monitor any performance regressions. We expect Saturn to be released in Q3 2023.

Domain-Local Await

With OCaml 5, there are several notions of concurrency:

Domains -- OS threads potentially running in parallel on different cores
Systhreads -- OS threads on a given domain that timeshare a domain
Fibers -- Lightweight, language-level threads implemented by the concurrency library. Each concurrency library may have its own scheduler.

This makes the task of writing blocking data structures, such as blocking channels, challenging because the blocking mechanism is specific to each notion of concurrency. Ideally, we would like to write blocking data structures that are parametric over the blocking mechanism so that we can describe blocking channels once and for all of the different notions of concurrency.

To this end, Tarides has been developing domain-local await (DLA), a scheduler-independent mechanism for blocking. The goal is that concurrency libraries provide the implementation of the DLA interface, and with this, they can use blocking data structures from Saturn. For example, with the implementation of a DLA interface in Eio, it is able to utilise blocking transactions in kcas. By separating out the blocking mechanism from the blocking data structures, different concurrency libraries such as eio and domainslib may communicate easily. At Tarides, we are exploring other scheduler-independent mechanisms for timeout and io.

Multicore Testing Tools

The task of moving a large OCaml codebase to take advantage of new OCaml 5 features may seem daunting. It is likely that none of the existing code was written with concurrency and parallelism in mind. Tarides has been working to empower software engineers with multicore testing tools in order to ease the process of using the new OCaml 5 features.

Thread Sanitizer

When parallelism is introduced in a code base, there is the risk of introducing data races. A data race is said to occur when there are two accesses to a memory location, with at least one of them being a write, and there is no synchronisation between the accesses. For example, the following program:

let r = ref 0
let d = Domain.spawn (fun _ -> r := 1)
let v = !r

has a data race, since the main domain and the newly spawned domain d race to access the reference r, and there is no synchronisation between the accesses.

As a pragmatic language, OCaml encourages the use of mutable state with primitive operations such as reference cells, mutable record fields, arrays, and standard library data structures such as hash tables, stacks, and queues with in-place modification. Thus, it is likely that the addition of parallelism to an OCaml code base will introduce data races.

In C++, the behaviour of a program with data races is undefined. In OCaml, the situation is much better. OCaml programs with data races have well-defined semantics. In particular, a program with data races will not violate type safety and will not crash. That said, the programs with data races may produce behaviours that cannot be explained only by the interleaving of operations from different threads. Hence, it is important that data races are detected and removed from the code base.

To this end, Tarides has developed Thread Sanitizer (TSan) support for OCaml. TSan is an approach developed by Google to locate data races originally for C++ code bases. It works by instrumenting executables to keep a history of previous memory accesses (at a certain performance cost) in order to detect data races, even when they have no visible effect on the execution. TSan instrumentation has been implemented in various compilers (GCC, Clang, as well as the Go and Swift compilers) and has proved very effective in detecting hundreds of concurrency bugs in large projects. Executables instrumented with TSan report data races without false positives. However, data races in code paths that are not visited will not be detected.

Tarides engineers have used TSan successfully to port large non-trivial code bases such as the work-in-prograss port of Irmin to Multicore. The response from the developers using TSan has been overwhelmingly positive. A particularly attractive feature of TSan in OCaml is the ease of use. The developer merely needs to install a different compiler switch with TSan enabled, and without any additional work, TSan reports data races with accurate backtraces for the conflicting accesses. A PR for adding TSan support for OCaml is currently open. TSan support for OCaml is likely to appear in OCaml 5.2.

Property-Based Testing

Data races are just one of the hazards of parallel programming. Even without data races, the program may produce different results across several runs due to non-determinism. How can the developers gain more confidence about the correctness of their implementations? To this end, we have been developing two property-based testing libraries namely Lin and STM.

In property-based testing, the programmer provides a specification about the program that should remain true and the system tests that the properties hold under a large number of different executions, typically randomly generated inputs. In the case of Lin and STM, the program is tested under different interleavings of domains. Lin tests whether the results obtained under parallel execution correspond to the same operations applied one after the other in a sequential execution. STM take a pure model description and compares the results to the actual results seen in a parallel execution.

Both libraries have been extremely effective in identifying issues in the standard library under parallel execution. The OCaml standard library was implemented without parallel execution in mind. While much of the standard library is not parallelism-safe, we do not expect parallel access to the standard library to crash. Lin and STM have been particularly successful in identifying crashes. We believe that Lin and STM will help OCaml 5 developers gain more confidence that their code is correct under parallel execution.

Call for Action

If you have an existing OCaml code base, please try OCaml 5 today. If you find regressions, please file an issue on the OCaml GitHub repo. If you are considering utilising the new OCaml 5 features, please give the concurrency libraries and the tools a go. We would love to hear whether the libraries and tools work for you. File issues in corresponding repos if you find anything that is amiss. If you are looking for commercial support on any of these topics, do not hesitate to contact us.

All of the work discussed in this post are open-source. If you wish to contribute to these efforts, please look for the "good first issue" tag in any of these repos. If you are looking to learn, please head over to the community section to ask us questions and share and discuss OCaml-related topics.

Happy hacking!