How TSan Makes OCaml Better: Data Races Caught and Fixed

by Olivier Nicole, Fabrice Buoro, and Isabella Leandersson on Aug 21st, 2024

Parallel programming opens up brand-new possibilities. Using multiple cores means that users can benefit from powerful OCaml features (like formal proofs and high security) while enjoying greater performance, enabling them to improve their services or projects.

However, introducing such a significant change to the OCaml ecosystem would not be practical without providing tools that help users ensure memory safety in parallel programming. This is where multicore tests come in, as does ThreadSanitizer (TSan) support for OCaml. We have published two previous posts on TSan, an overview of the tool and an introduction to the danger of data races in general. This post will give you a behind-the-scenes look at how we have used TSan to find and fix races in the OCaml runtime. Official support for TSan arrived with OCaml 5.2, but as you can tell from the repository, we have been using TSan internally for a while now.

Catching Races in OCaml

As a result of TSan coming with the OCaml 5.2 update, several bug fixes in the same update addressed data races. A data race occurs when two accesses are made to the same memory location; at least one is a write, and no order is enforced between them. Data race bugs can be hard to spot, but since they can result in unexpected behaviours, they are a high-priority item to fix.

It is important to note that not all data races are equal. In OCaml, the memory model guarantees that memory safety is preserved even when data races occur. This makes data races in OCaml much 'safer' than in many other languages, where races can impact memory safety. OCaml programs also require support from the OCaml runtime which provides low-level operations such as memory allocation and garbage collection. The OCaml runtime is written in C and must be data race-free according to the C memory model since a data race in the runtime would impact the validity of the whole program. While TSan has been integrated to detect data races in OCaml code, it has also proven invaluable in detecting errors in OCaml's runtime, many of which were subsequently fixed in 5.2.

What has TSan Caught so Far?

Ecosystem contributors make continuous efforts towards maintenance, and as part of this work, use TSan to check the OCaml runtime. To further simplify TSan usage we have added a TSan Continuous Integration (CI) run that executes the OCaml test suite in a TSan-enabled switch, automatically detecting inadvertently introduced data races in the runtime. This has allowed us to catch and fix several data races. Some of these include:

  • Fixing a Race in the Minor GC: PRs #12595 describes a race condition occurring when caml_collect_gc_stats_sample made calls to domain_terminate, and #12597 outlines the fix implemented in 5.2. In this data race, the internal garbage collector data could cause the program to report incorrect garbage collection statistics.

  • Data Race Between Marking and Sweeping Garbage Collector Phases: PR #12934 fixes a race between marking and sweeping functions caught by TSan. When the garbage collector marks a value for collection, sweeping code (a later phase of the GC, effectively marking unreachable values as free space) in another thread may read the value simultaneously. This is normal behaviour, but the memory read was not marked as an atomic operation as it should have been, introducing a risk of undefined behaviour.

  • Data Race on Global Pools Arrays: PR #12755 addresses races on global_avail_pools and global_full_pools members of the struct pool_freelist in shared_heap.c.
    For performance reasons, OCaml's runtime allocates major heap memory in chunks stored in pools that can be recycled across domains. While the algorithm for accessing them in parallel was correct, atomic qualifiers and explicitly qualified memory operations were missing, causing TSan to report unsynchronised memory accesses. The PR adds the necessary qualifiers.

  • Data Races in minor_gc.c: This PR #12737 fixes two races. One in the minor GC occurring when promoting values in the remembered set, and one in the Dynlink library happening due to an incorrect C function trying to access an OCaml value even when the garbage collector might be running.

  • Data Race fix for #12799: PR #12851 fixes a bug described in issue #12799. When a domain terminates, it emits a runtime event – a piece of information that can be monitored using a dedicated API to debug or profile the performance of OCaml programs. However, under certain circumstances, the emission of this event could race with the shutdown of the runtime events system itself, possibly leading to incorrect information being emitted or, worse, memory corruption. Proper synchronisation has been implemented to ensure this doesn't happen.

  • Data Race When Using the Debug Runtime: PR #12969 resolves a data race involving caml_scan_stack and caml_free_stack. It was possible for two domains to perform a garbage collector marking on the same fibre, and in very rare cases when that fibre was terminating. This caused TSan to report a data race. While the code was correct in practice, we fixed the access to make it correct according to the C11 memory model, thus avoiding undefined behaviour.

Until Next Time!

Seeing a tool we have developed so quickly benefit the larger ecosystem is excellent. TSan helps developers test their parallel programs for potential risks they would have difficulty discovering. We look forward to seeing users put TSan to the test and share the results!

The OCaml community welcomes contributions and feedback and invites users to share any issues in the OCaml GitHub repo. The discussion forum OCaml Discuss is another place to share your thoughts and get input from others in the community.

Would you like to stay up-to-date with us? Follow us on X (formerly known as Twitter) and LinkedIn to see regular posts on our projects, announcements, tutorials, and more.

Tarides champions open-source development. We create and maintain key features of the OCaml language in collaboration with the OCaml community. To learn more about how you can support our open-source work, discover our page on GitHub.

We are always happy to discuss commercial opportunities around OCaml. We provide core services, including training, tailor-made tools, and secure solutions. Contact us today to learn more about how Tarides can help your teams realise