Deep Dive: Optimising Multicore OCaml for Windows

by Isabella Leandersson on Jul 10th, 2024

We love hosting internships. It is rewarding to facilitate someone’s first foray into the OCaml ecosystem and to help them establish what we hope will be a life-long foothold in the world of open-source programming. It is also a great opportunity to get new perspectives on existing issues. Fresh eyes can reinvigorate topics, highlighting different pain points and new solutions that benefit the entire community.

Sometimes, we also find ourselves just trying to keep up with our interns as they take off like rocket ships! Recently, we mentored a student who did just that. The initial goal of the internship was to investigate strange performance drops in the OCaml runtime that arose after the introduction of multicore support. These performance drops were most keenly felt on Windows machines, and the initial internship specification emphasised the need to improve the developer experience on that operating system.

Our intern @eutro went above and beyond anything we could have expected and tackled the project thoroughly and ambitiously. In this post, I will attempt to give you a comprehensive overview of this intricate project and the problems it tackled.

Get Busy Waiting?

Before OCaml 5, only one thread would run OCaml code at any given time. Users never had to worry about multiple threads trying to use a shared resource like the Garbage Collector (GC). In OCaml 5, however, the process is divided into several ‘threads’¹, and multiple threads regularly try to run parts of the GC simultaneously. The minor GC uses a Stop-The-World (STW) mechanism to run in parallel on all threads, whereas the major GC’s work is split into slices. These slices may run in parallel across threads and while the user’s program (also called the ‘mutator’) is making changes. This is one example of a situation where a mechanism is needed to stop multiple threads from making changes that contradict each other and result in unexpected behaviours.

Locks are the traditional way of doing this, whereby other threads are halted (locked out) while one thread finishes its work. However, in multicore programming, this method would be incredibly inefficient since there can be many activities in progress simultaneously. We would need to introduce so many locks for the different parts of memory that the locks themselves would become a memory and OS-resource problem!

The approach we use for OCaml 5 combines a Compare And Swap (CAS) operation with Busy-Wait loops. A CAS operation ensures that if two threads try to modify the same area of memory, only one will succeed. The one that fails will know it has failed and can then enter a period of Busy-Waiting (called SPIN_WAIT in the code). Busy-wait loops (also referred to as spins) describe a process that repeatedly ('busily') checks whether a condition is true. The process or task is only resumed once that condition is met.
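To make this concrete, here is a minimal sketch in C of what a CAS-plus-spin pattern looks like in general. It is not the runtime’s actual code; the shared flag, the function names, and the use of C11 atomics are all illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared flag: 0 = resource free, 1 = resource claimed. */
static _Atomic int claimed = 0;

/* Try to claim the resource with a single Compare And Swap:
   of two threads racing here, exactly one will succeed. */
static bool try_claim(void)
{
  int expected = 0;
  return atomic_compare_exchange_strong(&claimed, &expected, 1);
}

/* The thread that loses the race busy-waits ('spins'), repeatedly
   checking until the resource is released, then tries again. */
static void claim(void)
{
  while (!try_claim()) {
    while (atomic_load(&claimed) != 0) {
      /* spin: keep checking the condition we are waiting for */
    }
  }
}
```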

Sleeping Beauty

Busy-wait loops are used successfully in OCaml for many purposes, but they need to be carefully optimised. They are most appropriate in cases where we think that the required condition will be met quickly or in a reasonable period of time. If that’s not the case, then theoretically, the thread that is waiting will just keep spinning. If one allows busy-wait loops to spin indefinitely, they waste a lot of power and CPU time and can actually prevent the condition they are waiting for from being met. To avoid that happening, we can use a sleep function.

In order to implement spinning without wasting power, the loop checks the condition repeatedly, but after a while, it starts ‘sleeping’ between checks. Suppose a thread is waiting for condition C to come true, and it uses a Busy-Wait loop to check for this. The program spins a number of times, checking the condition, and then waits or goes to ‘sleep’ for a set amount of time – then it ‘wakes up’ and checks once more before (if it has to) going back to ‘sleep’ again. The period of ‘sleep’ increases each time. This cycle repeats itself until the condition C finally comes true.
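As a rough sketch (with made-up constants, not the runtime’s actual values), the spin-then-sleep pattern described above might look like this in C:

```c
#include <stdbool.h>
#include <time.h>

/* Placeholder for the condition C the thread is waiting on. */
extern bool condition_is_true(void);

#define SPIN_COUNT 1000         /* illustrative spin budget */
#define INITIAL_SLEEP_NS 10000  /* 10 microseconds, illustrative */

static void wait_for_condition(void)
{
  /* Phase 1: spin, hoping the condition comes true very soon. */
  for (int i = 0; i < SPIN_COUNT; i++) {
    if (condition_is_true()) return;
  }

  /* Phase 2: sleep between checks, growing the sleep each time. */
  long long sleep_ns = INITIAL_SLEEP_NS;
  while (!condition_is_true()) {
    struct timespec ts = { .tv_sec = (time_t)(sleep_ns / 1000000000LL),
                           .tv_nsec = (long)(sleep_ns % 1000000000LL) };
    nanosleep(&ts, NULL);       /* POSIX; precision varies by OS */
    sleep_ns += sleep_ns / 2;   /* e.g. grow the sleep by 50% each time */
  }
}
```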

This was how the process was supposed to work, yet, for some unknown reason, certain processes would occasionally take much longer than expected. The performance drop was worst on Windows machines.

Testing 1-2-3, Testing 1-2-3

The first order of business was to conduct a series of tests on the runtime, not only to discover the possible cause of the performance drops but also to establish a baseline of performance against which to measure any changes (and hopefully improvements!).

We knew that there was a performance problem and that it was particularly painful on Windows, but we didn’t know why. Even if we had a hunch as to what might be causing it, it was crucial to build up a realistic image of what was happening before we attempted a fix.

@eutro began this process by identifying where Busy-Wait loops were spinning in the runtime and for how long. She also wanted to know if there were places in the runtime where processes would get ‘stuck’ in Busy-Wait loops and not move on, and if so, where and why.

She used the OCaml test suite and measured how many SPIN_WAIT macros resolved successfully without needing to sleep and which ones did not. She discovered that in most cases, the spinning had the desired effect, and the process could continue after a reasonable amount of time once the condition it was waiting for was met. The garbage collector was also not experiencing any significant performance drops, so it could not be the cause of the problems on Windows. Instead, what she realised was that on Windows, sleeps cannot be shorter than one millisecond, and so the first sleeps that occur end up being much too long. This causes extended and unnecessary delays for processes running on Windows. Equipped with this realisation, @eutro got started on a solution: one that would be most helpful on Windows but would still benefit users on other operating systems.
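To see why the one-millisecond floor hurts, consider a rough sketch (not the runtime’s code) of how a ‘short’ sleep has to be implemented on each platform: POSIX systems can honour microsecond-scale requests via nanosleep, while the Windows Sleep API only accepts whole milliseconds, so a request for a few microseconds gets rounded up.

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

/* Sleep for roughly `ns` nanoseconds. */
static void short_sleep(int64_t ns)
{
#ifdef _WIN32
  /* Sleep() takes whole milliseconds, so a 10 microsecond request
     becomes at least 1 ms (and the real delay also depends on the
     system timer resolution). */
  DWORD ms = (DWORD)(ns / 1000000);
  if (ms == 0) ms = 1;
  Sleep(ms);
#else
  struct timespec ts = { .tv_sec = (time_t)(ns / 1000000000LL),
                         .tv_nsec = (long)(ns % 1000000000LL) };
  nanosleep(&ts, NULL);
#endif
}
```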

Barriers and Futexes, Oh My!

There are a few ways a thread in OCaml can wait for a condition:

  • First, we may be able to proceed very soon (nanoseconds), in which case we will spin until we can proceed.
  • Then, if spinning doesn’t let us proceed, we sleep a few times until the condition comes true. In most cases (read: not on Windows), the first few sleeps are still very quick, and we can proceed soon once the condition is met.
  • An alternative to sleeping ‘blindly’ like this is to tell the OS specifically what we are waiting for so that we can be woken up only when we know the condition is true. You can think of this as taking a ticket and waiting for your number to be called rather than repeatedly asking if you can be seen.

So what has changed? As things stood, only steps one and two were available: a series of increasingly long sleeps interleaved with checks. You would spin n times, then sleep for 10µs (‘µs’ is short for microseconds), then check the condition once more and perhaps sleep for 20µs, then 35µs, and so on. The point is that the time spent sleeping kept gradually increasing.

However, as @eutro discovered, in many cases the process took far too long to resume, even after the condition had come true. By the time such processes woke up from sleeping, they could have already proceeded if they had just ‘taken a ticket’ earlier and waited until they were notified. To improve performance, instead of repeatedly sleeping for longer and longer increments, we now use specialised ‘barriers’ to wait until we can proceed.

To solve the Windows problem, we now use the SPIN_WAIT macro only in cases where we don’t expect to ever need to sleep. In cases where that first sleep would cause a significant delay, we introduce a new SPIN_WAIT_NTIMES macro, which lets the process spin a set number of times before falling back to waiting on a barrier. @eutro used her previous benchmarks to determine which call sites could keep the SPIN_WAIT cycle as-is and which required the new SPIN_WAIT_NTIMES combined with a barrier.
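The names SPIN_WAIT and SPIN_WAIT_NTIMES come from the changes described above, but the sketch below only illustrates the shape of the idea; the real macros, the barrier primitive, and the spin budget in the runtime are different.

```c
#include <stdbool.h>

/* Hypothetical interfaces, for illustration only. */
extern bool condition_is_true(void);
extern void barrier_wait(void);  /* blocks until another thread signals */

#define NTIMES 100  /* illustrative spin budget */

static void wait_with_bounded_spinning(void)
{
  /* Spin a bounded number of times: if the condition comes true
     quickly, we never pay the cost of blocking. */
  for (int i = 0; i < NTIMES; i++) {
    if (condition_is_true()) return;
  }

  /* Otherwise, rather than sleeping blindly for ever-longer intervals,
     block on a barrier so that the signalling thread can wake us
     exactly when the condition becomes true. */
  while (!condition_is_true()) {
    barrier_wait();
  }
}
```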

But things didn’t stop there! @eutro could also optimise the type of barrier. Traditionally, we use condition variables to wake up threads waiting on a condition. However, they are unnecessarily resource-intensive: they require extra memory, and woken threads must acquire (and release) an associated lock before they can continue. A futex is a lower-level synchronisation primitive that can similarly be used to wake up threads, but without the added complexity of a condition variable.

@eutro added the use of futexes on the operating systems that permitted her to do so. Notably, macOS does not allow non-OS programs to use futexes, so we fall back to using condition variables there.
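For the curious, here is a much-simplified sketch of what waiting on a futex looks like, assuming the Linux futex system call; the runtime’s real wrappers are more involved, and on macOS the equivalent code keeps using pthread condition variables instead.

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>
#include <stdint.h>

/* A 32-bit word that waiters watch: 0 = not ready, 1 = ready. */
static uint32_t ready = 0;

/* Waiter: ask the kernel to put us to sleep until `ready` changes
   from 0. No mutex or condition variable is involved. */
static void wait_until_ready(void)
{
  while (__atomic_load_n(&ready, __ATOMIC_ACQUIRE) == 0) {
    syscall(SYS_futex, &ready, FUTEX_WAIT, 0, NULL, NULL, 0);
  }
}

/* Signaller: flip the word, then wake every thread sleeping on it. */
static void set_ready_and_wake_all(void)
{
  __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
  syscall(SYS_futex, &ready, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```

Because the kernel wakes waiters directly on the watched word, there is no lock to reacquire on wake-up, which is exactly the overhead that condition variables carry.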

By introducing the use of SPIN_WAIT_NTIMES, barriers, and futexes, @eutro implemented a number of optimisations that were applicable not only on Windows but on several operating systems. These optimisations save users time and processing power.

How Much Do You Bench(mark)?

While implementing these changes, @eutro ran a lot of tests. It was important to be thorough in order to ensure that her changes didn’t have unintended consequences. It is incredibly difficult to reason about how programs will react to a specific alteration, as there are many things happening in the runtime and several ways in which programs can interact.

She used the OCaml test suite again, this time to help her check that the OCaml runtime and other OCaml programs still functioned correctly. Having verified that they did, @eutro also ran several benchmarks to check that she hadn’t accidentally made anything slower. For this, she used the Sandmark benchmarking suite.

I recommend checking out the tests and benchmarks for yourself in the Pull Request. The PR also gives a more in-depth technical overview of the changes to the Busy-Waiting loops.

You Can Join Us Too!

It is great to see what someone with a passion for OCaml can bring to the system as a whole. I think it illustrates the benefits of open-source software development: when we invite fresh observations and suggestions, we create a community that supports innovation and collaboration. We are impressed with the hard work @eutro put into solving the complicated problem before her. She went above and beyond what we thought possible in such a short amount of time!

Would you like to complete an internship with us? We welcome people of varying experience levels – some interns have made open-source contributions before and are familiar with functional programming, and some of our interns have no functional programming experience at all! If you’re interested in developing your skills in a supportive environment, keep an eye on our careers page, where we post about any available internships. We also regularly communicate about available internships on X (formerly known as Twitter). We hope to hear from you!

Tarides champions open-source development. We create and maintain key features of the OCaml language in collaboration with the OCaml community. To learn more about how you can support our open-source work, discover our page on GitHub.

We are always happy to discuss commercial opportunities around OCaml. We provide core services, including training, tailor-made tools, and secure solutions. Contact us today to learn more about how Tarides can help your teams realise their vision.


  1. We are aware of the distinction between ‘threads’ and ‘domains’ in OCaml. We chose to use the former here, mainly to keep the content accessible for people who are less familiar with the subtleties of OCaml.