Multicore Testing Tools: DSCheck Pt 1

by Carine Morel, Isabella Leandersson on Feb 14th, 2024

Reaping the plentiful benefits of parallel programming requires the careful management of the intricacies that come with it. Tarides played a significant part in making OCaml Multicore a reality, and we have continued to work on supporting tools that make parallel programming in OCaml as seamless as possible.

To that end, the OCaml memory model is carefully designed to help developers reason about their programs, and OCaml 5 introduced several guarantees to make multi-threaded programming safer and more predictable. Tarides also recently brought ThreadSanitizer support to OCaml, which lets users check their code for possible data races.

This post introduces the DSCheck library, a model checker written in OCaml. It helps developers catch non-deterministic, hard-to-reproduce bugs in parallel programs. Read on to discover why and how we use DSCheck to thoroughly test multi-threaded code before deploying it!

Why Use DSCheck in the First Place?

Formally, concurrent programming means that an application is making progress on more than one task at the same time. In addition, parallel programming allows for more than one concurrent process to happen simultaneously. When several concurrent processes share a resource in parallel, the complexity and the possibility of bugs increases by several degrees.

This is why developers hear the terms deadlock, starvation, and data races so often when they learn about multicore programming or concurrence. These terms are all different ways of saying the same thing: multicore programming is hard!

Enter DSCheck and other multicore testing tools! DSCheck is a testing tool with a particular use case: it is used for algorithms that do not spawn domains themselves, are lock-free, and use atomics. Developers can use lock-free multicore programming to significantly boost performance, but they need to check that all the executions possible on the multiple cores (called interleavings) are valid to assert that their program is without faults. Manually, this would be a gargantuan task but with DSCheck, it's made easy!

When To Use DSCheck: A Naive Counter Implementation

Let’s look at a practical example of when to use DSCheck. For instance, if we choose to implement a naive counter in OCaml Multicore, it might look something like this:

module Counter = struct 
	type t = int Atomic.t 
	let create () = Atomic.make 0
	let incr counter = 
		let old_value = Atomic.get counter in 
		Atomic.set counter (old_value + 1)
	let get counter = Atomic.get counter 

Now, say we increment the counter on 2 domains in parallel. If our implementation is correct, the counter should be at 2 at the end.

let main () =
	let counter = Counter.create () in
	let domainA = Domain.spawn (fun () -> Counter.incr counter) in
	let domainB = Domain.spawn (fun () -> Counter.incr counter) in 
	Domain.join domainA;
	Domain.join domainB;
	assert (Counter.get counter = 2)

However, there are several ways in which the incrementation can go wrong1 and cause the counter to actually hold the value 1 by the end of the execution. To figure out what has gone wrong, we need to unfold all the possible interleavings in which the domains can execute their accesses to their shared values – here only counter. Both domains perform two accesses to the counter: first a read (get) then a write (set).

Since OCaml’s atomics guarantee that program order is kept between atomic operations on a single domain, we can not reorder the operations made by the same domain.

However, there are still a few possible interleavings. Domain B could for example only begin working after A is done. This gives us interleaving 1:

Step Domain A Domain B Counter
1 Atomic.get counter 0
2 Atomic.set counter (0+1) 1
3 Atomic.get counter 1
4 Atomic.set counter (1+1) 2

Or they could do the same in reverse order (interleaving 2):

Step Domain A Domain B Counter
1 Atomic.get counter 0
2 Atomic.set counter (0+1) 1
3 Atomic.get counter 1
4 Atomic.set counter (1+1) 2

Or they could actually interleave their actions, which results in interleaving 3 (and interleaving 4 by permuting A and B)

Step Domain A Domain B Counter
1 Atomic.get counter 0
2 Atomic.get counter 0
3 Atomic.set counter (0+1) 1
4 Atomic.set counter (0+1) 1

Interleavings 1 and 2 work fine as the counter ends up with a value of 2. A problem arises in interleavings 3 and 4, since both A and B witness the counter with a 0 value thus causing the counter to end up with a value of 1.

This is obviously quite a simple example (and yes, you can totally avoid this bug by using Atomic.incr), but what it shows is that even with just two domains doing the same thing, composed of only two lines of code, we end up with four different interleavings. This is why we need DSCheck!

DSCheck is a model checker. Its job is to find all possible interleavings (what we just did manually) and test that every single one returns the expected result. If not, DSCheck returns the first interleaving it finds that is not working, and lets you do the work of debugging the code now that you know where and what the problem is. We will show you how we write a test for a naive counter implementation using DSCheck in part 2, so keep a look out for that.

What is DSCheck Currently Used For?

We have used DSCheck in several of the libraries and tools we maintain to verify certain aspects of them. For example, in the Saturn library of parallelism-safe data structures for OCaml, DSCheck is used to verify both the lock-freedom and safety properties of the structures. The effects-based IO stack Eio also has its internal lock-free data structures verified with DSCheck. In short, DSCheck lets you build complex Multicore data structures with confidence.

Part Two

We'll end this part here now that you have an idea of what DSCheck is and when it is used. Next time, we will look at how DSCheck works in greater detail, including how we use it in Saturn. Check out our blog for part two, and connect with us on X and LinkedIn to stay up-to-date with what we're up to. See you soon!

  1. Note that the probability of this bug happening is low and may be hard to witness. You will most likely need to add a barrier to synchronise the domains.