Multicore Testing Tools: DSCheck Pt 2

by Carine Morel and Isabella Leandersson on Apr 10th, 2024

Welcome to part two! If you haven't already, check out part one, where we introduce DSCheck and share one of its uses in a naive counter implementation. This post will give you a behind-the-scenes look at how DSCheck works its magic, including the theory behind it and how to write a test for our naive counter implementation example. We’ll conclude by going a bit further, showing you how DSCheck can be used to check otherwise hard-to-prove properties in the Saturn library.

How Does DSCheck Work?

Developers use DSCheck to catch non-deterministic, hard-to-reproduce bugs in their multithreaded programs. DSCheck does so by ensuring that all the executions possible on the multiple cores (called interleavings) are valid and do not result in faults. Doing this without a designated tool like DSCheck would be incredibly resource-intensive.

In Theory

DSCheck operates by simulating parallelism on a single core, which is possible thanks to algebraic effects and a custom scheduler. DSCheck doesn't actually exhaustively 'check' all interleavings but examines a select number of relevant ones that allow it to ensure that all terminal states are valid and that no edge cases have been missed.

You may reasonably be asking yourself how this works. Well, the emergence of Dynamic Partial-Order Reduction (DPOR) methods have made DSCheck-like model checkers possible. The DPOR approach to model-checking stems from observations of real-world programs, where many interleavings are equivalent. If at least one of them is covered, so is its entire equivalent class – which is called a trace. DSCheck, therefore, checks one interleaving per trace, which lets it ensure that the whole equivalent trace is without faults.

Let's illustrate this with a straightforward example:

let a = Atomic.make 0 in
let b = Atomic.make 0 in
(* Domain A *)
Domain.spawn (fun () ->
    Atomic.set a 1; (* A1 *)
    Atomic.set b 1; (* A2 *)
    ) |> ignore;
(* Domain B *)
Domain.spawn (fun () ->
    Atomic.set a 2; (* B *)
    ) |> ignore

There are three possible interleavings: A1.A2.B, A1.B.A2, and B.A1.A2. The ordering between B and the second step of the first domain, A2, does not matter as it does not affect the same variable. Thus, the execution sequences A1.A2.B and A1.B.A2 are different interleavings of the same trace, which means that if at least one is covered, so is the other.

DPOR skips the redundant execution sequences and provides an exponential performance improvement over an exhaustive (also called naive) search. Since naive model checkers try to explore every single interleaving, and since interleavings grow exponentially with the size of code, there quickly comes a point where the number of interleavings far exceeds what the model checker can cover in a reasonable amount of time. This approach is inefficient to the degree that the only programs a naive model checker can check are so simple that it's almost useless to do so.

By reducing the amount of interleavings that need to be checked, we have significantly expanded the universe of checkable programs. DSCheck has reached a point where its performance is strong enough to test relatively long code, and most significantly, we can use it for data structure implementation.

In addition to the DPOR approach, some conditions must be met for the validations that DSCheck performs to be sound. These conditions include:

Determinism: DSCheck runs the same program multiple times (once per explored interleaving). There should be no randomness in between these executions, meaning that they should all run with the same seed, since otherwise DSCheck may miss some traces and thus miss bugs.
Data-Races: The program being tested cannot have data races between non-atomic variables, as DSCheck does not see such different behaviours. You should use TSan before running DSCheck to remove data races.
Atomics: Domains can only communicate through atomic variables. Validation, including higher-level synchronisation primitives (like mutexes), has not yet been implemented.
Lock-Free: Programs being tested have to be at least lock-free. Lock-free programs are programs that have multiple threads that share access memory, where none of the threads can block each other. If a program has a thread that cannot finish independently, DSCheck will explore its transitions ad infinitum. To partially mitigate this problem, we can force tests to be lock-free. For example, we can modify a spinlock to explicitly fail once it reaches an artificial limit.

In Practice

Let's look at how a DSCheck test can catch a bug in a naive counter implementation. To see how we set up the naive counter implementation in this example, have a look at part one of our two-part DSCheck series.

This is how to write a test for the previous naive counter module we introduced in part one:

module Atomic = Dscheck.TracedAtomic
(* The test needs to use DSCheck's atomic module *)

let test_counter () =
  Atomic.trace (fun () ->
      let counter = Counter.create () in
      Atomic.spawn (fun () -> Counter.incr counter);
      (* [Atomic.spawn] is the DSCheck function to simulate [Domain.spawn] *)
      Atomic.spawn (fun () -> Counter.incr counter);
      (* There is no need to join domains as DSCheck does not actually spawn domains. *)
      Atomic.final (fun () -> Atomic.check (fun () -> Counter.get counter == 2)))

As you can tell, the test is very similar to the main function we wrote previously to check our counter manually, but now it uses DSCheck's interface. This includes:

Shadowing the atomic module with DSCheck's TracedAtomic, which adds the algebraic effects we need to compute the interleavings
Atomic.trace takes the code for which we want to test its interleavings as an input.
Atomic.spawn simulates Domain.spawn.

In this case, DSCheck will return the following output:

Found assertion violation at run 2:
sequence 2
----------------------------------------
P0 P1
----------------------------------------
start
get a
                        start
                        get a
set a
                        set a
----------------------------------------

This output reveals the buggy interleaving with one column per domain (P0 and P1). We need to infer ourselves that a means counter here, but once we know that this is pretty straightforward to read, isn't it?

Case Study: Saturn

Let's take a closer look at using DSCheck with Saturn. Offering industrial-strength, well-tested data structures for OCaml 5, the library makes it easier for Multicore users to find data structures that fit their needs. If you use a data structure from Saturn, you can be sure it has been tested to perform well with Multicore usage. In Saturn, DSCheck has two primary uses: firstly, the one demonstrated above, i.e. catching interleavings that return buggy results; secondly, we use it to detect blocking situations.

Most data structures available through Saturn need to be lock-free. As part of the lock-free property’s definition, the structure also needs to be obstruction-free, which technically means that a domain running in isolation can always make progress or be free of blocking situations. So, if all domains bar one are paused partway through their execution, the one still working can finish without issue or being blocked. The most common blocking situation is due to a lock; if one domain acquires a lock, all the other domains must wait until the first one has released the lock to proceed.

Let's take a look at how DSCheck tests for blocking situations:

Here is an example of a code that is not obstruction-free. This is a straightforward implementation of a barrier for two domains. Both domains need to increment it to pass the whole loop.

  let barrier = Atomic.make 0 in
  let work id () =
    print_endline ("Hello world, I'm "^ id);
    Atomic.incr barrier;
    while Atomic.get barrier < 2 do
      ()
    done
  in
  let domainA = Domain.spawn (work "A") in
  let domainB = Domain.spawn (work "B") in
  Domain.join domainA;
  Domain.join domain

In this example, if B is paused by the operating system after printing "Hello world, I'm B", then A can not progress past the barrier even though it is the only domain currently working. This code is thus not obstruction-free.

If we run this code through DSCheck, here is one interleaving it will explore.

Step	Domain A	Domain B	Barrier
1		prints "Hello world, I'm B!"	0
2	prints "Hello world, I'm A!"		0
3	Increases `barrier`		`
4	Reads `barrier` and loops		1
5		Increases `barrier`	2
6		Reads `barrier` and passes the loop	2
7	Reads `barrier` and passes the loop		2

In this interleaving, domain A only performs the loop once, waiting for domain B to increase the barrier. However, nothing prevents A from looping forever here if B never takes step 5. In other words, step 4 can be repeated one, two, three… an infinite number of times, creating a new trace (i.e. a new interleaving that is not equivalent to the previous one) each time. As DSCheck will try to explore every possible trace (i.e. each equivalent class of interleavings), the test will never finish. We can determine that a test is not going to finish by noting how the explored interleavings keep growing in size. In this case, they will look like B-A-A-A-B-B-A, then B-A-A-A-A-B-B-A, then B-A-A-A-A-A-B-B-A and so on. When this scenario occurs, we can conclude that our code is not obstruction-free.

To summarise, if DSCheck runs on some accidentally blocking code, it will not be able to terminate its execution as it will have infinite traces to explore. This is one way to determine if your tested code is obstruction-free, a property that is otherwise hard to prove, and a handy test for Saturn as it's a property that most of its data structures are supposed to have. It is important to note that we have simplified our example for this post, and in practice DSCheck also checks for lock-freedom.

Want More Info?

We invite you to discover more about DSCheck and the features that come with the multicore testing suite. You can read about the library on GitHub, including more examples and performance optimisations since its initial release.

We also previously published a blog post about Multicore testing tools that includes a section on DSCheck. It provides additional helpful context to the set of tools that surround DSCheck.

Until Next Time

We hope you enjoyed this sojourn into the realm of DSCheck! If you have any questions or comments, please reach out to us on X or LinkedIN.

You can also contact us with questions or feedback and sign up for our newsletter to stay up-to-date with what we're doing.