Kcas: Building a Lock-Free STM for OCaml (1/2)

In the past few months I've had the pleasure of working on the Kcas library. In this and a follow-up post, I will discuss the history and more recent development process of optimising Kcas and turning it into a proper Software Transactional Memory (STM) implementation for OCaml.

While this is not meant to serve as an introduction to programming with Kcas, along the way we will be looking at a few code snippets. To ensure that they are type correct — the best kind of correct^* — I'll use the MDX tool to test them. So, before we continue, let's require the libraries that we will be using:

# #require "kcas"
# open Kcas
# #require "kcas_data"
# open Kcas_data

All right, let us begin!

Origins

Contrary to popular belief, the name "Kcas" might not be an abbreviation of KC and Sadiq. Sadiq once joked "I like that we named the library after KC too." — two early contributors to the library. The Kcas library was originally developed for the purpose of implementing Reagents for OCaml and is an implementation of multi-word compare-and-set, often abbreviated as MCAS, CASN, or — wait for it — k-CAS.

But what is this multi-word compare-and-set?

Well, it is a tool for designing lock-free algorithms that allows atomic operations to be performed over multiple shared memory locations. Hardware traditionally only supports the ability to perform atomic operations on individual words, i.e. a single-word compare-and-set (CAS). Kcas basically extends that ability, through the use of intricate algorithms, so that it works over any number of words.

Suppose, for example, that we are implementing operations on doubly-linked circular lists. Instead of using a mutable field, ref, or Atomic.t, we'd use a shared memory location, or Loc.t, for the pointers in our node type:

type 'a node = {
  succ: 'a node Loc.t;
  pred: 'a node Loc.t;
  datum: 'a;
}

To remove a node safely we want to atomically update the succ and pred pointers of the predecessor and successor nodes and to also update the succ and pred pointers of a node to point to the node itself, so that removal becomes an idempotent operation. Using a multi-word compare-and-set one could implement the remove operation as follows:

let rec remove ?(backoff = Backoff.default) node =
  (* Read pointer to the predecessor node and... *)
  let pred = Loc.get node.pred in
  (* ..check whether the node has already been removed. *)
  if pred != node then
    let succ = Loc.get node.succ in
    let ok = Op.atomically [
      (* Update pointers in this node: *)
      Op.make_cas node.succ succ node;
      Op.make_cas node.pred pred node;
      (* Update pointers to this node: *)
      Op.make_cas pred.succ node succ;
      Op.make_cas succ.pred node pred;
    ] in
    if not ok then
      (* Someone modified the list around us, so backoff and retry. *)
      remove ~backoff:(Backoff.once backoff) node

The list given to Op.atomically contains the individual compare-and-set operations to perform. A single Op.make_cas loc expected desired operation specifies to compare the current value of a location with the expected value and, in case they are the same, set the value of the location to the desired value.

Programming like this is similar to programming with single-word compare-and-set except that the operation is extended to being able to work on multiple words. It does get the job done, but I feel it is fair to say that this is a low level tool only suitable for experts implementing lock-free algorithms.

Getting Curious

I became interested in working on the Kcas library after Bartosz Modelski asked me to review a couple of PRs to Kcas. As it happens, I had implemented the same k-CAS algorithm, based on the paper A Practical Multi-Word Compare-and-Swap Operation, a few years earlier in C++ as a hobby project. I had also considered implementing Reagents and had implemented a prototype library based on the Transactional Locking II (TL2) algorithm for software transactional memory (STM) in C++ as another hobby project. While reviewing the library, I could see some potential for improvements.

Fine Grained Competition

One of the issues in the Kcas Github repo mentioned a new paper on Efficient Multi-word Compare and Swap. It was easy to adapt the new algorithm, which can even be seen as a simplification of the previous algorithm, to OCaml. Compared to the previous algorithm, which took 3k+1 single word CAS operations per k-CAS, the new algorithm only took k+1 single word CAS operations and was much faster. This basically made k-CAS potentially competitive with fine grained locking approaches, that also tend to require roughly the equivalent of one CAS per word, used in many STM implementations.

Two Birds with One Stone

Both the original algorithm and the new algorithm require the locations being updated to be in some total order. Any ordering that is used consistently in all potentially overlapping operations would do, but the shared memory locations created by Kcas also include a unique integer id, which can be used for ordering locations. Initially Kcas required the user to sort the list of CAS operations. Later an internal sorting step, that was performed by default by essentially calling List.sort and taking linearithmic O(n*log(n)) time, was added to Kcas to make the interface less error prone. This works, but it is possible to do better. Back when I implemented a TL2 prototype in C++ as a hobby project, I had used a splay tree to record accesses of shared memory locations. Along with the new algorithm, I also changed Kcas to use a splay tree to store the operations internally. The splay tree was constructed from the list of operations given by the user and then the splay tree, instead of a list, would be traversed during the main algorithm.

You could ask what makes a splay tree interesting for this particular use case. Well, there are a number of reasons. First of all, the new algorithm requires allocating internal descriptors for each operation anyway, because those descriptors are essentially consumed by the algorithm. So, even when the sorting step would be skipped, an ordered data structure of descriptors would still need to be allocated. However, what makes a splay tree particularly interesting for this purpose is that, unlike most self-balancing trees, it can perform a sequence of n accesses in linear time O(n). This happens, for example, when the accesses are in either ascending or descending order. In those cases, as shown in the diagram below, the result is either a left or right leaning tree, respectively, much like a list.

resulting spray tree

This means that a performance conscious user could simply make sure to provide the locations in either order and the internal tree would be constructed in linear time and could then be traversed, also in linear time, in ascending order. For the general case a splay tree also guarantees the same linearithmic O(n*log(n)) time as sorting.

With some fast path optimisations for preordered sequences the splay tree construction was almost free and the flag to skip the by default sorting step could be removed without making performance worse.

Keeping a Journal

Having the splay tree also opened the possibility of implementing a higher level transactional interface.

But what is a transaction?

Well, a transaction in Kcas is essentially a function that records a log of accesses, i.e. reads and writes, to shared memory locations. When accessing a location for the first time, whether for reading or for writing, the value of that location is read and stored in the log. Then, instead of reading the location again or writing to it, the entry for the location is looked up from the log and any change is recorded in the entry. So, a transaction does not directly mutate shared memory locations. A transaction merely reads their initial values and records what the effects of the accesses would be.

Recall the example of how to remove a node from a doubly-linked circular list. Using the transactional interface of Kcas, we could write a transaction to remove a node as follows:

let remove ~xt node =
  (* Read pointers to the predecessor and successor nodes: *)
  let pred = Xt.get ~xt node.pred in
  let succ = Xt.get ~xt node.succ in
  (* Update pointers in this node: *)
  Xt.set ~xt node.succ node;
  Xt.set ~xt node.pred node;
  (* Update pointers to this node: *)
  Xt.set ~xt pred.succ succ;
  Xt.set ~xt succ.pred pred

The labeled argument, ~xt, refers to the transaction log. Transactional operations like get and set are then recorded in that log. To actually remove a node, we need to commit the transaction

Xt.commit { tx = remove node }

which repeatedly calls the transaction function, tx, to record a transaction log and attempts to atomically perform it until it succeeds.

Notice that remove is no longer recursive. It doesn't have to account for failure or perform a backoff. It is also not necessary to know or keep track of what the previous values of locations were. All of that is taken care of for us by the transaction log and the commit function. But, I digress.

Having the splay tree made the implementation of the transactional interface straightforward. Transactional operations would just use the splay tree to lookup and record accesses of shared memory locations. The commit function just calls the transaction with an empty splay tree and then passes the resulting tree to the internal k-CAS algorithm.

But why use a splay tree? One could suggest e.g. using a hash table for the transaction log. Accesses of individual locations would then be constant time. However, a hash table doesn't sort the entries, so we would need something more for that purpose. Another alternative would be to just use an unordered list or table and perhaps use something like a bloom filter to check whether a location has already been accessed as most accesses are likely to either target new locations or a recently used location. However, with k-CAS, it would still be necessary to sort the accesses later and, without some way to perform efficient lookups, worst case performance would be quadratic O(n²).

For the purpose of implementing a transaction log, rather than just for the purpose of sorting a list of operations, a splay tree also offers further advantages. A splay tree works a bit like a cache, making accesses to recently accessed elements faster. In particular, the pattern where a location is first read and then written

let decr_if_positive ~xt x =
  if Xt.get ~xt x > 0 then
    Xt.decr ~xt x

is optimised by the splay tree. The first access brings the location to the root of the tree. The second access is then guaranteed constant time.

Using a splay tree as the transaction log also allows the user to optimise transactions similarly to avoiding the cost of the linearithmic sorting step. A transaction over an array of locations, for example, can be performed in linear time simply by making sure that the locations are accessed in order.

Of course, none of this means that a splay tree is necessarily the best or the most efficient data structure to implement a transaction log. Far from it. But in OCaml, with fast memory allocations, it is probably difficult to do much better without additional runtime or compiler support.

Take a Number

One nice thing about transactions is that the user no longer has to write loops to perform them. With a primitive (multi-word) CAS one needs to have some strategy to deal with failures. If an operation fails, due to another CPU core having won the race to modify some location, it is generally not a good idea to just immediately retry. The problem with that is that there might be multiple CPU cores trying to access the same locations in parallel. Everyone always retrying at the same time potentially leads to quadratic O(n²) bus traffic to synchronise shared memory as every round of retries generates O(n) amount of bus traffic.

Suppose multiple CPU cores are all simultaneously running the following naïve lock-free algorithm to increment an atomic location:

let rec naive_incr atomic =
  let n = Atomic.get atomic in
  if not (Atomic.compare_and_set atomic n (n + 1)) then
    naive_incr atomic

All CPU cores read the value of the location and then attempt a compare-and-set. Only one of them can succeed on each round of attempts. But one might still reasonably ask: what makes this so expensive? Well, the problem comes from the way shared memory works. Basically, when a CPU core reads a location, the location will be stored in the cache of that core and will be marked as "shared" in the caches of all CPUs that have also read that location. On the other hand, when a CPU core writes to a location, the location will be marked as "modified" in the cache of that core and as "invalid" in the caches of all the other cores. Although a compare-and-set doesn't always logically write to memory, to ensure atomicity, the CPU acts as if it does. So, on each round through the algorithm, each core will, in turn, attempt to write to the location, which invalidates the location in the caches of all the other cores, and require them to read the location again. These invalidations and subsequent reads of the location tend to be very resource intensive.

In some lock-free algorithms it is possible to use auxiliary data structures to deal with contention scalably, but when the specifics of the use case are unknown, something more general is needed. Assume that, instead of all the cores retrying at the same time, the cores would somehow form a queue and attempt their operations one at a time. Each successful increment would still mean that the next core to attempt increment would have to expensively read the location, but since only one core makes the attempt, the amount of bus traffic would be linear O(n).

A clever way to form a kind of queue is to use randomised exponential backoff. A random delay or backoff is applied before retrying:

let rec incr_with_backoff ?(backoff = Backoff.default) atomic =
  let n = Atomic.get atomic in
  if not (Atomic.compare_and_set atomic n (n + 1)) then
    incr_with_backoff ~backoff:(Backoff.once backoff) atomic

If multiple parties are involved, this makes them retry in some random order. At first everyone retries relatively quickly and that can cause further failures. On each retry the maximum backoff is doubled, increasing the probability that retries are not performed at the same time. It might seem somewhat counterintuitive that waiting could improve performance, but this can greatly reduce the amount of synchronisation and improve performance.

The Kcas library already employed a backoff mechanism. Many operations used a backoff mechanism internally and allocated an object to hold the backoff configuration and state as the first thing. To reduce overheads and make the library more tunable, I redesigned the backoff mechanism to encode the configuration and state in a single integer so that no allocations are required. I also changed the operations to take the backoff as an optional argument so that users could potentially tune the backoff for specific cases, such as when a particular transaction should take priority and employ shorter backoffs, or the opposite.

Free From Obstructions

The new k-CAS algorithm was efficient, but it was limited to CAS operations that always wrote to shared memory locations. Interestingly, a CAS operation can also express a compare (CMP) operation — just use the same value as the expected and desired value, Op.make_cas loc expected expected.

One might wonder; what is the use of read-only operations? It is actually common for the majority of accesses to data structures to be read-only and even read-write operations of data structures often involve read-only accesses of particular locations. As explained in the paper Nonblocking k-compare-single-swap, to safely modify a singly-linked list typically requires not only atomically updating a pointer, but also ensuring that other pointers remain unmodified.

The problem with using a read-write CAS to express a read-only CMP is that, due to the synchronisation requirements, writes to shared memory are much more expensive than reads. Writes to a single location cannot proceed in parallel. Multiple cores trying to "read" a location in memory using read-write CASes would basically cause similar expensive bus traffic, or cache line ping-pong, as with the previously described naïve increment operation — without even attempting to logically write to memory.

To address this problem I extended the new lock-free k-CAS algorithm to a brand new obstruction-free k-CAS-n-CMP algorithm that allows one to perform a combination of read-write CAS and read-only CMP operations. The extension to k-CAS-n-CMP is a rather trivial addition to the k-CAS algorithm. The gist of the k-CAS-n-CMP algorithm is to perform an additional step to validate all the read-only CMP accesses before committing the changes. This sort of validation step is a fairly common approach in non-blocking algorithms.

The obstruction-free k-CAS-n-CMP algorithm also retains the lock-free k-CAS algorithm as a subset. In cases where only CAS operations are performed, the k-CAS-n-CMP algorithm does the exact same thing as the k-CAS algorithm. This allows a transaction mechanism based on the k-CAS-n-CMP algorithm to easily switch to using only CAS operations to guarantee lock-free behavior. The difference between an obstruction-free and a lock-free algorithm is that a lock-free algorithm guarantees that at least one thread will be able to make progress. With the obstruction-free validation step it is possible for two or more threads to enter a livelock situation, where they repeatedly and indefinitely fail during the validation step. By switching to lock-free mode, after detecting a validation failure, it is possible to avoid such livelocks.

Giving Monads a Pass

The original transactional API to k-CAS actually used monadic combinators. Gabriel Scherer suggested the alternative API based on passing a mutable transaction log explicitly that we've already used in the examples. This has the main advantage that such an API can be easily used with all the existing control flow structures of OCaml, such as if then else and for to do as well as higher-order functions like List.iter, that would need to be encoded with combinators in the monadic API.

On the other hand, a monadic API provides a very strict abstraction barrier against misuse as it can keep users from accessing the transaction log directly. The transaction log itself is not thread safe and should not be accessed or reused after it has been consumed by the main k-CAS-n-CMP algorithm. Fortunately there is a way to make such misuse much more difficult as described in the paper Lazy Functional State Threads by employing higher-rank polymorphism. By adding a type variable to the type of the transaction log

type 'x t

and requiring a transaction to be universally quantified

type 'a tx = {
  tx : 'x. xt:'x t -> 'a;
}

with respect to the transaction log, the type system prevents a transaction log from being reused:

# let futile x =
    let log = ref None in
    let tx ~xt =
      match !log with
      | None ->
        log := Some xt;
        raise Retry.Later
      | Some xt ->
        Xt.get ~xt x in
    Xt.commit { tx }
Line 10, characters 17-19:
Error: This field value has type xt:'a Xt.t -> 'b which is less general than
         'x. xt:'x Xt.t -> 'c

It is still possible to e.g. create a closure that refers to a transaction log after it has been consumed, but that requires effort from the programmer and should be unlikely to happen by accident.

The explicit transaction log passing API proved to work well and the original monadic transaction API was then later removed from the Kcas library to avoid duplicating effort.

Division of Labour

When was the last time you implemented a non-trivial data structure or algorithm from scratch? For most professionals the answer might be along the lines of "when I took my data structures course at the university" or "when I interviewed for the software engineering position at Big Co".

Kcas aims to be usable both

for experts implementing correct and performant lock-free data structures, and
for everyone gluing together programs using such data structures.

Implementing lock-free data structures, even with the help of k-CAS-n-CMP, is not something everyone should be doing every time they are writing concurrent programs. Instead programmers should be able to just reuse carefully constructed data structures.

As an example, consider the implementation of a least-recently-used (LRU) cache or a bounded associative map. A simple sequential approach to implement a LRU cache is to use a hash table and a doubly-linked list and keep track of the amount of space in the cache:

type ('k, 'v) cache =
  { space: int Loc.t;
    table: ('k, 'k Dllist.node * 'v) Hashtbl.t;
    order: 'k Dllist.t }

On a cache lookup the doubly-linked list node corresponding to the accessed key is moved to the left end of the list:

let get_opt {table; order; _} key =
  Hashtbl.find_opt table key
  |> Option.map @@ fun (node, datum) ->
     Dllist.move_l node order; datum

On a cache update, in case of overflow, the association corresponding to the node on the right end of the list is dropped:

let set {table; order; space; _} key datum =
  let node =
    match Hashtbl.find_opt table key with
    | None ->
      if 0 = Loc.update space (fun n -> max 0 (n-1))
      then Dllist.take_opt_r order
           |> Option.iter (Hashtbl.remove table);
      Dllist.add_l key order
    | Some (node, _) -> Dllist.move_l node order; node
  in
  Hashtbl.replace table key (node, datum)

Sequential algorithms such as the above are so common that one does not even think about them. Unfortunately, in a concurrent setting the above doesn't work even if the individual operations on lists and hash tables were atomic.

As it happens, the individual operations used above are actually atomic, because they come from the kcas_data package. The kcas_data package provides lock-free and parallelism safe implementations of various data structures.

But how would one make the operations on a cache atomic as a whole? As explained by Maurice Herlihy in one of his talks on Transactional Memory adding locks to protect the atomicity of the operation is far from trivial.

Fortunately, rather than having to e.g. wrap the cache implementation behind a mutex and make another individually atomic yet uncomposable data structure, or having to learn a completely different programming model and rewrite the cache implementation, we can use the transactional programming model provided by the Kcas library and the transactional data structures provided by the kcas_data package to trivially convert the previous implementation to a lock-free composable transactional data structure.

To make it so, we simply use transactional versions, *.Xt.*, of operations on the data structures and explicitly pass a transaction log, ~xt, to the operations. For the get_opt operation we end up with

let get_opt ~xt {table; order; _} key =
  Hashtbl.Xt.find_opt ~xt table key
  |> Option.map @@ fun (node, datum) ->
     Dllist.Xt.move_l ~xt node order; datum

and the set operation is just as easy to convert to a transactional version. One way to think about transactions is that they give us back the ability to compose programs such as the above. But, I digress, again.

It was not immediately clear whether Kcas would be efficient enough. A simple node based queue, for example, seemed to be significantly slower than an implementation of the Michael-Scott queue using atomics. How so? The reason is fundamentally very simple. Every shared memory location takes more words of memory, every update allocates more, and the transaction log also allocates memory. All the extra words of memory need to be written to by the CPU and this invariably takes some time and slows things down.

For the implementation of high-performance data structures it is important to offer ways, such as the ability to take advantage of the specifics of the transaction log, to help ensure good performance. A common lock-free algorithm design technique is to publish the desire to perform an operation so that other parties accessing the same data structure can help to complete the operation. With some care and ability to check whether a location has already been accessed within a transaction it is possible to implement such algorithms also with Kcas.

Using such low level lock-free techniques, it was possible to implement a queue using three stacks:

type 'a t = {
  front : 'a list Loc.t;
  middle : 'a list Loc.t;
  back : 'a list Loc.t;
}

The front stack is reversed so that, most of the time, to take an element from the queue simply requires popping the top element from the stack. Similarly to add an element to the queue just requires pushing the element to the top of the back stack. The difficult case is when the front becomes empty and it is necessary to move elements from the back to the front.

The third stack acts as a temporary location for publishing the intent to reverse it to the front of the queue. The operation to move the back stack to the middle can be done outside of the transaction, as long as the back and the middle have not yet been accessed within the transaction.

The three-stack queue turned out to perform well — better, for example, than some non-compositional lock-free queue implementations. While Kcas adds overhead, it also makes it easier to use more sophisticated data structures and algorithms. Use of the middle stack, for example, requires atomically updating multiple locations. With plain single-word atomics that is non-trivial.

Similar techniques also allowed the Hashtbl implementation to perform various operations on the whole hash table in ways that avoid otherwise likely starvation issues with large transactions.

Intermission

This concludes the first part of this two part post. In the next part we will continue our discussion on the development of Kcas, starting with the addition of a fundamentally new feature to Kcas which turns it into a proper STM implementation.