Kcas: Building a Lock-Free STM for OCaml (2/2)

by Vesa Karvonen on Aug 10th, 2023

This is the follow-up post continuing the discussion of the development of Kcas. Part 1 discussed the development done on the library to improve performance and add a transaction mechanism that makes it easy to compose atomic operations without really adding more expressive power.

In this part we'll discuss adding a fundamentally new feature to Kcas that makes it into a proper STM implementation.

Get Busy Waiting

If shared memory locations and transactions over them essentially replace traditional mutexes, then one might ask what replaces condition variables. It is very common in concurrent programming for threads to not just want to avoid stepping on each other's toes, or the I of ACID, but to actually prefer to follow in each other's footsteps. Or, to put it more technically, wait for events triggered or data provided by other threads.

Following the approach introduced in the paper Composable Memory Transactions, I implemented a retry mechanism that allows a transaction to essentially wait on arbitrary conditions over the state of shared memory locations. A transaction may simply raise an exception, Retry.Later, to signal to the commit mechanism that a transaction should only be retried after another thread has made changes to the shared memory locations examined by the transaction.

A trivial example would be to convert a non-blocking take on a queue to a blocking operation:

let take_blocking ~xt queue =
  match Queue.Xt.take_opt ~xt queue with
  | None -> Retry.later ()
  | Some elem -> elem

Of course, the Queue provided by kcas_data already has a blocking take which essentially results in the above implementation.

Perhaps the main technical challenge in implementing a retry mechanism in multicore OCaml is that it should perform blocking in a scheduler friendly manner such that other fibers, as in Eio, or tasks, as in Domainslib, are not prevented from running on the domain while one of them is blocked. The difficulty with that is that each scheduler potentially has its own way for suspending a fiber or waiting for a task.

To solve this problem such that we can provide an updated and convenient blocking experience, we introduced a library that provides a domain-local-await mechanism, whose interface is inspired by Arthur Wendling's proposal for the Saturn library. The idea is simple. Schedulers like Eio and Domainslib install their own implementation of the blocking mechanism, stored in a domain local variable, and then libraries like Kcas can obtain the mechanism to block in a scheduler friendly manner. This allows blocking abstractions to not only work on one specific scheduler, but also allows blocking abstractions to work across different schedulers.

Another challenge is the desire to support both conjunctive and disjunctive combinations of transactions. As explained in the paper Composable Memory Transactions, this in turn requires support for nested transactions. Consider the following attempt at a conditional blocking take from a queue:

let non_nestable_take_if ~xt predicate queue =
  let x = Queue.Xt.take_blocking ~xt queue in
  if not (predicate x) then
    Retry.later ();
  x

If one were to try to use the above to take an element from the first of two queues

Xt.first [
  non_nestable_take_if predicate queue_a;
  non_nestable_take_if predicate queue_b;
]

one would run into the following problem: while only a value that passes the predicate would be returned, an element might be taken from both queues.

To avoid this problem, we need a way to roll back changes recorded by a transaction attempt. The way Kcas supports this is via an explicit scoping mechanism. Here is a working (nestable) version of conditional blocking take:

let take_if ~xt predicate queue =
  let snap = Xt.snapshot ~xt in
  let x = Queue.Xt.take_blocking ~xt queue in
  if not (predicate x) then
    Retry.later (Xt.rollback ~xt snap);
  x

First a snapshot of the transaction log is taken and then, in case the predicate is not satisfied, a rollback to the snapshot is performed before signaling a retry. The obvious disadvantage of this kind of explicit approach is that it requires more care from the programmer. The advantage is that it allows the programmer to explicitly scope nested transactions and perform rollbacks only when necessary and in a more fine-tuned manner, which can allow for better performance.

With properly nestable transactions one can express both conjunctive and disjunctive compositions of conditional transactions.

As an aside, having talked about the splay tree a few times in my previous post, I should mention that the implementation of the rollback operation using the splay tree also worked out surprisingly nicely. In the general case, a rollback may have an effect on all accesses to shared memory locations recorded in a transaction log. This means that, in order to support rollback, worst case linear time cost in the number of locations accessed seems to be the minimum — no matter how transactions might be implemented. A single operation on a splay tree may already take linear time, but it is also possible to take advantage of the tree structure and sharing of the immutable spine of splay trees and stop early as soon as the snapshot and the log being rolled back are the same.

Will They Come

Blocking or retrying a transaction indefinitely is often not acceptable. The transaction mechanism with blocking is actually already powerful enough to support timeouts, because a transaction will be retried after any location accessed by the transaction has been modified. So, to have timeouts, one could create a location, make it so that it is changed when the timeout expires, and read that location in the transaction to determine whether the timeout has expired.

Creating, checking, and also cancelling timeouts manually can be a lot of work. For this reason Kcas was also extended with direct support for timeouts. To perform a transaction with a timeout one can simply explicitly specify a timeoutf in seconds:

let try_take_in ~seconds queue =
  Xt.commit ~timeoutf:seconds { tx = Queue.Xt.take_blocking queue }

Internally Kcas uses the domain-local-timeout library for timeouts. The OCaml standard library doesn't directly provide a timeout mechanism, but it is a typical service provided by concurrent schedulers. Just like with the previously mentioned domain local await, the idea with domain local timeout is to allow libraries like Kcas to tap into the native mechanism of whatever scheduler is currently in use and to do so conveniently without pervasive parameterisation. More generally this should allow libraries like Kcas to be scheduler agnostic and help to avoid duplication of effort.

Hollow Man

Let's recall the features of Kcas transactions briefly.

First of all, passing the transaction ~xt through the computation allows sequential composition of transactions:

let bind ~xt a b =
  let x = a ~xt in
  b ~xt x

This also gives conjunctive composition as a trivial consequence:

let pair ~xt a b =
  (a ~xt, b ~xt)

Nesting, via snapshot and rollback, allows conditional composition:

let if_else ~xt predicate a b =
  let snap = Xt.snapshot ~xt in
  let x = a ~xt in
  if predicate x then
    x
  else begin
    Xt.rollback ~xt snap;
    b ~xt
  end

Nesting combined with blocking, via the Retry.Later exception, allows disjunctive composition

let or_else ~xt a b =
  let snap = Xt.snapshot ~xt in
  match a ~xt with
  | x -> x
  | exception Retry.Later ->
    Xt.rollback ~xt snap;
    b ~xt

of blocking transactions, which is also supported via the first combinator.

What is Missing?

The limits of my language mean the limits of my world. — Ludwig Wittgenstein

The main limitation of transactions is that they are invisible to each other. A transaction does not directly modify any shared memory locations and, once it does, the modifications appear as atomic to other transactions and outside observers.

The mutual invisibility means that rendezvous between two (or more) threads cannot be expressed as a pair of composable transactions. For example, it is not possible to implement synchronous message passing as can be found e.g. in Concurrent ML, Go, and various other languages and libraries, including zero capacity Eio Streams, as simple transactions with a signature such as follows:

module type Channel = sig
  type 'a t
  module Xt : sig
    val give : xt:'x Xt.t -> 'a t -> 'a -> unit
    val take : xt:'x Xt.t -> 'a t -> 'a
  end
end

Languages such as Concurrent ML and Go allow disjunctive composition of such synchronous message passing operations and some other libraries even allow conjunctive, e.g. CHP, or even sequential composition, e.g. TE and Reagents, of such message passing operations.

Although the above Channel signature is unimplementable, it does not mean that one could not implement a non-compositional Channel

module type Channel = sig
  type 'a t
  val give : 'a t -> 'a -> unit
  val take : 'a t -> 'a
end

or implement a compositional message passing model that allows such operations to be composed. Indeed, both the CHP and TE libraries were implemented on top of Software Transactional Memory with the same fundamental invisibility of transactions. In other words, it is possible to build a new composition mechanism, distinct from transactions, by using transactions. To allow such synchronisation between threads requires committing multiple transactions.

Torn Reads

The k-CAS-n-CMP algorithm underlying Kcas ensures that it is not possible to read uncommitted changes to shared memory locations and that an operation can only commit successfully after all of the accesses taken together have been atomic, i.e. strictly serialisable or both linearisable and serialisable in database terminology. These are very strong guarantees and make it much easier to implement correct concurrent algorithms.

Unfortunately, the k-CAS-n-CMP algorithm does not prevent one specific concurrency anomaly. When a transaction reads multiple locations, it is possible for the transaction to observe an inconsistent state when other transactions commit changes between reads of different locations. This is traditionally called read skew in database terminology. Having observed such an inconsistent state, a Kcas transaction cannot succeed and must be retried.

Even though a transaction must retry after having observed read skew, unless taken into account, read skew can still cause serious problems. Consider, for example, the following transaction:

let unsafe_subscript ~xt array index =
  let a = Xt.get ~xt array in
  let i = Xt.get ~xt index in
  a.(i)

The assumption is that the array and index locations are always updated atomically such that the subscript operation should be safe. Unfortunately due to read skew the array and index might not match and the subscript operation could result in an "index out of bounds" exception.

Even more subtle problems are possible. For example, a balanced binary search tree implementation using rotations can, due to read skew, be seen to have a cycle. Consider the below diagram. Assume that a lookup for node 2 has just read the link from node 3 to node 1. At that point another transaction commits a rotation that makes node 3 a child of node 1. As the lookup reads the link from node 1 it leads back to node 3 creating a cycle.

Tree rotations

There are several ways to deal with these problems. It is, of course, possible to use ad hoc techniques, like checking invariants manually, within transactions. The Kcas library itself addresses these problems in a couple of ways.

First of all, Kcas performs periodic validation of the entire transaction log when an access, such as get or set, of a shared memory location is made through the transaction log. It would take quadratic time to validate the entire log on every access. To avoid changing the time complexity of transactions, the number of accesses between validations is doubled after each validation.

Periodic validation is an effective way to make loops that access shared memory locations, such as the lookup of a key from a binary search tree, resistant against read skew. Such loops will eventually be aborted on some access and will then be retried. Periodic validation is not effective against problems that might occur due to non-transactional operations made after reading inconsistent state. For those cases an explicit validate operation is provided that can be used to validate that the accesses of particular locations have been atomic:

let subscript ~xt array index =
  let a = Xt.get ~xt array in
  let i = Xt.get ~xt index in
  (* Validate accesses after making them: *)
  Xt.validate ~xt index;
  Xt.validate ~xt array;
  a.(i)

It is entirely fair to ask whether it is acceptable for an STM mechanism to allow read skew. A candidate correctness criterion for transactional memory called "opacity", introduced in the paper On the correctness of transactional memory, does not allow it. The trade-off is that the known software techniques to provide opacity tend to introduce a global sequential bottleneck, such as a global transaction version number accessed by every transaction, that can and will limit scalability especially when transactions are relatively short, which is usually the case.

At the time of writing this there are several STM implementations that do not provide opacity. The current Haskell STM implementation, for example, introduced in 2005, allows similar read skew. In Haskell, however, STM is implemented at the runtime level and transactions are guaranteed to be pure by the type system. This allows the Haskell STM runtime to validate transactions when switching threads. Nevertheless there have been experiments to replace the Haskell STM using algorithms that provide opacity as described in the paper Revisiting software transactional memory in Haskell, for example. The Scala ZIO STM also allows read skew. In his talk Transactional Memory in Practice, Brett Hall describes their experience in using a STM in C++ that also allows read skew.

It is not entirely clear how problematic it is to have to account for the possibility of read skew. Although I expect to see read skew issues in the future, the relative success of the Haskell STM would seem to suggest that it is not necessarily a show stopper. While advanced data structure implementations tend to have intricate invariants and include loops, compositions of transactions using such data structures, like the LRU cache implementation, tend to be loopless and relatively free of such invariants and work well.

Tomorrow May Come

At the time of writing this, the kcas and kcas_data packages are still marked experimental, but are very close to being labeled 1.0.0. The core Kcas library itself is more or less feature complete. The Kcas data library, by its nature, could acquire new data structure implementations over time, but there is one important feature missing from Kcas data — a bounded queue.

It is, of course, possible to simply compose a transaction that checks the length of a queue. Unfortunately that would not perform optimally, because computing the exact length of a queue unavoidably requires synchronisation between readers and writers. A bounded queue implementation doesn't usually need to know the exact length — it only needs to have a conservative approximation of whether there is room in the queue and then the computation of the exact length can be avoided much of the time. Ideally the default queue implementation would allow an optional capacity to the specified. The challenge is to implement the queue without making it any slower in the unbounded case.

Less importantly the Kcas data library currently does not provide an ordered map nor a priority queue. Those serve use cases that are not covered by the current selection of data structures. For an ordered map something like a WAVL tree could be a good starting point for a reasonably scalable implementation. A priority queue, on the other hand, is more difficult to scale, because the top element of a priority queue might need to be examined or even change on every mutation, which makes it a sequential bottleneck. On the other hand, updating elements far from the top shouldn't require much synchronisation. Some sort of two level scheme like a priority queue of per domain priority queues might provide best of both worlds.

But Why?

If you look at a typical textbook on concurrent programming it will likely tell you that the essence of concurrent programming boils down to two (or three) things:

independent sequential threads of control, and
mechanisms for threads to communicate and synchronise.

The first bullet on that list has received a lot of focus in the form of libraries like Eio and Domainslib that utilise OCaml's support for algebraic effects. Indeed, the second bullet is kind of meaningless unless you have threads. However, that does not make it less important.

Programming with threads is all about how threads communicate and synchronise with each other.

A survey of concurrent programming techniques could easily fill an entire book, but if you look at most typical programming languages, they provide you with a plethora of communication and synchronisation primitives such as

atomic operations,
spin locks,
barriers and count down latches,
semaphores,
mutexes and condition variables,
message queues,
other concurrent collections,
and more.

The main difficulty with these traditional primitives is their relative lack of composability. Every concurrency problem becomes a puzzle whose solution is some ad hoc combination of these primitives. For example, given a concurrent thread safe stack and a queue it may be impossible to atomically move an element from the stack to the queue without wrapping both behind some synchronisation mechanism, which also likely reduces scalability.

There are also some languages based on asynchronous message passing with the ability to receive multiple messages selectively using both conjunctive and disjunctive patterns. A few languages are based on rendezvous or synchronous message passing and offer the ability to disjunctively and sometimes also conjunctively select between potential communications. I see these as fundamentally different from the traditional primitives as the number of building blocks is much smaller and the whole is more like unified language for solving concurrency problems rather than just a grab bag of non-composable primitives. My observation, however, has been that these kind of message passing models are not familiar to most programmers and can be challenging to program with.

As an aside, why should one care about composability? Why would anyone care about being able to e.g. disjunctively either pop an element from a stack or take an element from a queue, but not both, atomically? Well, it is not about stacks and queues, those are just examples. It is about modularity and scalability. Being able to, in general, understand independently developed concurrent abstractions on their own and to also combine them to form effective and efficient solutions to new problems.

Another approach to concurrent programming is transactions over mutable data structures whether in the form of databases or Software Transactional Memory (STM). Transactional databases, in particular, have definitely proven to be a major enabler. STM hasn't yet had a similar impact. There are probably many reasons for that. One probable reason is that many languages already offered a selection of familiar traditional primitives and millions of lines of code using those before getting STM. Another reason might be that attempts to provide STM in a form where one could just wrap any code inside an atomic block and have it work perfectly proved to be unsuccessful. This resulted in many publications and blog posts, e.g. A (brief) retrospective on transactional memory, discussing the problems resulting from such doomed attempts and likely contributed to making STM seem less desirable.

However, STM is not without some success. More modest, and more successful, approaches either strictly limit what can be performed atomically or require the programmer to understand the limits and program accordingly. While not a panacea, STM provides both composability and a relatively simple and familiar programming model based on mutable shared memory locations.

Crossroads

Having just recently acquired the ability to have multiple domains running in parallel, OCaml is in a unique position. Instead of having a long history of concurrent multicore programming we can start afresh.

What sort of model of concurrent programming should OCaml offer?

One possible road for OCaml to take would be to offer STM as the go-to approach for solving most concurrent programming problems.

Until Next Time

I've had a lot of fun working on Kcas. I'd like to thank my colleagues for putting up with my obsession to work on it. I also hope that people will find Kcas and find it useful or learn something from it!