Feature Parity Series: Statmemprof Returns!

Welcome to part two of our feature parity series! In it, we present returning features that were originally lost when OCaml gained multicore support. The addition of multiple domains means that the underpinning design decisions behind certain features have had to change significantly, and work is ongoing to adapt them and return them to OCaml 5.

One of these features is memory profiling, which, after much theoretical consideration, has been successfully adapted to OCaml 5. Memory profiling is an important tool for developers who want to optimise their programs, and our post today delves into OCaml 5’s statistical memory profiler, statmemprof, and its now multicore-compatible design. Let’s explore the journey to its return!

What is a Memory Profiler?

Developers use memory profilers to understand how their programs use memory. Whether they suspect a program is using too much memory or behaving suspiciously, or simply want to analyse it for comparison’s sake, attaching a memory profiler lets them see how the program allocates memory and keep track of it as it runs. It sounds straightforward, but this is where the challenges begin!

One of the first hurdles to clear is the sheer volume of allocated memory. Many programs, and in fact, many of the programs that are likely to be interesting from a memory perspective, allocate terabytes of memory over their run time. Running a memory profiler that monitors every one of those allocations would significantly slow down the program. OCaml used to have a memory profiler that monitored all allocations (see Spacetime), but it was removed because it was too resource-expensive.

The solution to this first conundrum is to use a statistical memory profiler (the ‘stat’ in Statmemprof). A statistical memory profiler records a random sample of memory allocations in the program. This method still allows users to find allocations that stand out. Large allocations of memory tend to be more noteworthy, and consequently, if you have a program that allocates small and large pieces of memory, you want the random sampler to sample the bigger ones more often.

Statmemprof was added to OCaml 4.11 while Multicore OCaml was still in development. Hence, the original implementation of Statmemprof did not have to worry about how memory profiling should work with multiple domains. With the arrival of OCaml 5 and its multicore support, this question had to be addressed.

How `Statmemprof` Works With OCaml 5

Memory Allocation in OCaml

There are a few things one needs to wrap one’s head around to understand how statmemprof does its magic. This includes the way OCaml allocates memory with an inline pointer-bump allocator. If you are already familiar with memory allocation in the minor and major heap, jump ahead to the next section!

OCaml needs to be able to allocate millions of objects a second and, therefore, needs very efficient memory allocation. Most programming languages call a function in the language library (such as malloc) that determines which memory to allocate. This process is too slow to work well for OCaml and for many other garbage-collected languages, such as Haskell, which use bump-pointer allocators instead.

In OCaml, part of the total memory available is reserved in what is known as the minor heap. In the minor heap, an allocation register points to the lowest address of allocated memory, that is, to the boundary between what is allocated and what is free. Say a new object needs 32 bytes of memory: the system subtracts 32 bytes from where the allocation register is pointing and this space is used for the new object. When the minor heap’s garbage collector (GC) runs, it checks which objects can be reclaimed and which need to be kept. Surviving objects are promoted to the major heap, and the allocation register is reset to the top (highest address) of the minor heap since it is now empty.

The minor heap has a ‘limit’, most commonly set to where the heap’s space ends, that, when reached, triggers a jump into the runtime system. The runtime can then take one of several actions, including garbage collection. This design makes memory allocation in OCaml very fast. Crucially for our topic today, this limit can be used to trigger a number of important events. Signal handling, for example, is achieved by tripping the limit in the minor heap to get into the runtime, which then runs the signal handlers. The runtime decides what actions to take and where to set the limit in the minor heap, allowing it to perform many different behaviours.
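To make the pointer-bump idea and the role of the limit concrete, here is a minimal toy model. It is only a sketch: the record type, the field names, and the `enter_runtime` function are illustrative assumptions, sizes are counted in abstract words, and the real runtime works on machine addresses and can do much more than empty the heap when the limit is tripped.

```ocaml
(* Toy model of a downward-growing minor heap. Sizes are in words.
   All names here are illustrative, not taken from the OCaml runtime. *)
type minor_heap = {
  size : int;          (* total capacity, in words *)
  mutable ptr : int;   (* allocation pointer: words at or above it are in use *)
  mutable limit : int; (* going below this value means "enter the runtime" *)
}

let create size = { size; ptr = size; limit = 0 }

(* What our pretend runtime does when the limit is tripped. In this toy
   model the only action is a minor collection that empties the heap. *)
let enter_runtime heap =
  heap.ptr <- heap.size;
  heap.limit <- 0

(* Fast path: one subtraction and one comparison. Returns the offset of
   the newly allocated block. *)
let rec alloc heap words =
  if words > heap.size then invalid_arg "alloc: block larger than the heap";
  let new_ptr = heap.ptr - words in
  if new_ptr < heap.limit then begin
    enter_runtime heap;
    alloc heap words
  end else begin
    heap.ptr <- new_ptr;
    new_ptr
  end
```

With a 100-word heap, three 32-word allocations fit, and the fourth one trips the limit, which "collects" the heap before the allocation is retried.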

OCaml can also bypass the minor heap and allocate objects directly in the major heap. This is useful for very large objects, which tend to live longer and survive the minor heap’s GC anyway. That's a topic for another time.

With this basic overview of how OCaml allocates memory in our back pocket, let’s look at how statmemprof profiles memory in this system.

Statistical Memory Profiling in OCaml

The key to how statmemprof profiles memory lies in how the ‘statistical’ aspect is defined. To sample only a subset of memory allocations, we need to define a procedure by which we get a random selection of samples. Since statmemprof only profiles a small fraction of allocations, the user can leave the profiler running in the background without introducing significant overhead.

So how does it work? We need to generate a number for both the minor and major heap to help us select the sample we want to profile. We need that number to be random, and drawn so that every allocated word has the same probability of being sampled. Statmemprof achieves this by treating each word of allocation as a so-called Bernoulli trial, which either succeeds (the word is sampled) or fails (it is not), always with the same probability.

Say the event we’re interested in is the allocation of a single word of memory to the minor heap. We have a parameter called ‘lambda’ for any such event, which represents the likelihood that statmemprof will sample that particular event. The random number we get, called a geometric random variable, stands for how many Bernoulli trials it takes to get the first success for a given lambda (or likelihood). You can also think of it as how long we wait (how many events happen) before we sample one event.

This choice of distributions is driven by the sampling mechanics in each heap. For the minor heap, we need to know "when is the next sample due?", which is naturally modelled by a geometric distribution: it tells us how many trials (allocated words) it takes until we hit our first success (sample). For the major heap, since we're dealing with larger blocks of memory, we need to know "how many samples should we take in N words?" This is naturally modelled by a binomial distribution, as it represents the number of successes (samples) in a fixed number of trials (N words). The geometric distribution is also computationally efficient for triggering the GC mechanism at the right time, while the binomial distribution provides a more systematic way to sample larger memory blocks.
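For readers who like to see the formulas, the two distributions can be written out as follows. This is the standard textbook form, assuming each word is sampled independently with probability λ; here X is the number of words up to and including the next sampled one, and S is the number of samples in a block of N words.

```latex
P(X = k) = (1-\lambda)^{k-1}\,\lambda, \quad k = 1, 2, \dots
\qquad \mathbb{E}[X] = \frac{1}{\lambda}

P(S = s) = \binom{N}{s}\,\lambda^{s}\,(1-\lambda)^{N-s}, \quad s = 0, 1, \dots, N
\qquad \mathbb{E}[S] = N\lambda
```

Both describe the same underlying process, so on average one word in every 1/λ is sampled whichever heap it is allocated in.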

Now, let's imagine we get a random number, say 137. That number is subtracted from the allocation register in the minor heap, and the limit is set there. When the limit is reached, we go into the runtime, and the action we take is to take a memory profile sample. Statmemprof then generates a new number, and the process repeats. The process is the same for the major heap, but we use a binomial random variable instead of a geometric one.
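The loop of "draw a number, wait that many words, take a sample, draw again" can be sketched in a few lines. This is a simulation under stated assumptions, not the runtime's code: the function names are ours, and the geometric variable is drawn by inverting its cumulative distribution, which is one common way to do it.

```ocaml
(* Number of Bernoulli(lambda) trials up to and including the first
   success, drawn by inverting the geometric CDF. Requires 0 < lambda < 1. *)
let geometric rng lambda =
  (* Clamp away from 0 so that [log u] stays finite. *)
  let u = Float.max Float.min_float (1.0 -. Random.State.float rng 1.0) in
  1 + int_of_float (log u /. log (1.0 -. lambda))

(* Walk through [total_words] words of allocation, jumping straight from
   one sampled word to the next, and count how many samples are taken. *)
let count_samples rng lambda total_words =
  let rec go next_sample samples =
    if next_sample > total_words then samples
    else go (next_sample + geometric rng lambda) (samples + 1)
  in
  go (geometric rng lambda) 0

let () =
  let rng = Random.State.make [| 42 |] in
  let lambda = 0.01 in
  let samples = count_samples rng lambda 1_000_000 in
  Printf.printf "sampled %d words out of 1000000 (lambda = %g)\n"
    samples lambda
```

The count should come out close to lambda times the number of words, while the work done is proportional to the number of samples rather than to the number of words allocated.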

The benefit of statistical memory profiling is that smaller-sized objects in the minor and major heaps are less likely to be sampled since they don’t take up as much space as larger objects. This is good because the larger objects tend to be more interesting from a memory profiling perspective.
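The size bias follows directly from sampling word by word: a block of n words is sampled at least once with probability 1 - (1 - lambda)^n. The short sketch below evaluates that expression for a few block sizes; the function name and the chosen lambda are illustrative, and details such as block headers are ignored.

```ocaml
(* Probability that at least one word of a [words]-word block is sampled,
   when each word is sampled independently with probability [lambda]. *)
let p_sampled lambda words =
  1.0 -. ((1.0 -. lambda) ** float_of_int words)

let () =
  List.iter
    (fun words ->
      Printf.printf "%6d words: %.4f\n" words (p_sampled 1e-4 words))
    [ 2; 100; 10_000 ]
```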

What Happens When `Statmemprof` Samples an Object?

Statmemprof was designed to be a flexible mechanism that gives the programmer a lot of choice. There is no hardwired action set up for when statmemprof samples an allocation. Instead, there are a number of actions to choose from left open for users to configure. They include determining the size of the object, whether it came from the minor or major heap, and what the program was doing at the time of the object’s allocation.

When statmemprof samples an allocation it executes a callback (a construct that essentially works like a function) which is provided with details about the allocation and a backtrace. A backtrace refers to the sequence of functions that called a particular function. Backtraces are used to trace backwards from the function that triggered the allocation to the functions that called it, and so on, until it reaches the entry point of the program. What this means for statmemprof is that the API provides enough details for tools like memtrace to generate visual representations of memory use for the user's programs.

There are five different kinds of events that can trigger the callback:

alloc_minor: an object is allocated to the minor heap
alloc_major: an object is allocated to the major heap
promote: an object survives garbage collection and is moved to the major heap
dealloc_minor: an object does not survive garbage collection and is freed from the minor heap
dealloc_major: an object does not survive garbage collection and is freed from the major heap

So, the hypothetical lifecycle of an object could be as follows: it gets stored in the minor heap with alloc_minor. The limit is tripped in the minor heap, and the garbage collector runs. The object survives garbage collection and is moved to the major heap with promote. The garbage collector runs in the major heap, and if the object is not needed anymore, it gets freed with dealloc_major. As an object's lifecycle progresses, statmemprof will execute a callback for each event and a complete picture of it can be built up. Statmemprof is designed to be flexible and configurable, and, for example, users can choose to set the profiler to retain callback information or opt to discard it.
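As a rough illustration of the five callbacks, the sketch below uses the standard library's `Gc.Memprof` module to count lifecycle events. It assumes a compiler where `Gc.Memprof` is functional (to our knowledge OCaml 4.11 to 4.14, and 5.3 or later); the counters and the choice to carry each block's size as its tracking data are our own.

```ocaml
(* Illustrative counters; the names are ours, not part of Gc.Memprof. *)
let minor_allocs = ref 0
let major_allocs = ref 0
let promotions = ref 0
let deallocs = ref 0

(* Each tracked block carries its size (in words) as user data. Returning
   [None] from an allocation callback would stop tracking that block. *)
let tracker : (int, int) Gc.Memprof.tracker = {
  Gc.Memprof.alloc_minor =
    (fun (info : Gc.Memprof.allocation) ->
      incr minor_allocs; Some info.Gc.Memprof.size);
  alloc_major =
    (fun (info : Gc.Memprof.allocation) ->
      incr major_allocs; Some info.Gc.Memprof.size);
  promote = (fun size -> incr promotions; Some size);
  dealloc_minor = (fun _size -> incr deallocs);
  dealloc_major = (fun _size -> incr deallocs);
}

let () =
  (* [ignore] keeps this compatible both with versions where [start]
     returns unit and with versions where it returns a profile handle. *)
  ignore (Gc.Memprof.start ~sampling_rate:0.01 tracker);
  let data = List.init 100_000 (fun i -> (i, i)) in
  ignore (Sys.opaque_identity data);
  Gc.Memprof.stop ();
  Printf.printf "minor=%d major=%d promoted=%d freed=%d\n"
    !minor_allocs !major_allocs !promotions !deallocs
```

The exact counts vary from run to run because sampling is random. A real tool would typically also record `info.callstack` to attribute allocations to code locations.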

Memtrace

For many users, delving into the code to configure statmemprof would add an undesirable level of complexity to their workflow. The solution is to use tools like Memtrace, a profiling library that uses the statmemprof interface. By building on the statmemprof functionality, these tools enable users to profile memory in the way they want to without having to worry about the specifics of how statmemprof works. Memtrace can accumulate the allocations and callstacks from the program to get a picture of which code locations are responsible for triggering allocations. (Note that, as of writing, the 5.3-compatible version of Memtrace has yet to be released by Jane Street, but work is underway.)

Memtrace was created at Jane Street to help them pinpoint memory issues like space leaks. It uses the callback API implemented by statmemprof to record allocation events in the binary format Common Trace Format (CTF). Memtrace also comes with a viewer, a helpful tool that lets developers visualise their programs and see how memory is allocated.

Generating a trace is straightforward, and Luke Maurer from Jane Street outlines the process in a great blog post on their website. If you want to learn more about the design of Memtrace, check out this excellent guide.

This is just one example of how restoring statmemprof support brings powerful options to users of OCaml 5. Its features support the creation and implementation of tools that let users manage and understand how their programs use memory in new and detailed ways.

Considerations for Multiple Domains

So how do multiple domains affect the design of a memory profiler? The choices we made reflect our preference after weighing our options, and not necessarily the only 'right' way to approach the problem. Below are some examples of the design choices we made while bringing memory profiling to multiple domains in OCaml:

Let’s say you have two domains running at the same time doing different jobs separately, then one domain starts profiling its memory allocations. Should memory allocated by the other domain be sampled? For us, the answer was no. Behaviour in separate domains should be treated independently of each other.
Say you are in one domain and you start profiling, then, from this domain, you spawn another. Should the allocations in the new domain be profiled? We chose to answer 'yes' to this, since the new domain was created to achieve the work of the original domain.
Should callbacks keep running after the user has called stop? In OCaml 4, after stop was called, statmemprof would essentially throw away all of its sampled information. In OCaml 5, the user can choose between stop, where statmemprof stops sampling new allocations but keeps the information, and stop and discard, where the profiler discards all the information held for that profile. This wasn’t a relevant feature for OCaml 4 since a terminated domain meant the program had ended and statmemprof could just disregard that information. With OCaml 5, longer running memory profiling is more likely, and we need to be able to distinguish between the two stop calls.
For statmemprof, one domain can start a ‘profile’ by calling the start function of statmemprof, and set up all the callbacks and sampling separately from all other domains. In theory, you could apply entirely different profiling tools, like memtrace, in different domains in the same program.
Let’s say you run a program on multiple domains and run a profile on one domain which allocates some objects, samples them, and runs the allocation callbacks. Let’s then suppose that that domain terminates but the profile keeps running (say, if another domain is running the same profile) and a sampled object is promoted by the GC and continues its lifecycle. It is generally the rule that callbacks should be run by the domain that allocated the object, but if that original domain has terminated, the callback may be run by a different domain because the object might still be alive on the major heap. When the object is freed and statmemprof needs to run a deallocation callback, it can also run that callback from a different domain if the original domain has terminated.
Lastly, a lot of work went into synchronisation and ensuring that no domain was ever waiting for statmemprof before being able to continue its jobs. Statmemprof only uses one lock to enforce synchronisation, which occurs when a domain terminates while statmemprof is still running. Its data is put on the orphans list which is protected by a lock. Any other domain can then adopt this data.

These were just some of the decisions that our team made to ensure the profiler worked well for programs with multiple domains, a technically complex challenge with a lot of variables to consider.
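Two of the choices above, profile inheritance by spawned domains and the distinction between stopping and discarding, can be sketched as follows. This assumes OCaml 5.3 or later, where, to our understanding, `Gc.Memprof.start` returns a profile handle that can later be passed to `Gc.Memprof.discard`. The counter and the `work` function are illustrative.

```ocaml
(* Callbacks may run in any domain sharing the profile, so the counter
   is an Atomic rather than a plain ref. *)
let sampled = Atomic.make 0

let tracker : (unit, unit) Gc.Memprof.tracker = {
  Gc.Memprof.null_tracker with
  Gc.Memprof.alloc_minor = (fun _ -> Atomic.incr sampled; None);
}

(* Some allocation-heavy work for the child domain. *)
let work () =
  ignore (Sys.opaque_identity (List.init 100_000 (fun i -> (i, i))))

let () =
  let profile = Gc.Memprof.start ~sampling_rate:0.01 tracker in
  (* The child is spawned while the profile is running, so its
     allocations are sampled under the same profile. *)
  let child = Domain.spawn work in
  Domain.join child;
  (* Stop sampling new allocations but keep what has been tracked... *)
  Gc.Memprof.stop ();
  (* ...then drop everything still held for this profile. *)
  Gc.Memprof.discard profile;
  Printf.printf "sampled allocations: %d\n" (Atomic.get sampled)
```

A domain that was already running before `start` was called would not appear in this count, since it does not share the profile.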

Until Next Time!

Multicore statmemprof was developed at Tarides, and we are happy to have brought the tool into the OCaml 5 era. We invite you to use the memory profiler to analyse your own programs. Please provide feedback and raise any issues in the OCaml repo and on the OCaml Discuss forum. You can also contact us directly for support with your multicore code or to get advice on how to take advantage of multicore OCaml.

Curious about how we maintain and restore features to OCaml 5? Read more of our multicore and compiler blog posts, such as compaction, compiler maintenance, and catching data races.

Connect with Tarides online on Bluesky, Mastodon, Threads, and LinkedIn or sign up for our mailing list to stay updated on our latest projects.

Acknowledgements

A huge thank you to Nick Barnes and Tim McGilchrist for their invaluable and extensive input on this post.
