
Feature Parity Series: Statmemprof Returns!

Communications Officer
Welcome to part two of our feature parity series! In it, we present returning features that were originally lost when OCaml gained multicore support. The addition of multiple domains means that the underpinning design decisions behind certain features have had to change significantly, and work is ongoing to adapt them and return them to OCaml 5.
One of these features is memory profiling, which, after much theoretical consideration, has been successfully adapted to OCaml 5. Memory profiling is an important tool for developers who want to optimise their programs, and our post today delves into OCaml 5’s statistical memory profiler, statmemprof, and its now multicore-compatible design. Let’s explore the journey to its return!
What is a Memory Profiler?
Developers use memory profilers to understand how their programs use memory. Whether they think a program is using too much memory, suspect it is misbehaving, or simply want a baseline for comparison, attaching a memory profiler lets them see how the program allocates memory and track those allocations as it runs. It sounds straightforward, but this is where the challenges begin!
One of the first hurdles to clear is the sheer volume of allocated memory. Many programs, and in fact, many of the programs that are likely to be interesting from a memory perspective, allocate terabytes of memory over their run time. Running a memory profiler that monitors every allocation would significantly slow down the entire system. OCaml used to have a memory profiler that monitored all allocations (see spacetime), but it was removed because it was too resource-expensive.
The solution to this first conundrum is to use a statistical memory profiler (the ‘stat’ in Statmemprof). A statistical memory profiler monitors a random sample of memory allocations in the program. This method still allows users to find allocations that stand out. Large allocations of memory tend to be more noteworthy, and consequently, if you have a program that allocates small and large pieces of memory, you want the random sampler to sample the bigger ones more often.
Implementing this solution first brought Statmemprof to OCaml 4, but that still left the multicore issue, which, apart from making things generally tricky, required the developers to make some key decisions about how memory profiling should work with multiple domains.
How Statmemprof Works With OCaml 5
Memory Allocation in OCaml
There are a few things one needs to wrap one’s head around to understand how statmemprof does its magic. These include the way OCaml allocates memory with an inline bump-pointer allocator. If you are already familiar with memory allocation in the minor and major heap, jump ahead to the next section!
OCaml needs to be able to allocate millions of objects a second and, therefore, needs very efficient memory allocation. Most programming languages call a function in the language library (such as malloc) that determines which memory to allocate. This process is too slow for OCaml and for many other garbage-collected languages, such as Haskell, which also use bump-pointer allocators.
In OCaml, a large part of the total memory available is reserved in what is known as the minor heap. In the minor heap, an allocation register points to the lowest address of allocated memory, that is, to the boundary between what is allocated and what is free. Say a new object needs 32 bytes of memory: the system subtracts 32 bytes from where the allocation register is pointing and this space is used for the new object. When the minor heap’s garbage collector (GC) runs, it checks which objects can be deleted and which need to be kept. Surviving objects are promoted to the major heap, and the allocation register is reset to its starting position, since the minor heap is now empty.
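The pointer arithmetic described above can be sketched with a toy model. This is only an illustration that uses plain integers as addresses; the type `minor_heap` and the functions `create`, `alloc`, and `reset` are hypothetical names made up for this example and are not part of the OCaml runtime.

```ocaml
(* Toy model of a downward-growing bump-pointer minor heap.
   Addresses are plain integers counted in bytes; nothing here
   touches the real OCaml runtime. *)
type minor_heap = {
  heap_start : int;         (* lowest address of the reserved area *)
  heap_end : int;           (* one past the highest address *)
  mutable alloc_ptr : int;  (* stands in for the allocation register *)
}

let create ~size = { heap_start = 0; heap_end = size; alloc_ptr = size }

(* Allocate [bytes] by moving the pointer down. [None] means the block
   does not fit; the real runtime would run a minor collection here. *)
let alloc heap bytes =
  let new_ptr = heap.alloc_ptr - bytes in
  if new_ptr < heap.heap_start then None
  else begin
    heap.alloc_ptr <- new_ptr;
    Some new_ptr
  end

(* After a minor collection the heap is empty again. *)
let reset heap = heap.alloc_ptr <- heap.heap_end

let heap = create ~size:256
let first = alloc heap 32      (* 256 - 32 = 224 *)
let second = alloc heap 32     (* 224 - 32 = 192 *)
let too_big = alloc heap 512   (* does not fit in what is left *)
let () = reset heap
```

Allocation is a subtraction and a comparison, which is why this scheme is so much cheaper than a general-purpose allocator call.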
The minor heap has a ‘limit’, most commonly set to where the heap’s space ends, that, when reached, triggers a jump into the runtime system. The runtime can then take one of several actions, including garbage collection. This design makes memory allocation in OCaml very fast. Crucially for our topic today, this limit can be used to trigger a number of important events. Signal handling, for example, is achieved by tripping the limit in the minor heap to get into the runtime, which then runs the signal handlers. The runtime decides what actions to take and where to set the limit in the minor heap, allowing it to perform many different behaviours.
OCaml can also bypass the minor heap and allocate objects directly in the major heap. This is useful for very large objects, which tend to live longer and survive the minor heap’s GC anyway. That's a topic for another time.
With this basic overview of how OCaml allocates memory in our back pocket, let’s look at how statmemprof profiles memory in this system.
Statistical Memory Profiling in OCaml
The key to how statmemprof profiles memory lies in how the ‘statistical’ aspect is defined. To sample only a subset of memory allocations, we need a procedure that yields a random selection of samples. Since the profiler only samples, on average, one in every n words allocated, the user can leave it running in the background without introducing significant overhead.
So how does it work? We need to generate a number for both the minor and major heap to help us select the sample we want to profile. The number must be random, and drawn so that every word of allocated memory has the same probability of being sampled. Statmemprof achieves this by treating the allocation of each word as a so-called Bernoulli trial: an independent yes-or-no draw in which the word is sampled with a fixed probability.
Say the event we’re interested in is the allocation of a single word of memory to the minor heap. We have a parameter called ‘lambda’ for any such event, which represents the likelihood that statmemprof will sample that particular event. The random number we get, called a geometric random variable, stands for how many Bernoulli trials with a given lambda (or likelihood) it takes to get the first success. You can also think of it as how long we wait (how many events happen) before we sample one event.
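To make the geometric random variable concrete, here is a minimal sketch that draws one by inverse-transform sampling. It is an illustration only and is not claimed to be how the runtime generates its numbers; the names `geometric` and `mean_skip` are made up for this example, and lambda is assumed to lie strictly between 0 and 1.

```ocaml
(* Number of Bernoulli trials up to and including the first success,
   for a success probability [lambda] with 0 < lambda < 1.
   Inverse-transform sampling: an illustration, not runtime code. *)
let rec geometric lambda =
  let u = Random.float 1.0 in
  if u <= 0.0 || u >= 1.0 then geometric lambda  (* keep u inside (0, 1) *)
  else 1 + int_of_float (log u /. log (1.0 -. lambda))

(* Average many draws to see how long we wait, on average, per sample. *)
let mean_skip ~lambda ~draws =
  let total = ref 0 in
  for _i = 1 to draws do
    total := !total + geometric lambda
  done;
  float_of_int !total /. float_of_int draws

let () = Random.init 42
let observed_mean = mean_skip ~lambda:0.01 ~draws:100_000
```

The expected value of such a variable is 1/lambda, so with lambda set to 0.01 the observed mean should land close to 100 words between samples.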
This choice of distributions is driven by the sampling mechanics in each heap. For the minor heap, we need to know "when is the next sample due?" which is naturally modeled by a geometric distribution - it tells us how many trials (allocations) until we hit our first success (sample). For the major heap, since we're dealing with larger blocks of memory, we need to know "how many samples should we take in N words?" This is naturally modeled by a binomial distribution, as it represents the number of successes (samples) in a fixed number of trials (N words). The geometric distribution is also computationally efficient for triggering the GC mechanism at the right time, while the binomial distribution provides a more systematic way to sample larger memory blocks.
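The binomial question ("how many samples in N words?") can be sketched in the same spirit. The function below simply sums N Bernoulli trials, which is the textbook definition of a binomial draw; a real implementation would presumably use something faster. The name `binomial` and its labelled parameters are hypothetical.

```ocaml
(* Number of samples attributed to a block of [words] words when each
   word is sampled independently with probability [lambda].
   Summing Bernoulli trials is slow but mirrors the definition. *)
let binomial ~words ~lambda =
  let hits = ref 0 in
  for _i = 1 to words do
    if Random.float 1.0 < lambda then incr hits
  done;
  !hits

let () = Random.init 7
let samples_in_large_block = binomial ~words:1000 ~lambda:0.1
let samples_when_disabled = binomial ~words:1000 ~lambda:0.0
```

With 1000 words and a lambda of 0.1, the result varies from run to run around an expected value of 100, while a lambda of 0 never produces a sample.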
Now, let's imagine we get a random number, say 137. That number is subtracted from the allocation register in the minor heap, and the limit is set there. When the limit is reached, we go into the runtime, and the action we take is to take a memory profile sample. Statmemprof then generates a new number, and the process repeats. The process is the same for the major heap, but we use a binomial random variable instead of a geometric one.
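The interplay between the random number and the minor heap limit can be modelled in a few lines. This is a toy sketch under simplifying assumptions: sizes are in words, the heap never fills up, and a block yields at most one sample. The function `sampled_allocations` and its parameters `heap_words` and `next_skip` are hypothetical; `next_skip` stands in for the geometric random number generator so that the example can use a fixed value.

```ocaml
(* Toy model of sampling through the minor-heap limit. The heap grows
   downwards, as in the text. Returns the (index, size) of every
   allocation that reached or crossed the limit. *)
let sampled_allocations ~heap_words ~next_skip sizes =
  let alloc_ptr = ref heap_words in
  let limit = ref (heap_words - next_skip ()) in
  let sampled = ref [] in
  List.iteri
    (fun index size ->
      alloc_ptr := !alloc_ptr - size;
      if !alloc_ptr <= !limit then begin
        (* Limit reached: "enter the runtime", record the sample, then
           draw a new number and move the limit below the pointer. *)
        sampled := (index, size) :: !sampled;
        limit := !alloc_ptr - next_skip ()
      end)
    sizes;
  List.rev !sampled

(* Always wait 137 words, mirroring the example number in the text. *)
let samples =
  sampled_allocations ~heap_words:10_000
    ~next_skip:(fun () -> 137)
    [100; 30; 10; 200; 50; 90]
```

With these inputs the third, fourth, and sixth allocations are sampled: each is the block during which the running total of allocated words reaches the current limit.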
The benefit of statistical memory profiling is that smaller-sized objects in the minor and major heaps are less likely to be sampled since they don’t take up as much space as larger objects. This is good because the larger objects tend to be more interesting from a memory profiling perspective.
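A short calculation shows why larger blocks are favoured. Assuming, as described above, that each word is sampled independently with probability lambda, the chance that a block of n words receives at least one sample is:

```latex
P(\text{block of } n \text{ words is sampled}) = 1 - (1 - \lambda)^{n} \approx n\lambda
\qquad \text{when } n\lambda \ll 1
```

For example, with lambda set to 0.0001 (an illustrative value), a 2-word block is sampled with probability of about 0.0002, while a 1000-word block is sampled with probability of about 0.095.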
What Happens When Statmemprof Samples an Object?
Statmemprof was designed to be a flexible mechanism that gives the programmer a lot of choice. There is no hardwired action set up for when statmemprof samples an allocation. Instead, there are a number of actions to choose from left open for users to configure. They include determining the size of the object, whether it came from the minor or major heap, and what the program was doing at the time of the object’s allocation.
When statmemprof samples an allocation, it executes a callback (a function supplied by the user) which is provided with details about the allocation and a backtrace. A backtrace refers to the sequence of functions that called a particular function. Backtraces are used to trace backwards from the function that triggered the allocation to the functions that called it, and so on, until the entry point of the program is reached. What this means for statmemprof is that the API provides enough details for tools like memtrace to generate visual representations of memory use for the user's programs.
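To get a feel for what a backtrace contains, the standard library can capture one directly. The sketch below uses `Printexc.get_callstack`; the functions `innermost` and `middle` are hypothetical stand-ins for application code. Frames are only rendered with function names and locations when the program is compiled with debug information (`-g`).

```ocaml
(* Capture the current call stack, much like the one a profiler
   callback receives, and render it as text. *)
let innermost () =
  let stack = Printexc.get_callstack 16 in  (* at most 16 frames *)
  (Printexc.raw_backtrace_length stack,
   Printexc.raw_backtrace_to_string stack)

(* A hypothetical caller, so that the stack has some depth. *)
let middle () =
  let result = innermost () in
  result

let frame_count, rendered = middle ()
let () = print_string rendered
```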
There are five different kinds of events that can trigger the callback:
- alloc_minor: an object is allocated to the minor heap
- alloc_major: an object is allocated to the major heap
- promote: an object survives garbage collection and is moved to the major heap
- dealloc_minor: an object does not survive garbage collection and is freed from the minor heap
- dealloc_major: an object does not survive garbage collection and is freed from the major heap
So, the hypothetical lifecycle of an object could be as follows: it gets stored in the minor heap with alloc_minor. The limit is tripped in the minor heap, and the garbage collector runs. The object survives garbage collection and is moved to the major heap with promote. The garbage collector runs in the major heap, and if the object is not needed anymore, it gets freed with dealloc_major. As an object's lifecycle progresses, statmemprof will execute a callback for each event and a complete picture of it can be built up. Statmemprof is designed to be flexible and configurable, and, for example, users can choose to set the profiler to retain callback information or opt to discard it.
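The five events map onto the fields of the tracker record in the standard library's `Gc.Memprof` module. The following is a minimal sketch that counts each kind of event; the counters and the `workload` function are made up for this example, and the sampling rate of 0.01 is an arbitrary choice. It assumes a compiler where `Gc.Memprof` is implemented (OCaml 4.14 or 5.3 and later); on runtimes without statmemprof support, `start` is assumed here to raise `Failure`, in which case the workload simply runs unprofiled.

```ocaml
(* Counters for each lifecycle event (names are this example's own). *)
let minor_allocs = ref 0
let major_allocs = ref 0
let promotions = ref 0
let minor_frees = ref 0
let major_frees = ref 0
let sampled_minor_words = ref 0

(* Returning [Some ()] asks statmemprof to keep tracking the block, so
   the later promote and dealloc callbacks fire for it. *)
let tracker : (unit, unit) Gc.Memprof.tracker = {
  Gc.Memprof.alloc_minor = (fun info ->
    incr minor_allocs;
    sampled_minor_words := !sampled_minor_words + info.Gc.Memprof.size;
    Some ());
  alloc_major = (fun _info -> incr major_allocs; Some ());
  promote = (fun () -> incr promotions; Some ());
  dealloc_minor = (fun () -> incr minor_frees);
  dealloc_major = (fun () -> incr major_frees);
}

(* Something to profile: many small arrays, all kept alive. *)
let workload () =
  let keep = ref [] in
  for i = 1 to 10_000 do
    keep := Sys.opaque_identity (Array.make 8 i) :: !keep
  done;
  List.length !keep

let live_blocks =
  match Gc.Memprof.start ~sampling_rate:0.01 tracker with
  | exception Failure _ -> workload ()  (* profiling unavailable *)
  | _profile ->
      let n = workload () in
      Gc.Memprof.stop ();
      n

let () =
  Printf.printf "minor=%d major=%d promoted=%d freed(minor)=%d freed(major)=%d\n"
    !minor_allocs !major_allocs !promotions !minor_frees !major_frees
```

The exact counts vary from run to run because the sampling is random. A real tool would store the call stack and size from the allocation record rather than just counting.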
Memtrace
For many users, delving into the code to configure statmemprof would add an undesirable level of complexity to their workflow. The solution is to use tools like Memtrace, a profiling library that uses the statmemprof interface. By building on the statmemprof functionality, these tools enable users to profile memory in the way they want to without having to worry about the specifics of how statmemprof works. Memtrace can accumulate the allocations and callstacks from the program to get a picture of which code locations are responsible for triggering allocations. (Note that, as of writing, the 5.3-compatible version of Memtrace has yet to be released by Jane Street, but work is underway.)
Memtrace was created at Jane Street to help them pinpoint memory issues like space leaks. It uses the callback API implemented by statmemprof to record allocation events in the binary format Common Trace Format (CTF). Memtrace also comes with a viewer, a helpful tool that lets developers visualise their programs and see how memory is allocated.
Generating a trace is straightforward, and Luke Maurer from Jane Street outlines the process in a great blog post on their website. If you want to learn more about the design of Memtrace, check out this excellent guide.
This is just one example of how restoring statmemprof support brings powerful options to users of OCaml 5. Its features support the creation and implementation of tools that let users manage and understand how their programs use memory in new and detailed ways.
Considerations for Multiple Domains
So how do multiple domains affect the design choices for a memory profiler? Let’s take a look at some examples:
- Let’s say you have two domains running at the same time doing different jobs separately, then one domain starts profiling its memory allocations. Should memory allocated by the other domain be sampled? The answer is: No! Behaviour in separate domains should be treated independently of each other.
- Say you are in one domain and you start profiling, then, from this domain, you spawn another. Should the allocations in the new domain be profiled? The answer is: Yes! Because the new domain was created to achieve the work of the original domain.
- In the multi-domain world, one domain can start a ‘profile’ by calling the start function of statmemprof, which sets up all the callbacks and sampling separately from all other domains. In theory, you could apply entirely different profiling tools, like memtrace, in different domains in the same program.
- Let’s say you run a program on multiple domains and run a profile on one domain, which allocates some objects, samples them, and runs the allocation callbacks. Let’s then suppose that the domain terminates but the profile keeps running (say, because another domain is running the same profile), and a sampled object is promoted by the GC and continues its lifecycle. As a rule, callbacks are run by the domain that allocated the object, but if that domain has terminated, the callback may be run by a different domain, because the object might still be alive on the major heap. Likewise, when the object is freed and statmemprof needs to run a deallocation callback, it can run that callback from a different domain if the original domain has terminated.
- Should callbacks keep running after the profiler has called stop? In OCaml 4, after stop was called, statmemprof would essentially throw away all of its sampled information. In OCaml 5, the user can choose between two requests: stop, where statmemprof stops sampling new allocations but keeps the information it holds, and stop and discard, where the profiler discards all the information held for that profile. This distinction wasn’t relevant in OCaml 4, since a terminated domain meant the program had ended and statmemprof could just disregard that information. With OCaml 5, longer-running memory profiling is more likely, and we need to be able to distinguish between the two stop calls.
- Lastly, a lot of work went into synchronisation and ensuring that no domain ever has to wait for statmemprof before being able to continue its work. Statmemprof uses only one lock to enforce synchronisation, which is taken when a domain terminates while statmemprof is still running. The domain’s data is put on the orphans list, which is protected by that lock. Any other domain can then adopt this data.
Are you using statmemprof? Please provide feedback and raise any issues in the OCaml repo and on the OCaml Discuss forum.
Until Next Time!
Curious about how we maintain and restore features to OCaml 5? Read more of our multicore and compiler blog posts, such as compaction, compiler maintenance, and catching data races.
Connect with Tarides online on Bluesky, Mastodon, Threads, and LinkedIn or sign up for our mailing list to stay updated on our latest projects.
Acknowledgements
A huge thank you to Nick Barnes and Tim McGilchrist for their invaluable and extensive input on this post.
Open-Source Development
Tarides champions open-source development. We create and maintain key features of the OCaml language in collaboration with the OCaml community. To learn more about how you can support our open-source work, discover our page on GitHub.