Towards Minimal Disk-Usage for Tezos Bakers

by the Tarides storage team on Nov 10th, 2022

Over the last few months, Tarides has focused on designing, prototyping, and integrating a new feature for Tezos bakers: automatic context pruning for rolling and full nodes. This feature will allow bakers to run Tezos with minimal disk usage while continuing to enjoy 12x more responsive operations. The first version has been released with Octez v15. The complete, more optimised context pruning feature will come with Octez v16. We encourage every Tezos baker to upgrade and give feedback.

We have implemented context pruning for rolling and full nodes, which requires ~35GB of disk for storing 6 cycles in the upper layer. In Octez v15, each subsequent pruning run needs an additional 40GB, but that space is recovered when the operation finishes. We plan to remove that extra requirement in Octez v16.

Improve Space Usage with Context Pruning

The Tezos context is a versioned key/value store that associates a view of the ledger state with each block. The versioning uses concepts similar to Git. The current implementation uses Irmin as a backend, abstracted behind the lib_context library.
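As a rough illustration of what such a Git-like versioned store looks like, here is a simplified OCaml signature; the module and function names are purely illustrative and do not correspond to the actual lib_context or Irmin APIs.

```ocaml
(* A simplified sketch of a Git-like versioned key/value store, in the
   spirit of the Tezos context. Illustrative only: these names are not
   the real lib_context or Irmin interfaces. *)
module type VERSIONED_STORE = sig
  type t                   (* a view of the ledger state at some block *)
  type commit_hash         (* identifies one version, like a Git commit *)
  type key = string list   (* hierarchical paths into the ledger state *)

  val checkout : commit_hash -> t option    (* load the view for a block *)
  val find : t -> key -> bytes option       (* read a value in that view *)
  val add : t -> key -> bytes -> t          (* functional update: a new view *)
  val commit : parent:commit_hash option -> t -> commit_hash  (* persist it *)
end
```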

We have been designing, prototyping, and integrating a new structure for Irmin storage. It is now reorganised into two layers: an upper layer that contains the latest cycles of the blockchain, which are still in use, and a lower layer containing older, frozen data. A new garbage collection (GC) feature periodically restructures the Tezos context by removing unused data in the oldest cycles from the upper layer; only the data still accessible from the currently live cycles is preserved. The first version of the GC, available in Octez v15, is optimised for rolling and full nodes and thus does not contain a lower layer. We plan to extend this feature in Octez v17 to dramatically improve archive nodes' performance by moving the unused data to the lower layer (more on this below).

Garbage collection and subsequent compression of live data improve disk and kernel cache performance, which enhances overall node performance. Currently, rolling node operators must apply a manual cleanup process to release disk space by discarding unused data. This manual cleanup is tedious and error-prone: operators risk discarding valuable data, have to stop their baker, or end up devising semi-automatic procedures and running multiple bakers to avoid downtime. The GC feature provides rolling node operators with a fully automated method to clean up unused data, and guarantees that only unused data is discarded, i.e., all currently used data is preserved.

The GC operation is performed asynchronously, with minimal impact on the Tezos node. In the rolling node's case, a GC'd context uses less disk space and has more stable performance throughout, as the protocol operations (such as executing smart contracts or computing baking rewards) only need data from the upper layer. As such, the nodes that benefit from the store's layered structure no longer need the manual snapshot export/import that was previously necessary when the context on disk grew too big. In the future, archive nodes' performance will also improve, because only the upper layer is needed to validate recent blocks. This means archive nodes can bake as reliably as rolling nodes.

Tezos Storage, in a Nutshell

The Tezos blockchain uses Irmin as its main storage component. Irmin is a library for designing Git-like storage systems. It has many backends, one of which is irmin-pack, which is optimised for the Tezos use case. In the following, we focus on the main file used to store object data: the store pack file.

Pack file: Tezos state is serialised as immutable functional objects. These objects are marshalled into an append-only pack file, one after the other. An object can contain pointers to earlier (but not later!) objects in the file. A pointer to an earlier object is typically represented by the offset (position) of that object in the pack file. The pack file is append-only: existing objects are never updated.


An Irmin pack file as a sequence of objects: | obj | obj | obj | ...
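To make the append-only structure concrete, here is a minimal in-memory OCaml model of a pack file, using Buffer and Marshal as stand-ins for the real on-disk format; it sketches the idea only and is not irmin-pack's actual code.

```ocaml
(* A toy model of an append-only pack file: objects are marshalled one after
   the other and addressed by the offset at which they were written. *)
type offset = int

type obj = {
  payload : string;     (* the object's data *)
  refs : offset list;   (* pointers to *earlier* objects only *)
}

type pack = Buffer.t    (* stands in for the on-disk append-only file *)

let create () : pack = Buffer.create 1024

(* Appending returns the offset of the new object; existing bytes are never
   modified, new bytes are only ever added at the end. *)
let append (pack : pack) (o : obj) : offset =
  let off = Buffer.length pack in
  Buffer.add_string pack (Marshal.to_string o []);
  off

(* Reading an object back only requires its offset. *)
let read (pack : pack) (off : offset) : obj =
  Marshal.from_string (Buffer.contents pack) off
```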

Commit objects: Some of the objects in the pack file are commit objects. A commit, together with the objects reachable from that commit, represents the state associated with a Tezos block. The Tezos node only needs the last commit to process new blocks, but bakers need many more commits to compute baking rewards. Objects not reachable from these commits are unreachable, or dead, objects.


The data-structure (mental model) representation of the pack file vs. its physical representation.

Archive nodes and rolling nodes: There are different types of Tezos nodes. An archive node stores the complete blockchain history from the genesis block. Currently, this is over 2 million blocks. Roughly speaking, a block corresponds to a commit. A rolling node stores only the last n blocks, where n is chosen to keep the total disk usage within some bounds. This may be as small as 5 (or even less) or as large as 40,000 or more. Another type of node is the "full node," which is between an archive node and a rolling node.

Rolling nodes, disk space usage: The purpose of the rolling node is to keep resource usage, particularly disk space, bounded by only storing the last blocks. However, the current implementation does not achieve this aim. As rolling nodes execute, the pack file grows larger and larger, and no old data is discarded. To get around this problem, node operators periodically export snapshots of the current blockchain state from the node, delete the old data, and then import the snapshot state back.

Problem summary: The main problem we want to avoid is Tezos users having to periodically export and import the blockchain state to keep the disk usage of the Tezos node bounded. Instead, we want to perform context pruning via automatic garbage collection of unreachable objects. Periodically, a commit should be chosen as the GC root, and objects constructed before the commit that are not reachable from the commit should be considered dead branches, removed from the pack store, and the disk space reclaimed. The problem is that with the current implementation of the pack file, which is just an ordinary file, it is impossible to "delete" regions corresponding to dead objects and reclaim the space.

Automated Garbage Collection Solution

Consider a pack file in which a particular commit object, the GC-commit, has been selected as the commit root for garbage collection.


Objects that precede the commit root are either reachable from the commit (by following object references from it) or not. For the unreachable objects, we want to reclaim the disk space. For reachable objects, we need to be able to continue to access them via their offset in the pack file.

The straightforward solution is to implement the pack file using two other data structures: the suffix and the prefix. The suffix file contains the root commit object (GC-commit) and the live objects represented by all bytes following the offset of GC-commit in the pack file. The prefix file contains all the objects reachable from the root commit, indexed by their offset. Note that the reachable objects appear earlier in the pack file than the root commit.


The layered structure of the pack file with prefix+suffix as the upper layer.

Reading from the pack file is then simulated in an obvious way: if the offset is that of the GC-commit, or later, we read from the suffix file; otherwise, we look up the offset in the prefix and return the appropriate object. We only access the reachable objects in the prefix, via their offset. We replace the Irmin pack file with these two data structures. Every time we perform garbage collection from a given GC-commit, we create the next versions of the prefix and suffix data structures, switch from the current versions to the next, and delete the old prefix and suffix to reclaim disk space. Creating the next versions of the prefix and suffix data structures is potentially expensive. Hence, we implement these steps in a separate process, the GC worker, with minimal impact on the running Tezos node.
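The dispatch between the two files can be sketched as follows; [gc_offset] is the virtual offset of the GC-commit, and [read_suffix], [read_prefix], and [lookup_mapping] are hypothetical helpers passed in as arguments, not the actual irmin-pack functions.

```ocaml
(* A sketch of reading at a virtual offset once the pack file has been split
   into a prefix and a suffix. *)
let read_virtual ~gc_offset ~read_suffix ~read_prefix ~lookup_mapping virtual_off =
  if virtual_off >= gc_offset then
    (* At or after the GC-commit: the object lives in the suffix file, at a
       position relative to the start of the suffix. *)
    Some (read_suffix (virtual_off - gc_offset))
  else
    (* Before the GC-commit: translate the virtual offset into a real offset
       in the prefix file, if the object was reachable and thus copied. *)
    match lookup_mapping virtual_off with
    | Some real_off -> Some (read_prefix real_off)
    | None -> None (* the object was unreachable and has been collected *)
```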

Caveat: Following Git, a commit will typically reference its parent commit, which will then reference its parent, and so on. Clearly, if we used these references to calculate object reachability, all objects would remain reachable forever. However, this is not what we want, so when calculating the set of reachable objects for a given commit, we ignore the references from a commit to its parent commit.
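In other words, the reachability computation follows references to trees and contents but deliberately skips the parent links of commits. Here is a small OCaml sketch of this traversal, assuming a hypothetical [children] function that returns an object's non-parent references by offset.

```ocaml
(* Compute the set of offsets reachable from the GC root commit. References
   from a commit to its parent commits are not returned by [children], so
   older history becomes collectable. *)
let reachable ~(children : int -> int list) (root_commit : int) : (int, unit) Hashtbl.t =
  let live = Hashtbl.create 1024 in
  let rec visit off =
    if not (Hashtbl.mem live off) then begin
      Hashtbl.add live off ();
      List.iter visit (children off)
    end
  in
  visit root_commit;
  live
```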

The prefix Data-Structure

The prefix is a persistent data structure that implements a map from offsets in the pack file to objects (the marshalled bytes representing an object). In our scenario, the GC worker creates the prefix, which is then read-only for the main process. Objects are never mutated in or deleted from the prefix file. In this setting, a straightforward implementation of an object store suffices: we store the reachable objects in a data file and maintain a persistent (int → int) map from "offset in the original pack file" to "offset in the prefix file."

Terminology: We introduce the term "virtual offset" for "offset in the original pack file" and the term "real offset" for "offset in the prefix file." The map from virtual offsets to real offsets is made persistent as the mapping file.

Example: Consider the following, where the pack file contains reachable objects o1 .. o10, (with virtual offsets v1 .. v10, respectively):


Note that the objects o1 .. o10 are scattered throughout the pack file where they appear in ascending order (i.e., v1 < .. < v10). The prefix file contains the same objects but with different "real" offsets r1..r10, as now the objects o1 .. o10 appear one after the other. The mapping needs to contain an entry (v1 → r1) for object o1 (and similarly for the other objects) to relate the virtual offset in the pack file with the real offset in the prefix file.

To read from "virtual offset v3" (say), we use the map to retrieve the real offset in the prefix file (i.e., r3) and then read the object data from that position.

Asynchronous Implementation

Tezos context pruning is performed periodically. We want each round of context pruning to take place asynchronously, with minimal impact on the main Tezos node. For this reason, when a commit is chosen as the GC root, we fork a worker process to construct the next prefix and suffix data structures. When the GC worker terminates, the main process switches from the current prefix+suffix to the next and continues operation. This switch takes place almost instantaneously; the hard work is done in the worker process, as sketched below.
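The underlying fork / non-blocking-wait pattern can be sketched as follows; [build_next_prefix_and_suffix] and [switch_to_next] are hypothetical stand-ins for the real work, so this is an illustration rather than Octez's actual code.

```ocaml
(* Start the GC in a separate worker process and return its pid. *)
let start_gc_worker ~build_next_prefix_and_suffix () : int =
  match Unix.fork () with
  | 0 ->
      (* Child: the GC worker does the expensive work, then exits. *)
      build_next_prefix_and_suffix ();
      exit 0
  | pid ->
      (* Parent: the main node returns immediately and keeps processing blocks. *)
      pid

(* Called periodically from the node's main loop: once the worker has
   finished successfully, perform the near-instantaneous switch. *)
let check_gc_worker ~switch_to_next pid =
  match Unix.waitpid [ Unix.WNOHANG ] pid with
  | 0, _ -> ()                               (* worker still running *)
  | _, Unix.WEXITED 0 -> switch_to_next ()   (* done: switch prefix+suffix *)
  | _, _ -> ()                               (* worker failed: keep current files *)
```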


Read-only Tezos nodes: In addition to the main Tezos read/write node that accesses the pack store, several read-only nodes also access the pack store (and other Irmin data files) in read-only mode. These must be synchronised when the switch is made from the current prefix+suffix to the next prefix+suffix. This synchronisation makes use of a (new) single control file.

Further Optimisations in the Octez Storage Layer

The context pruning via automatic garbage collection performs well and within the required constraints. However, it is possible to make further efficiency improvements. We next describe some potential optimisations we plan to work on over the next months.

Resource-aware garbage collection:

The GC worker makes intensive use of disk, memory, and OS resources. For example, disk and memory usage can double during the asynchronous execution of the GC worker. We plan to improve on this with more intelligent use of resources. For example, computing the reachable objects during the GC involves accessing earlier objects, using many random-access reads with unpredictable latency. A more resource-aware use of the file system ensures that the objects are visited (as much as possible) in order of increasing offset on disk. This takes advantage of the fact that sequential file access is much quicker and more predictable than random access. The work on context pruning via a resource-aware garbage collection is planned for inclusion in Octez v16.
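The gist of the sequential-access optimisation can be sketched in a few lines: once the set of live offsets is known, processing them in sorted order keeps file access (mostly) sequential. [read_at] is a hypothetical helper here.

```ocaml
(* Visit live objects in increasing offset order so that the pack file is
   read (mostly) sequentially rather than randomly. *)
let copy_live_objects ~(read_at : int -> unit) (live_offsets : int list) =
  live_offsets
  |> List.sort compare
  |> List.iter read_at
```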

Retaining older objects for archive nodes:

Archive nodes contain the complete blockchain history, starting from the genesis block. This results in a huge store pack file, many times larger than the kernel’s page cache. Furthermore, live objects are distributed throughout this huge file, which makes it difficult for OS caching to work effectively. As a result, as the store becomes larger, the archive node becomes slower.

In previous prototypes of the layered store, the design also included a "lower" layer. For archive nodes, the lower layer contained all the objects before the most recent GC-commit, whether they were reachable or not. The lower layer was effectively the full segment of the pack file before the GC commit root.

One possibility with the new layout introduced by the GC is to retain the lower layer whilst still sticking with the prefix and mapping files approach, preferentially reading from the prefix where possible. The advantage (compared with just keeping the full pack file) is that the prefix is a dense store of reachable objects, improving OS disk caching and the snapshot export performance for recent commits. In addition, the OS can preferentially cache the prefix and mapping files, which enhances general archive node performance compared with trying to cache the huge pack file. As baking operations only need to access these cached objects, their performance will be more reliable, drastically reducing endorsement misses. However, some uses of the archive node, such as responding to RPC requests concerning arbitrary blocks, would still access the lower layer, so they will not benefit from this optimisation. The work on improving performance for archive nodes is planned for Octez v17.

Conclusion

With the context pruning feature integrated, Tezos rolling and full nodes maintain all of, and only, the storage data still in use, in a performant, compact, and efficient manner. Bakers will benefit from these changes in Octez v15, while the feature will be extended to archive nodes in Octez v17.