Garbage collection for Irmin
Mentor: Clément Pascutto
Location: Tarides office, Paris
Irmin is an OCaml library for building mergeable, branchable distributed data stores. It is built on the same principles as Git.
Irmin data stores currently do not provide garbage collection, which can be an issue for applications that store a large amount of data. The goal of this internship is to provide data recollection:
- Unreachable blocks in the store should be collected
- Very old objects might need to be removed
For simplicity, (2) can be transformed into (1) by rebasing the history. A downside with this is that two rebased stores might not be able to sync anymore if they pruned different parts of their history.
Solving (2) without transforming it into (1) means supporting partial fetch/push, which the Git protocol doesn't handle very well.
The first part of the internship will consist of expanding the Irmin API to allow proper store manipulations:
Irmin.Typeshould have an explicit combinator for internal keys. This is useful for the GC to follow all the links, even the ones between leaf objects.
CONTENT_ADDRESSABLE_STOREshould expose a
deleteoperation. This should be atomic.
This second part will consist in an exploratory work on GC strategies and their
implementations in Irmin. A first experiment can consist in a "stop-the-world"
GC, where all concurrent operations are stopped while the GC is scanning the
graph of objects (similar to
git gc). Of course, this is not an option for
interactive applications where latency is critical, so other strategies will
have to be proposed and tested, e.g. a reference counting GC.
- Non-trivial functional programming experience
- Ability to read scientific publications and implement concepts described in them