Garbage collection for Irmin

Mentor: Clément Pascutto

Location: Tarides office, Paris

Irmin is an OCaml library for building mergeable, branchable distributed data stores. It is built on the same principles as Git.

Irmin data stores currently do not provide garbage collection, which can be an issue for applications that store a large amount of data. The goal of this internship is to provide data recollection:

  1. Unreachable blocks in the store should be collected
  2. Very old objects might need to be removed

For simplicity, (2) can be transformed into (1) by rebasing the history. A downside with this is that two rebased stores might not be able to sync anymore if they pruned different parts of their history.

Solving (2) without transforming it into (1) means supporting partial fetch/push, which the Git protocol doesn't handle very well.

The first part of the internship will consist of expanding the Irmin API to allow proper store manipulations:

  • Irmin.Type should have an explicit combinator for internal keys. This is useful for the GC to follow all the links, even the ones between leaf objects.
  • CONTENT_ADDRESSABLE_STORE should expose a delete operation. This should be atomic.

This second part will consist in an exploratory work on GC strategies and their implementations in Irmin. A first experiment can consist in a "stop-the-world" GC, where all concurrent operations are stopped while the GC is scanning the graph of objects (similar to git gc). Of course, this is not an option for interactive applications where latency is critical, so other strategies will have to be proposed and tested, e.g. a reference counting GC.

Necessary skills

  • Non-trivial functional programming experience
  • Ability to read scientific publications and implement concepts described in them
For more information and to apply, please email clement@tarides.com.
← See all available internships