Project-Wide Occurrences: A New Navigation Feature for OCaml 5.2 Users

by Ulysse Gérard on Aug 28th, 2024

Project-Wide Occurrences

With the release of merlin-lib 5.1-502 and associated ocaml-lsp-server, we brought a new, exciting feature to OCaml's editor tooling: project-wide occurrences.

The traditional "occurrences" query in Merlin modes, named "Find All References" in LSP-based mode, was used to only return results in the active buffer. This is no longer the case!

Occurrences queries will now return every usage of the selected identifier across all of the project's source files.

There are some limitations that come with this initial release. When queried from an identifier's usage or its definition, all other usages of it are returned, but related declarations are not. In particular, this means that queries should be made from implementation files, not interfaces (.mli).

In this post, we will give an overview of the ecosystem's various parts that need to work together for this feature to work.

Try It!

Before diving into technical details, let's see how it works. You can try it on any project that builds with Dune and is compatible with OCaml 5.2.

Update your switch by running opam update && opam upgrade to get the required tool versions:

  • Dune >= 3.16.0
  • Merlin >= 5.1-502
  • OCam-LSP >= 1.19.0

Since we are looking for all occurrences, we need to build an index for Merlin and LSP. Fortunately, this is well integrated in Dune, and you can build the index for your project by running:

dune build @ocaml-index

This alias ensures that all the artifacts needed by Merlin are built. You can also add --watch to always keep the configuration and the indexes up to date while you edit your source files.

Note that unlike dune build @check, the @ocaml-index will build the entire project, including tests.

Once the index is ready, you can query for project-wide occurrences:

  • merlin-project-occurrences in Emacs
  • MerlinOccurrencesProjectWide in Vim
  • Find All References in LSP-based plugins.

Here is a comparison of a references query before, and after building the index:

Now, let's dive into more technical details.

High-Level Overview

The base design is fairly simple. We want to iterate on every identifier of every source file, determine their definition, and group together those that share the same definition. This forms an index. Tools can then query that index to get the location list of identifiers that share the same definition.

The following section describes how we implemented that workflow:

  1. Compute definitions using two-step shape reduction
  2. Driving of the indexer tool by the build system
  3. Changes to Merlin to properly answer queries

Two-Step Shape Reduction

Finding an identifier's definition in OCaml is a difficult problem, mostly because of its powerful module system. A solution to this problem has been recently described in a presentation at the ML Workshop: shapes. In short, shapes are terms of a simple lambda-calculus that represent an abstraction of the module system. To find an identifier's definition, one can build a shape from its path and reduce (as in beta-reduction) that shape. The result should be a leaf with a UID that uniquely represents the definition.

This has been implemented in the compiler, and Merlin already takes advantage of it to provide a precise jump-to-definition feature.

For project-wide occurrences, we perform shape reduction in two steps:

First, at the end of a module's compilation, the compiler iterates on the Typedtree and locally reduces every identifier's shape. If the identifier is locally (in the same unit) defined, the result will be a fully-reduced shape holding the definition's UID. However, if the identifier is identified in another compilation unit, the result is a partially-reduced shape, because we cannot load the .cmt files of other compilation units (that are required to finish the reduction) without breaking the separate compilation assumptions. These resulting UIDs or partially-reduced shapes are stored in the unit's .cmt file:

type cmt_infos = {
  ...
  cmt_ident_occurrences :
    (ident_with_loc * def_uid_or_shape) list
}

Then, an external tool (called ocaml-index) will read that list and finish the reduction of the shapes when necessary. This step might load the .cmt files of any transitive dependency the unit relies on.

Indexation by the Build System

The tool we just introduced, ocaml-index, plays two roles:

  1. It finishes the reduction of the shapes stored in the .cmt files.
  2. It aggregates locations that share the same definition's UID.

The result is an index that is written to a file. Additionally, the tool can merge multiple indexes together. This allows build systems to handle indexation in the way they decide.

We only provide rules for Dune right now, but the tools themselves are built system agnostic. The Dune rules are as follow:

For every stanza library or executable, we index every .cmt file and store the results into an index file for the stanza.

  • This process, similar to linking, depends on every transitive dependency of the stanza being indexed, since shape reduction might require loading those cmt files.
  • Additionally, if any of the dependencies' indexes have changed, each stanza's index must be rebuilt.

This is a somewhat simple but heavy process, and it could be refined in the future. Right now it performs well enough to provide a usable watch mode in small to fairly large projects (like Dune itself).

Index Configuration and Reading

Last but not least, we need a way for Merlin to know were the index files are located and how to read them.

This is done by using a new configuration directive INDEX. It can be used to provide one or more index files to Merlin. Then, querying for all the usages of the identifier under the cursor is done in the following way:

  • Identify the path of the identifier under the cursor
  • Reduce the shape corresponding to this path to get the definition's UID.
  • Lookup this UID in the index files and in the current buffer's index (which is computed by Merlin).
  • Return all the locations

Future Work

Thank you for reading this post! We hope you will have a lot of fun exploring your codebases using this new feature. We have a lot of exciting improvements on our roadmap, some of which involve returning the declarations linked to an identifier and providing project-wide renaming queries.

If you are interested to learn more about these features or to contribute, please have a look at this tracking issue. You can also have a look at the announcement and wiki page. Finally, feel free to attend future Merlin public meetings and watch the talk at the OCaml Workshop during ICFP!

Tarides is an open-source company first. Our top priorities are to establish and tend to the OCaml community. Similarly, we’re dedicated to the development of the OCaml language and enjoy collaborating with industry partners and individual engineers to continue improving the performance and features of OCaml. We want you to join the OCaml community, test the language and tools, and actively be part of the language’s evolution.

Tarides is always happy to discuss commercial opportunities around OCaml. There are many areas where we can help industrial users to adopt OCaml 5 more quickly, including training, support, custom developments, etc. Please contact us if you are interested in discussing your needs.