Monorepos have existed for a very long time. The basic idea behind them is to store the complete source code of your company or product, including all of its dependencies, in a single source repository and have an integrated build and test process for the whole.
I like to think of the BSD systems as the canonical example of open-source monorepos, but these were never seen as good practice by many (cf. CatB). Monorepos have only become popular during the last decade because of Google’s engineering practices and their increasing desire to share details with the public. Google’s monorepo is probably the largest in the world, and if it works at their scale, the thinking goes, it must be good for everyone else, right? Not so fast.
Monorepos are an interesting beast. If tended properly, they indeed enable a level of uniformity and code quality that is hard to achieve otherwise. If left unattended, however, they become unmanageable monsters of tangled dependencies, slow builds, and frustrating developer experiences. Whether you have a good or bad experience directly depends on the level of engineering support behind the monorepo. Simply put, monorepos require dedicated teams (plural) and tools to run nicely (not just random engineers “volunteering” their efforts), and these cost a lot of time and money.
As a consequence of this cost, you must be wary before adopting a monorepo model. You must have a good story around support upfront, or else you are in for long-term pain. And once you are in a monorepo, the “long term” is guaranteed because untangling the dependency mess that arises in such an environment can be next to impossible. The worst scenario is when you did not actively decide to adopt a monorepo but end up in one due to organic or unexpectedly quick growth, in which case your tooling and practices are almost certainly not ready for it.
In this post, I will look at how Google is able to successfully run the world’s largest monorepo while keeping build times minimal. They can, for example, validate and merge most PRs on CI within minutes while having the almost-absolute confidence that they won’t break anything—yet it’s impossible to build the whole repository in those few minutes. I’ll restrict this post to analyzing build times and will specifically avoid talking about test times. Both are crucially important but both have very different solutions. Maybe a follow-up post will cover tests 😉.
One repo vs. many
A key feature of a monorepo, as mentioned earlier, is to have a unified build process for the whole. In the common case, the repository has a single entry point at the top level, which means that the entire repository has to be built for every change to ensure the health of the tree. It might be possible for individual engineers to hand-pick which parts of the tree to build on their development machine (e.g. by running make on a subdirectory), but CI environments will blindly do builds from the root.
This approach, tied to the fact that monorepos grow unboundedly with the company or product they support, causes build times to balloon… until they grind the development processes to a halt. If you add to this that most build systems need occasional clean operations, frustration is guaranteed:
- Multi-minute- or multi-hour-long penalties make it very hard to interact with the codebase.
- Pull requests are almost impossible to manage as they need long validation times and might hit unpredicted merge conflicts.
- Quality suffers because developers won’t want to pay the penalty of yet another CI run just to address nits raised in the final pass of a code review.
Facing these circumstances, the natural temptation is to split the repository into smaller pieces and regain control of the build times. (This, by the way, is the specific project I’m involved in right now.) And, depending on the tooling you are subject to and the freedom (or lack thereof) you have in changing it, this may be the only possible/correct answer. But why? Why would moving to smaller repositories fix build times? The total amount of code won’t get smaller just by splitting it; if anything, it might grow even more! The answer may be obvious:
Multiple repositories introduce synchronization points. Under such a model, the cross-repository dependencies are expressed as binary package dependencies with specific version numbers. In essence, the smaller repositories leverage builds that others have already done and thus bound their build times.
Say hello to caching
Yet… Google’s monorepo is known for not using binary artifacts: they build everything at head with the exception of the most basic C++ toolchain (known as crosstool in
Google builds rely on a cross-user (remote) massive artifact cache that stores the outputs of all build actions. This cache is what introduces the same “synchronization points” that multiple repositories benefit from. By leveraging this cache, the vast majority of dependencies needed for any given build will have already been compiled by someone or something else and will be reused.
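To make the idea concrete, here is a minimal sketch of such a cache, assuming an in-memory dictionary stands in for the remote service. The names ActionCache, action_key, and fake_compile are hypothetical and are not Bazel's API; the point is only that an action's output is looked up by a key derived from its command line and the content of its inputs.

```python
import hashlib
from typing import Callable, Dict, List


def action_key(command: List[str], inputs: Dict[str, bytes]) -> str:
    """Derive a cache key from the command line and the content of every input."""
    h = hashlib.sha256()
    for arg in command:
        h.update(b"arg\0" + arg.encode() + b"\0")
    for path in sorted(inputs):
        h.update(b"input\0" + path.encode() + b"\0")
        h.update(hashlib.sha256(inputs[path]).digest())
    return h.hexdigest()


class ActionCache:
    """In-memory stand-in (an assumption) for a shared, cross-user cache."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def run_action(
        self,
        command: List[str],
        inputs: Dict[str, bytes],
        execute: Callable[[List[str], Dict[str, bytes]], bytes],
    ) -> bytes:
        key = action_key(command, inputs)
        if key in self._store:
            self.hits += 1          # somebody already built this: reuse it
            return self._store[key]
        self.misses += 1            # nobody has built this yet: do the work
        output = execute(command, inputs)
        self._store[key] = output
        return output


def fake_compile(command: List[str], inputs: Dict[str, bytes]) -> bytes:
    # Hypothetical "compiler" whose output depends only on the declared inputs.
    return b"OBJ:" + b"".join(inputs[p] for p in sorted(inputs))


cache = ActionCache()
sources = {"lib/foo.c": b"int foo(void) { return 1; }"}
cmd = ["cc", "-c", "lib/foo.c"]
first = cache.run_action(cmd, sources, fake_compile)   # e.g. a CI run: miss
second = cache.run_action(cmd, sources, fake_compile)  # e.g. your build: hit
print(cache.hits, cache.misses)
```

The second invocation never runs the compiler, which is the same effect a prebuilt binary package gives you in a multi-repository setup.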
But… caching alone is insufficient to fix build times unless you get 90%+ cache hit rates overall. And that’s where the differences between awful build times in a monorepo and great build times lie. In the next sections, I’ll look into the specific mechanisms that lead to such high cache hit rates. And if Google can have them, so can you. Bazel, from its inception, has been trying to spread these practices to the public—but you can apply these same concepts with other build tools too.
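A bit of arithmetic shows why the hit rate matters so much. The numbers below (100,000 actions at 2 seconds each) are made up purely for illustration, and the model deliberately ignores parallelism and cache lookup latency.

```python
def seconds_of_uncached_work(total_actions: int, hit_rate_percent: int,
                             seconds_per_action: int) -> int:
    """Seconds of action execution still needed after accounting for cache hits."""
    misses = total_actions * (100 - hit_rate_percent) // 100
    return misses * seconds_per_action


TOTAL_ACTIONS = 100_000   # made-up size of a full build
SECONDS_PER_ACTION = 2    # made-up average cost of one action

for rate in (0, 50, 90, 99):
    work = seconds_of_uncached_work(TOTAL_ACTIONS, rate, SECONDS_PER_ACTION)
    print(f"{rate:>3}% hit rate -> {work:>7} s of work left")
```

Under these assumptions, a 50% hit rate still leaves half of a huge build to run, while going from 90% to 99% cuts the remaining work by another factor of ten.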
Ensure action determinism
First of all, we have to ensure that build actions (e.g. compiler invocations, resource bundling) are deterministic. Given a set of input files and tools to process them, the output must not be subject to environmental differences. In other words: build actions must be sufficiently-specified so that they don’t rely on hidden dependencies that could change their outputs.
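As a rough illustration of what a hidden dependency looks like, the sketch below models an action as a function of its declared inputs plus an ambient environment, and checks whether the output changes when only the environment does. Everything here (leaky_action, hermetic_action, is_deterministic, the environment values) is hypothetical and only stands in for real compiler invocations.

```python
import hashlib
from typing import Callable, Dict

Action = Callable[[Dict[str, bytes], Dict[str, str]], bytes]


def leaky_action(inputs: Dict[str, bytes], env: Dict[str, str]) -> bytes:
    # Hidden dependency: the output embeds a value that is not a declared input.
    return inputs["main.c"] + b" built-by=" + env.get("USER", "unknown").encode()


def hermetic_action(inputs: Dict[str, bytes], env: Dict[str, str]) -> bytes:
    # The output is a function of the declared inputs only.
    return b"OBJ:" + inputs["main.c"]


def is_deterministic(action: Action, inputs: Dict[str, bytes]) -> bool:
    """Run the action under two different environments and compare output digests."""
    env_a = {"USER": "alice", "TZ": "UTC"}
    env_b = {"USER": "bob", "TZ": "Europe/Madrid"}
    digest_a = hashlib.sha256(action(inputs, env_a)).hexdigest()
    digest_b = hashlib.sha256(action(inputs, env_b)).hexdigest()
    return digest_a == digest_b


inputs = {"main.c": b"int main(void) { return 0; }"}
print(is_deterministic(leaky_action, inputs))     # False
print(is_deterministic(hermetic_action, inputs))  # True
```

An output like the leaky one cannot be shared safely across users: the cache key says "same inputs" while the bytes say otherwise.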
I covered this topic in depth in the previous “How does Google avoid clean builds?” post where I analyzed how this simple idea allows incremental builds to always work. I also touched on some of the many other benefits that it brings, including how it can lead to optimal build times, which is what I’m covering here.
Deterministic actions are the foundation for reducing build times in a monorepo. So… you must sort this out before you can take advantage of any of the remaining points.
Reduce the number of configurations
Once build actions are deterministic, the next thing to worry about is increasing the chances of reusing previously-cached actions for any given build. This is generally only possible when builds use the exact same configuration, or else their outputs will differ and may be incompatible with each other. For example: a debug build will not be able to safely reuse the outputs of a release build. Which is kinda obvious, but there aren’t just two configurations. There tend to be more. Many more.
In fact, the number of build configurations grows with the size of the project—and, paradoxically, the number of options grows to reduce build times. You see: when engineers suffer from painfully slow builds, in the best of their intentions, they add knobs to conditionally compile certain parts of the project to speed things up. Which is great in single-user/single-machine builds, but it doesn’t scale and is an anti-pattern in a monorepo world where we must have almost-perfect caching across users.
Thus, for monorepo builds to be quick, we actually have to homogenize configurations so that most engineers and CI are running almost the same configuration. In practical terms, this means having debug builds for interactive development, and release builds for production usage. And that’s about it.
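The effect of build-time knobs on cache sharing can be sketched by counting how many incompatible variants of one target exist. The target label and the feature names below are invented for illustration; the only assumption that matters is that the configuration is part of the cache key.

```python
import hashlib
from itertools import product
from typing import Dict


def config_key(target: str, flags: Dict[str, object]) -> str:
    """Cache key for a target built under a specific configuration."""
    canonical = target + "|" + ",".join(f"{k}={flags[k]}" for k in sorted(flags))
    return hashlib.sha256(canonical.encode()).hexdigest()


# Hypothetical optional components that engineers can switch off locally.
optional_features = ["ads", "maps", "payments", "chat"]

all_configs = [
    dict(zip(optional_features, values))
    for values in product([False, True], repeat=len(optional_features))
]
knob_keys = {config_key("//app:core", cfg) for cfg in all_configs}

# The homogenized alternative: two blessed configurations and nothing else.
blessed_keys = {config_key("//app:core", {"mode": m}) for m in ("debug", "release")}

print(len(knob_keys), len(blessed_keys))
```

Four boolean knobs already split the cache into 16 partitions for a single target, so two engineers rarely land on the same entry. With two blessed configurations, everyone doing interactive development shares one partition.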
As an anecdote, we encountered this specific issue while onboarding a certain iOS team into remote builds. This team had been developing on laptops only, and because they had slow builds, the engineers had added many feature flags to make components optional. When we moved them to remote builds and remote caching, we saw very little benefit: these users were still rebuilding everything, and the reason was that they could not make use of previously-cached results due to different configurations. Once we removed most build-time conditionals, they started truly seeing the benefits of remote caching/execution and faster builds. Counterintuitive, isn’t it?
Use CI to populate the cache
By this point, we know that our actions are deterministic and that user builds have a chance to reuse the cache because we have made builds uniform. But here comes the question: how do we seed the cache?
There are different alternatives, but we must first talk about security.
A cross-user cache of build artifacts must be populated only from trusted sources. We cannot have a developer’s workstation inject output artifacts that were built on their machine because the developer could have tampered with the outputs. This means that the cross-user remote cache can only be populated from machines that cannot be tampered with. (If the cache is not used across users, then no extra precautions are necessary.) These trusted sources come in two forms:
The first one, and the one that’s easy to apply in almost any case, is CI runs. CI runs happen on machines that are not (and should not be!) directly accessible by the larger engineering population, thus we can trust that the output they generate comes precisely from the code and tools that were fed to them.
The second one, and the one that’s more subtle, is user-initiated builds. When an engineer is building on their workstation, we can cache their build outputs across users if we generate those outputs using a trusted environment. Such a trusted environment comes in the form of remote execution. Under a remote execution scenario, the build machine sends individual build actions to a remote service. This service, which is trusted too, runs actions on behalf of the user, but the user doesn’t have a chance to interfere with the outputs before they are saved into the cache.
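One way to picture the resulting policy is a cache that anyone can read but only trusted principals can write. This is a toy sketch: the principal names and the SharedCache class are assumptions, and a real deployment would enforce this with authentication at the cache service rather than a string comparison.

```python
from typing import Dict, Optional

# Hypothetical principals allowed to write into the cross-user cache.
TRUSTED_WRITERS = {"ci", "remote-executor"}


class SharedCache:
    """Toy cross-user cache: open for reads, restricted for writes."""

    def __init__(self) -> None:
        self._entries: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Every engineer and every CI job may reuse existing outputs.
        return self._entries.get(key)

    def put(self, principal: str, key: str, output: bytes) -> bool:
        # Outputs produced on a workstation could have been tampered with,
        # so writes from untrusted principals are rejected.
        if principal not in TRUSTED_WRITERS:
            return False
        self._entries[key] = output
        return True


shared = SharedCache()
accepted_ci = shared.put("ci", "key-1", b"object code from CI")
accepted_laptop = shared.put("laptop:alice", "key-2", b"locally built object code")
print(accepted_ci, accepted_laptop, shared.get("key-1") is not None)
```

The laptop can still benefit from everything CI and the remote executors have produced; it just cannot contribute entries that others would consume.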
You might think that a compromised compiler is an attack vector in this scenario: the engineer modifies the compiler, sends it to the remote CI machine or remote execution worker to build a piece of code, and the outputs that contain malicious code are injected into the cache. But that’s not the case. Remember: the compiler is also an input to the action, just like the source files are. If you tamper with the compiler, then the action’s signature changes, which means the cache key changes, which means nobody will be able to address such cache entry unless they have the same compromised compiler.
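The argument can be checked with a small sketch in which the compiler binary is digested into the action key alongside the sources. The key layout here is an assumption made for illustration and is not the exact scheme any particular build system uses.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def action_key(compiler: bytes, argv: List[str], sources: Dict[str, bytes]) -> str:
    """Key an action by the tool's content, its arguments, and its source files."""
    parts = [digest(compiler)]
    parts += argv
    parts += [f"{path}:{digest(sources[path])}" for path in sorted(sources)]
    return digest("\n".join(parts).encode())


sources = {"auth.c": b"int check(void) { return 1; }"}
argv = ["-O2", "-c", "auth.c"]

good_compiler = b"<bytes of the official compiler>"      # placeholder content
evil_compiler = b"<bytes of a compiler with a backdoor>"  # placeholder content

key_everyone_uses = action_key(good_compiler, argv, sources)
key_attacker_wrote = action_key(evil_compiler, argv, sources)
print(key_everyone_uses != key_attacker_wrote)
```

The malicious output does get stored, but under a key that honest builds never compute, so they never fetch it.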
In the end, you want to take both approaches. You want periodic CI runs to populate the cache as these ensure that the cache is frequently repopulated from HEAD. And you want to incrementally cache outputs that originate from interactive builds and PRs to keep up with drift and to keep up with possible divergences in the configurations that they use.
As long as users stick to the blessed configurations discussed in the previous section, and as long as you have these sources of data to populate the cache, then you should start seeing high cache hit ratios and faster build times overall.
Are monorepos worth it?
Definitely. I’m a believer that they are good, and not only because of Google: the BSDs taught me this principle a very long time ago. That said, it’s true that they can be a poor choice unless your engineering practices and tools keep up, which as mentioned in the opening, can be extremely expensive. If you cannot afford the cost, it might be better to stay with multiple smaller repositories and let smaller groups of developers handle them in an ad-hoc manner. The whole won’t be as nice, but the overall experience might be less frustrating.
In the end, monorepos are an “implementation detail” of bundling multiple separate components in one place. Having a monorepo should not imply that you must build the whole thing every time you make a change. Having multiple repositories should not mean that they are isolated islands. And the latter is, precisely, what we are trying to prove in my current team: we want to see how well we can integrate smaller repositories without actually being a monorepo. Because, with suboptimal tooling… build times are not the only problem. Source control is another major one: Git, if you happen to use it, is not the greatest choice for monorepos.
If you enjoyed this post, subscribe to this blog to keep up with two upcoming, related topics! One will be on ensuring build times remain low once we have achieved them; and the other may be on how Google keeps test runs on CI fast.