Here at Snowflake, the Developer Productivity Engineering organization (DPE for short) is tackling some important problems we face as a company: namely, lengthening build times and complex development environments. A key strategy we are pursuing to resolve these is the migration of our main build processes from CMake and Maven to Bazel.

We are still in the early stages of this migration and cannot yet share many details or a success story, but we can start explaining some of the issues we encounter as we work through this ambitious project.

More specifically, in today’s post I want to talk about how we diagnosed and fixed three different issues that made Bazel trip over the Linux OOM killer. These issues led to spurious build failures and made our workstations unusable due to memory thrashing.

The guiding principle behind the fixes I’ll describe is that flaky builds are infinitely worse than slow builds. A build that passes 100% of the time but is slower than it could be will convince developers that make clean can be a thing of the past. A build that is really fast but breaks at random will do the opposite: it will signal sloppiness and a lack of quality, leaving skeptical developers wondering whether adopting Bazel is worth the migration cost. Therefore, at this early stage in the migration process, it is fine for us to trade build speed for reliability.

Let’s dive in.

Concurrent linkers

The first problem we encountered was obvious from the outset given that our CMake builds had exhibited the same issue in the past and we already had a workaround for it in place.

As is common in the build graph of complex applications with many tests, we have a collection of C++ test targets that depend on heavy common libraries. Each of these tests is linked separately, and the linker consumes a significant amount of memory to process each one of them: about 8GB per linker invocation in the -c fastbuild configuration. Unsurprisingly, if we concurrently run a handful of these on a local dev environment capped at, say, 16–20GB of RAM, we quickly run into OOM scenarios.

But, if you know some Bazel internals, you’d expect this not to happen: Bazel has provisions to avoid overloading the host machine when scheduling local actions, so, in theory, we should not be seeing any issues. To summarize how this works: every build rule estimates the resources its build actions will consume, expressed as an “X CPUs and Y MBs of RAM” quantity. Bazel then compares these numbers against its own understanding of total machine capacity and uses this information to limit the parallelism of local actions.
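
For reference, Bazel’s idea of the machine’s total capacity can itself be adjusted. Here is a hedged illustration of the relevant settings in a bazelrc (or equivalently on the command line); the values shown mirror what I understand to be the defaults, not our configuration:

build --local_cpu_resources=HOST_CPUS
build --local_ram_resources=HOST_RAM*.67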

Sounds good, right? Unfortunately, this mechanism isn’t great because it relies on the build rules to provide an accurate estimate of their resource consumption. This estimation is hard to do upfront, especially when the rules support a multitude of toolchains with potentially different performance profiles. Let’s peek under the hood and see what the C++ rules do in order to compute the memory requirements of every linker action. Here is what the code in CppLinkAction.java had to say circa 2021:

// Linking uses a lot of memory; estimate 1 MB per input file,
// min 1.5 Gib. It is vital to not underestimate too much here,
// because running too many concurrent links can thrash the machine
// to the point where it stops responding to keystrokes or mouse
// clicks. This is primarily a problem with memory consumption, not
// CPU or I/O usage.
public static final ResourceSet LINK_RESOURCES_PER_INPUT =
    ResourceSet.createWithRamCpu(1, 0);

// This defines the minimum of each resource that will be reserved.
public static final ResourceSet MIN_STATIC_LINK_RESOURCES =
    ResourceSet.createWithRamCpu(1536, 1);

// Dynamic linking should be cheaper than static linking.
public static final ResourceSet MIN_DYNAMIC_LINK_RESOURCES =
    ResourceSet.createWithRamCpu(1024, 1);

This behavior changed in commit 01c10e03 in an attempt to improve the situation based on build performance data collected at Google:

commit 01c10e030c1e453fa814d316f8f9950420bd3de7
Author: wilwell <wilwell@google.com>
Date:   Fri Jul 16 05:41:57 2021 -0700

    Memory expectations for local CppLink action

    During investigation we find out the  better linear dependency
    between number of inputs and memory.
    Using our data we made linear estimation of form C + K * inputs
    such that 95% of actions used less memory than estimated.

    In case of memory overestimate  we will make our builds slower,
    because of large amount unused memory, which we could use for
    execution other actions.

    In case of memory underestimate we could overload system and get
    OOMs during builds.

From a previous life, I knew that this piece of code was problematic for local linking, but this recent commit made me think that the new memory estimation in Bazel was close to reality: Google has a vast repository of build performance data from which to deduce a good model. Therefore, I was left to think that our build graph broke these assumptions for some reason.

Yet… something was off. As a first experiment, I tried to limit the RAM available to Bazel with --local_ram_resources=8192, hoping that a much lower limit than the default would cap the number of concurrent linkers. Interestingly, this did not have any effect. I tried going lower, specifying 1GB as the limit, and the results were equally puzzling. Why wasn’t this flag limiting linker concurrency at all? The answer lies in the estimation logic described above: I patched the C++ link rule to print the resource limits it computed and found that it concluded that each of our link actions needed only 50MB of RAM. 50MB is wrong by a factor of ~160 compared to the 8GB we measured, and it explains why lowering --local_ram_resources did not make a difference: even a 1GB cap leaves room, as far as Bazel is concerned, for about 20 concurrent linkers.

The most likely explanation for this discrepancy is that our Bazel configuration was still using the old and rusty ld when I debugged this problem, whereas Google presumably drew its conclusions from gold or lld, but I do not know for sure yet. Note that, at the time of this writing, we have already moved away from ld. Issue #17368 tracks this.

Real fix: resource set overrides

The simplest solution to this problem would have been to tweak the C++ build rules and update their memory model for our scenario: if we could have told Bazel that our linker actions require 8GB of RAM each, we would have done just that and called it a day.

And I tried. I researched whether we could specify a tag of the form ram:Xmb for the linker rules, hoping that such a tag would override the requirements computed by the rules. Support for a ram:Xmb tag, however, does not exist. There is support for a cpu:N tag, so as a compromise I thought of leveraging it to claim that the linker uses all CPUs on the machine. But… cpu:N only applies to tests and is not recognized in other kinds of targets. Digging deeper, I discovered the --experimental_allow_tags_propagation option, which I hoped would cause the cpu:N tag to be propagated to the actions and have the desired effect, but testing revealed that this was not the case either. (I’m actually not yet sure what this flag does.)
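
For illustration, this is roughly the shape of annotation I was after, using the only tag of this kind that exists today. The target and dependency names are made up, and, as noted above, cpu:N is only honored for test targets; the hope was that --experimental_allow_tags_propagation would carry it down to the link action, which it did not:

cc_test(
    name = "heavy_test",          # hypothetical target
    srcs = ["heavy_test.cc"],
    deps = ["//common:big_lib"],  # hypothetical heavy common library
    # Claim most of the machine so that Bazel does not schedule many of
    # these targets (and hence their links) at the same time.
    tags = ["cpu:8"],
)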

If the C++ rules had been written in Starlark, and if issue #6477 from 2018 had been implemented, we might have been able to paper over the problem. But the C++ rules are still native rules, which means that they cannot be modified without rebuilding Bazel. Not an option at this point.

I hit a lot of dead ends. Short of modifying Bazel to special-case our requirements, which we are trying hard to avoid, I had to find another solution that could be implemented within the constraints of what Bazel allows today. Which, by the way, I kinda enjoy doing.

Workaround: linker wrapper to limit concurrency

The workaround came in the form of a wrapper over the linker plumbed through our own toolchain definition. This wrapper is a simple script that uses flock(1) on the linker binary to allow just one linker invocation at a time. This is suboptimal because, by the time the wrapper runs, Bazel has already decided to run the action. As a result, Bazel is holding one of its job slots hostage to a script that may do nothing for many seconds, lowering overall throughput. In practice, however, this is not a problem because most link actions pile up towards the end of the build where parallelism is already minimal and where we really cannot afford to run linkers in parallel.
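
Here is a minimal sketch of such a wrapper. The linker path below is a placeholder; in our setup it is plumbed in through the toolchain definition:

#!/bin/sh
# ld-wrapper.sh: allow only one linker invocation on this machine at a time.
# REAL_LD is a placeholder path; ours comes from the toolchain definition.
REAL_LD="/usr/bin/ld"
# flock(1) takes an exclusive lock on the linker binary itself, so every
# other wrapped invocation blocks until the running one finishes.
exec flock "${REAL_LD}" "${REAL_LD}" "$@"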

Implementing this wrapper sounds simple, but the devil is in the details, as is usually the case. For example: given that the wrapper has to run a specific linker, I needed the wrapper to embed the path to that linker. I wanted to generate the wrapper with a genrule, but this seems impossible to achieve, as described in issue #17401. Additionally, even after working around that issue, I encountered further problems with the cmake rule, which somehow ended up trying to invoke the wrapper script via a relative path and failed to do so. After reading the code of the cmake rule, I found that the rules only absolutize the paths to tools that come from external repos… so, as yet another workaround, I created a nested workspace to hold the wrapper, which was sufficient to trick cmake into doing the right thing.

Remember: this solution was just a workaround that we could live with for a little while. Since I wrote this draft, we have adopted lld in our Bazel builds, just as our CMake builds already did, and we have mostly adopted remote execution; both changes have made this issue invisible.

Concurrent foreign builds

The second problem we encountered was due to our use of rules_foreign_cc to build a bunch of C++ external dependencies. We have a dozen or so of these using a combination of configure_make, make, and cmake.

Our first cut at building these dependencies was to pass -j $(nproc) as an argument to the foreign rules. This works great when the actions spawned by these rules run in a remote build farm: each executor node will run the nested build in a container that will expose as many CPUs as it wants to expose and cause no harm to sibling processes. But this does not work so well in a local build. In the ideal case, these nested builds would end up evenly spaced throughout the build, spreading their resource overload to random points. Unfortunately, that’s not what we observed: our build graph has a choke point on these foreign dependencies so, during a build, it is easy to notice that Bazel has to run and wait for a bunch of these foreign compiles simultaneously.

As you can imagine, this can turn into a problematic scenario. For example: if the host machine running Bazel has 8 CPUs in total and Bazel is running 6 nested builds (based on the default --local_cpu_resources computation), each configured to run 8 parallel jobs via -j 8, we potentially have 6 * 8 = 48 resource-hungry processes in competition. If these processes compete for CPU alone, then they will take a long time to finish but nothing too bad will happen. If they compete for RAM, however, as happens with linkers as described earlier, then it’s easy to enter a thrashing situation that’s hard to get out of.

Real fix: cooperative parallelism

The correct solution is to teach Bazel cooperative parallelism (issue #10443): every time Bazel runs an action that can consume N CPUs, Bazel should be aware of this fact and schedule it accordingly.

You can imagine this as a cpu:N tag like the one described above, which would indicate the parallelism of each nested build and would sequence them when run through Bazel. Or it could be in the form of Bazel leveraging GNU Make’s jobserver feature to coordinate the parallelism of those submakes. I’m not going to design this feature in this post but… are you interested in doing this? It sounds like an awesome intern project that I’d be pleased to mentor!

Workaround: nested make job limits

As a compromise, I went back to the wrapper approach. In this case, a wrapper checks for details of the running environment to determine whether the nested build can use all resources of the machine or not. If it can, such as on build farm workers, then the wrapper passes -j $(nproc) to these nested builds and calls it a day. If it cannot, then the wrapper tries to be smart: for GNU Make, it passes -j $(nproc) -l $(nproc) to try to use as many CPUs as possible while accounting for load average; and for CMake, it just passes -j 2 as I have not found out how to plumb through the -l equivalent.
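
The following is a simplified sketch of the GNU Make branch of that wrapper. The environment check is a placeholder for whatever detection applies to your setup, and the CMake branch is omitted:

#!/bin/sh
# make-wrapper.sh: choose nested-build parallelism based on where we run.
# ON_BUILD_FARM is a placeholder for the real environment detection.
if [ -n "${ON_BUILD_FARM:-}" ]; then
    # Remote executors run each action in its own container, so the nested
    # build can use every CPU it sees without harming sibling actions.
    exec make -j "$(nproc)" "$@"
else
    # Locally, several of these nested builds may run at once, so bound the
    # parallelism by load average to avoid overcommitting the machine.
    exec make -j "$(nproc)" -l "$(nproc)" "$@"
fi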

Like before, adding such a wrapper sounds simple in theory but becomes hard to do in practice. A specific detail that turned out to be problematic is the way rules_foreign_cc constructs wrapper scripts. The rules try to generate scripts that work both under a POSIX shell and under Windows’ cmd.exe… which is a nightmare due to quoting differences, and the resulting scripts are unreadable. In the end, I got this to work, which was quite nice.

Concurrent remote actions

The third problem we encountered was the most puzzling of all. After resolving the other two issues, there was nothing else left in the build that could consume too much memory. Yet… occasionally, the Linux OOM killer would show up and terminate Bazel. And it would always terminate Bazel. Unfair, isn’t it?

Fortunately, when the Linux OOM killer kicks in, it dumps the process table along with memory statistics to the kernel log. Looking there, I noticed two culprits:

[648264.500883] Tasks state (memory values in pages):
[648264.500884] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
[648264.500993] [ 10337] 1970 10337 5440032 3303604 34402304 585643 0 java
...
[648264.501024] [ 30196] 1970 30196 1353096 1344352 10874880 0 0 ld
...
[648264.501030] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1970.slice,task=java,pid=10337,uid=1970
[648264.501486] Out of memory: Killed process 10337 (java) total-vm:21760128kB, anon-rss:13214416kB, file-rss:0kB, shmem-rss:0kB, UID:1970 pgtables:33596kB oom_score_adj:0
[648265.100163] oom_reaper: reaped process 10337 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

A single linker (thanks to the changes described earlier) and Bazel itself. If we do the math, because total_vm and rss are reported in 4KiB pages (because why would they not be), we see that the linker is using about 5GB of RAM and Bazel is using… 13GB? Wait, what? Why?
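
As a quick sanity check on those figures (the kernel reports these values in pages of 4KiB each, which the anon-rss figure of 13214416kB in the kill line confirms):

echo "$((3303604 * 4 / 1024)) MiB"   # Bazel (java): ~12904 MiB, i.e. the ~13GB above
echo "$((1344352 * 4 / 1024)) MiB"   # linker (ld):  ~5251 MiB, i.e. the ~5GB above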

What I noticed from these crashes is that they seemed to happen when Bazel was running multiple remote Compiling actions at once, at least as reported in the Bazel UI. This made me suspect (again, thanks to past experience) that the state Bazel was holding in memory for each running action was large, and when combined with hundreds of parallel actions, memory requirements ballooned. But still, 13GB was a lot, and if this were true, there would be few options for us short of growing the total RAM of our dev environments.

Looking closer, I noticed that during our initial deployment of our build farm, we bumped the max heap size that Bazel was allowed to use to 14GB. The rationale given at the time was that the build graph was too big and we needed more RAM due to the increased --jobs number. Which might be true, but this had to be better substantiated: for one, the build graph doesn’t grow with an increase of --jobs, and for another, coordinating remote jobs shouldn’t really require that much memory.

Also note that a large JVM heap limit doesn’t necessarily mean that all memory will be live. An implication of a large heap is that the JRE will postpone GC cycles for longer. So by giving Bazel a max heap of 14GB on a 16–20GB environment, we were telling the JVM that it was allowed to hold onto most of the machine’s memory—even if a lower limit could also have worked at the expense of additional GC cycles.
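
For context, a limit like this is typically configured as a Bazel startup option. A sketch of what the relevant bazelrc line looks like, not our exact configuration:

startup --host_jvm_args=-Xmx14g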

Real fix: measurement and tuning

The first step to solving this last problem was to measure how big the build graph really was. Seeing that our Bazel analysis phase is pretty short compared to other horrors I’ve seen in the past, I did not expect the size to be too large. But it had to be measured. This is as easy as running a command like the following and looking at the heap size in VisualVM while the command runs:

bazel build -k --nobuild //...

Note the funnily-named --nobuild flag given to the build command. This flag causes Bazel to stop executing the build right after the analysis phase is done, which means that the only thing Bazel will hold in memory is the build graph. Armed with this knowledge, I noticed something reasonable in VisualVM: after a GC pass, Bazel’s memory usage was a mere 500MB, extremely far from the 13GB used during the build.

This was promising, but the initial observation of needing more memory due to the increase in --jobs was probably well-founded. What could we do about it? A good starting point, as with most things in Bazel, is to research what options exist to tune the feature we suspect is problematic, which in this case was Remote Execution. Among these flags I spotted --experimental_remote_discard_merkle_trees, and commit 4069a876, which introduced it, described pretty much the same problem I was facing. Unfortunately, this flag is not yet in a stable Bazel release (6.0.0 at the time of this writing).

Luckily, this also made me find --experimental_remote_merkle_tree_cache, which was introduced much earlier in commit becd1494 and which was supposed to improve precisely this kind of scenario. Here is what the change had to say:

commit becd1494481b96d2bc08055d3d9d4d7968d9702e
Author: Fredrik Medley <fredrik.medley@gmail.com>
Date:   Tue Oct 26 19:44:10 2021 -0700

    Remote: Cache merkle trees

    When --experimental_remote_merkle_tree_cache is set, Merkle tree
    calculations are cached for each node in the input NestedSets
    (depsets). This drastically improves the speed when checking for
    remote cache hits. One example reduced the Merkle tree calculation
    time from 78 ms to 3 ms for 3000 inputs.

    The memory foot print of the cache is controlled by
    --experimental_remote_merkle_tree_cache_size.

    The caching is discarded after each build to free up memory, the
    cache setup time is negligible.

Giving this flag a try made Bazel consume less than 4GB of RAM throughout the build, with --jobs set to a much higher value than what we currently use by default. That is a saving of roughly 10GB of RAM, and it also translated into a much, much faster build because Bazel’s JVM had far less garbage to compute and collect.
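
For completeness, enabling the cache looks roughly like this in a bazelrc; the size value below is illustrative, not a tuned recommendation:

build --experimental_remote_merkle_tree_cache
build --experimental_remote_merkle_tree_cache_size=1000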

No workarounds were needed to solve this problem, and keeping RAM under control always feels nice.


That’s about it for today! Our pilot Bazel builds are now speedier than they were and we have eliminated a major source of frustration for our users during our initial deployment of Bazel: namely, their workstations don’t melt under memory pressure any longer.

As a personal tip: don’t ever give in to the temptation of increasing memory limits before understanding the cause behind the growth, even if you have memory to spare and what you are doing is “just” a small increase. Caving in to this temptation without further investigation means you will remain oblivious to real bugs that exist in your system and that need to be ironed out for better overall performance. Be skeptical, question assumptions, measure reality, and adjust as necessary.

If you like what you read and would enjoy working on similar exciting problems, know that we are hiring in the Developer Productivity Engineering team here at Snowflake.