I’m exhausted. I just came back to Seattle from a 10-day trip in which I attended three different Bazel events: the Build Meetup in Reykjavik, the Bazel Community Day in Munich, and BazelCon 2023, also in Munich. Oh, and because I was on the other side of the world, I also paid a visit to my family in Spain.
Attending these events has been incredibly useful and productive: I got exposure to many ideas and discussions that would just not happen online, I got to build connections with very interesting people and, of course, it has also been super fun to reconnect with old coworkers and friends.
This article contains the summary of the things I learned and the things I want to follow up on. These are just a bunch of cleaned-up notes that I took in the context of my work with Bazel at Snowflake and my interest in build tools; none of this is endorsed by Snowflake.
Here is the general timeline of the events:
- 2023-10-20 (Fri): Build Meetup in Reykjavik hosted at the Reykjavik University and organized by Unnar from EngFlow.
- 2023-10-21 (Sat): A 1-day tour of the Reykjanes peninsula with the Build Meetup crew which, while not an official conference, also led to many interesting discussions.
- 2023-10-23 (Mon): Bazel Community Day hosted by Salesforce in Munich, followed by an evening of food and drinks hosted by Gradle.
- 2023-10-24 (Tue): BazelCon 2023 day 1 hosted by Google Munich, which included an evening of drinks and games hosted by JetBrains.
- 2023-10-25 (Wed): BazelCon 2023 day 2 hosted by Google Munich, which included another evening event hosted by BuildBuddy that I could not attend because I departed early.
In our Bazel migration at Snowflake, we currently rely on the Bazel 6.x series—still the stable version at the time of this writing. Bazel 7 is around the corner and it brings many improvements that will, in theory, improve the developer experience significantly. Here are my highlights:
Build without the Bytes (BwtB): We tried using this feature before but it did not work well with dynamic execution, another feature that we must use for performance reasons. Bazel 7 should fix all the issues we saw because Bazel 6 is known to lack BwtB bug fixes that are in Bazel 7 due to backporting difficulties. Google is confident that this feature works fine now because they are using it on their corp Mac laptops instead of FUSE, as the latter has become increasingly cumbersome to use on Macs.
Path mapping and cache key scrubbing: New features in Bazel 7 allow Bazel to reuse the cache between different configurations. For example, Java targets only need to be built once irrespective of the target platform (arm64/x86), and Bazel 7 makes this “just one build” possible. This helps increase shared cache hits, reduces CI costs, reduces Bazel configuration and Git branch switch costs, and reduces local disk usage space. These features need to be enabled via flags and rules have to opt into this behavior if it’s useful for them (the Java rules do by default).
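To make the idea concrete, here is a toy sketch of why configuration-specific output paths split the cache and how rewriting them restores sharing. This is not Bazel's implementation: the `map_paths` and `cache_key` helpers, the `bazel-out/cfg/` placeholder, and the sample command lines are all assumptions made for illustration, and real cache keys also cover input digests and the environment.

```python
import hashlib
import re

# Matches the configuration segment of an output path,
# e.g. "bazel-out/k8-fastbuild/" (the layout used here is illustrative).
_CONFIG_SEGMENT = re.compile(r"bazel-out/[^/]+/")


def map_paths(argv):
    """Replace the configuration-specific path segment with a fixed placeholder."""
    return [_CONFIG_SEGMENT.sub("bazel-out/cfg/", arg) for arg in argv]


def cache_key(argv):
    """Toy cache key: a digest of the command line only."""
    return hashlib.sha256("\0".join(argv).encode("utf-8")).hexdigest()


# The same Java compilation requested under two different target platforms.
x86 = ["javac", "-d", "bazel-out/k8-fastbuild/bin/lib/classes", "lib/Foo.java"]
arm = ["javac", "-d", "bazel-out/darwin_arm64-fastbuild/bin/lib/classes", "lib/Foo.java"]

# Without mapping, the keys differ only because of the path segment.
print(cache_key(x86) == cache_key(arm))
# With mapping, both configurations resolve to the same key.
print(cache_key(map_paths(x86)) == cache_key(map_paths(arm)))
```

The point of the sketch is that nothing about the action itself differs between the two platforms; only the path does, so normalizing the path is enough to share the result.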
Skymeld: This feature interleaves the analysis and execution phases during a build and should be functional in Bazel 7. Using this feature should reduce end-to-end (E2E) build and test times, particularly for builds where the analysis phase takes a long time.
bzlmod: We all will be required to replace our intractable WORKSPACE files with bzlmod by Bazel 8, and bzlmod is already the default in Bazel 7. The benefits for the users of the build system are subtle, but they are palpable for anyone managing external dependencies or third-party package managers like pip or Maven. It’s possible to migrate incrementally by moving individual rules into using modules. Difficulties often arise when there are repo aliases in place though.
The two pieces of news that seemed exciting to me were:
IntelliJ BSP plugin: JetBrains has been working on a new Build Server Protocol (BSP) to fix the M:N problem for build tools and IDEs—just like Language Server Protocol (LSP) did for language integrations. JetBrains has created a new version of the Bazel plugin that relies on a BSP server, and this new plugin better integrates with IntelliJ’s internal project modeling and with Remote IntelliJ. The plugin launched in beta during BazelCon and does not yet work with CLion, but they are targeting Spring 2024 to get both out. It’s still early to adopt this new plugin, but the future is bright: if VSCode adopts the BSP, we’ll finally get Bazel support throughout developer tools. Microsoft, your move. (And Emacs pretty please?)
IntelliJ 2023.3.3: This new release of IntelliJ should make the Bazel plugin work well for remote development. Up until now, it was possible to open existing projects, but not import them using the Remote IntelliJ interface, which was suboptimal for desktop-less remote VMs.
Many presenters and attendees report that their companies use remote builds—unsurprisingly, because that’s the primary promise of using Bazel—and it turns out there are more companies than I thought supplying remote build and telemetry visualization services. Here are some of the interesting details I gathered:
Regional clusters: The remote build protocol is sensitive to latency, particularly for builds with low parallelism. In talking to EngFlow, they’ve measured a 10-20% build performance improvement by deploying separate clusters in different regions for a single customer. Multiple clusters are harder to operate than just one cluster… but if you accept that you need N+1 deployments anyway for reliability, you might as well colocate them with your primary offices.
Build Barn customizations: This open source remote build implementation supports having a bidirectional replicated cache (ideal for the regional clusters mentioned above), and even having small worker pools close to the users while sharing the same central cache. Meroton, who does Build Barn consulting, has set up small caches/worker pools in-office for customers while maintaining a larger “central cache”, and has gotten good results.
Build viewers: There are many different implementations of tools to visualize builds via the Build Event Protocol (BEP). EngFlow, BuildBuddy, Develocity (formerly Gradle Enterprise)… all were there. As an interesting data point: Develocity can ingest data from various build systems, not just BEP, and allows computing metrics from them all in aggregate. This may not be desirable in the “end state” of a build system migration, but it’s an attractive proposition while the migration is ongoing.
RE v3: There are ongoing discussions about whether a new version of the protocol should be designed or not. There is the thought that some features cannot be retrofitted into the previous protocol, but that’s not completely clear. Ed Schouten is requesting feedback in the document that he put together.
In-transit compression: gRPC has support for transparent in-transit compression (not at rest). We have thought of hacking it into Build Barn because we think this would improve E2E build times for users with less-than-optimal Internet pipes, but both EngFlow (Java) and Bloomberg (Python) report that they saw worse behavior with compression enabled. They suspect gRPC may be serializing compression operations or doing something similarly-unscalable at very low levels of the stack. Build Barn is written in Go though, so the outcome of this experiment might be different. Worth a try.
Jobs configuration: The RE protocol is latency sensitive. Ulf Adams from EngFlow recommended limiting the number of jobs per user on the server side and then configuring the local jobs number to a higher level to minimize the delays that Bazel itself imposes. George Gensure also brought up that there seems to be an unnecessary semaphore in the code that computes Merkle trees, which limits throughput, and that it should be removed from Bazel.
Dynamic scheduling tweaks: Buck2’s “dynamic scheduler” seems smarter than Bazel’s because it avoids running anything locally when the parallelism of the build graph is wide. Then, it enables parallel local execution (which they call hybrid execution) when the parallelism is narrow, as this is the common case towards the end of builds and in most incremental builds. It might be nice to try this in Bazel too because it would help reduce unnecessary network bandwidth (and also help remove the --dynamic_local_execution_delay hack), but it may increase E2E build time for clean builds if the network is slow… unless BwtB is at play, in which case this might be a win-win.
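The heuristic described above can be sketched in a few lines. This is only my reading of the idea, not Buck2's or Bazel's actual logic: the `choose_strategy` function, its parameters, and the comparison against the number of remote slots are assumptions made for illustration.

```python
def choose_strategy(ready_actions, remote_slots):
    """Pick an execution strategy from the current width of the build graph.

    ready_actions: number of actions that could start right now.
    remote_slots: number of remote executors available to this build
        (assumed to be known to the client, which may not hold in practice).
    """
    if ready_actions >= remote_slots:
        # Wide graph: there is enough work to keep the remote farm busy,
        # so do not spend local resources or bandwidth racing it.
        return "remote-only"
    # Narrow graph: race local against remote to hide network latency.
    return "hybrid"


# Start of a clean build: thousands of ready actions.
print(choose_strategy(ready_actions=5000, remote_slots=200))
# Tail of a build or a small incremental build: a handful of actions.
print(choose_strategy(ready_actions=3, remote_slots=200))
```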
Things to try out
My TODO list after speaking to people is… long. Here are a bunch of things I want to try:
Visualize the build graph of our product: Bazel has flags to dump the in-memory build graph as a Graphviz file, and SkyScope by Tweag allows interactively inspecting large graphs. It’d be nice to try this out and see what happens, because visualizations often help uncover issues that are hard to imagine otherwise.
Buck2’s LSP for Starlark: Meta has done a lot of work in Buck2 to make Starlark easy to write. They have things like an LSP server for Starlark, static typing, static analysis, a profiler, a debugger… While we cannot use features like static typing because Bazel doesn’t support them (yet?), it should be possible to leverage the LSP server for use with VSCode. It apparently has a Bazel mode.
Expendable CI jobs: Some CI jobs are not critical: for example, consider a job that exists purely to keep the Bazel remote cache warm for IntelliJ project syncs. Someone brought up the interesting idea of making these jobs monitor the build farm load and skip their execution if the farm is above a certain threshold (to preserve resources and minimize the chances of exhausting them).
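A minimal sketch of such a gate, assuming the farm exposes some utilization figure between 0 and 1. The `should_run_expendable_job` function, the `get_farm_load` callable, and the 0.8 threshold are all hypothetical; how to obtain the load depends entirely on the build farm and its monitoring.

```python
def should_run_expendable_job(get_farm_load, threshold=0.8):
    """Return True when the farm has spare capacity for a non-critical job.

    get_farm_load: callable returning the current utilization in [0.0, 1.0]
        (hypothetical; a real job would query the farm's monitoring).
    threshold: utilization at or above which the job is skipped.
    """
    load = get_farm_load()
    if not 0.0 <= load <= 1.0:
        raise ValueError(f"unexpected load value: {load}")
    return load < threshold


if __name__ == "__main__":
    # Stubbed load source for illustration only.
    if should_run_expendable_job(lambda: 0.35):
        print("farm has capacity: warming the cache")
    else:
        print("farm is busy: skipping this run")
```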
Analyze breakdown of rules by type: Google reports that 30% of their rules are of the genrule kind, which prevents certain kinds of optimizations (e.g. they are very expensive memory-wise because they do not benefit from Bazel’s depset internal representation). It’d be nice to see what our breakdown looks like in case we have to plan some refactoring.
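One way to get this breakdown is to count rule kinds in the output of `bazel query //... --output=label_kind`, which prints one line per target in the form `<kind> rule <label>`. The sketch below assumes that format; the `rule_kind_breakdown` helper and the sample lines are made up for illustration.

```python
from collections import Counter


def rule_kind_breakdown(label_kind_lines):
    """Return the fraction of rules per kind from `--output=label_kind` lines."""
    counts = Counter()
    for line in label_kind_lines:
        parts = line.split()
        # Rule lines look like "cc_library rule //pkg:name"; source and
        # generated files have a different shape and are skipped.
        if len(parts) == 3 and parts[1] == "rule":
            counts[parts[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: count / total for kind, count in counts.items()}


# Made-up sample of query output.
sample = [
    "genrule rule //a:gen",
    "cc_library rule //a:lib",
    "cc_binary rule //a:bin",
    "source file //a:main.cc",
    "java_library rule //b:lib",
]

for kind, fraction in sorted(rule_kind_breakdown(sample).items()):
    print(f"{kind}: {fraction:.0%}")
```

For a real repository, the lines would come from the query's standard output instead of the `sample` list.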
Jenkins persistent runners: We currently tear down Jenkins runners quickly, but this discards all Git and Bazel state. For Bazel-only jobs, using persistent runners could help reduce E2E run times. ThoughtSpot reports a 40% reduction in average build/test time. Aspect also claims that tearing down runners is an anti-pattern. BuildBuddy has done work to clone runners with hot Bazel in-memory state and claims an 8x improvement in build times.
Using actions for downloads: Even if this is unorthodox, several people use actions to download toolchains and the like to minimize the cost of downloads during pre-analysis, to keep those downloads in the remote build farm when the client doesn’t need them at all, and to minimize local disk space. I had thought about doing this too, but I was hesitant because it feels “ugly”. It’s good to hear others have thought of and tested the same idea too. Note that doing this results in reproducible builds only if downloads are subject to checksum verification.
Validation actions: There are new features in Bazel to define “validation actions” to run things like linters in parallel with the build (without dependency hacks as were used before). Google is using this to run Android Lint internally, which sounds similar to other tools like the popular SpotBugs. These actions can also produce diffs to apply to the source tree to fix the issues they identify.
These are some long-running changes I’d like to get into upstream Bazel and that need follow up:
C++ linker memory model: The C++ linker memory estimation in Bazel is way off reality, and I previously upstreamed the foundations to fix the issue by exposing input sizes to the estimation logic. We still have to carry a local patch to tune the model based on our observations, but the patch is really small now. But given the diversity in the C++ ecosystem, it’d be awesome if we could parameterize the memory computation in the C++ toolchain definition somehow and avoid a local patch: imagine having a lambda that returns a CPU/RAM resource set based on the number of inputs, their size, and the compilation model. I need to engage in GitHub Discussions.
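To illustrate what such a parameterized estimate could look like, here is a sketch of a function that maps link inputs to a resource set. Nothing here exists in Bazel: the `ResourceSet` type, the `link_resources` function, and every coefficient are placeholders, and a real model would have to be fitted to observations from the actual toolchain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSet:
    cpu: float        # number of cores to reserve
    memory_mb: float  # estimated peak memory in MB


def link_resources(num_inputs, total_input_bytes,
                   base_mb=50.0, mb_per_input=0.5, mb_per_input_mb=2.0):
    """Estimate linker resources from the number and size of its inputs.

    All coefficients are made-up defaults; a toolchain definition would
    supply values tuned for its own linker and compilation model.
    """
    input_mb = total_input_bytes / (1024 * 1024)
    memory_mb = base_mb + mb_per_input * num_inputs + mb_per_input_mb * input_mb
    return ResourceSet(cpu=1.0, memory_mb=memory_mb)


# A link step with 10 object files totalling 100 MB.
print(link_resources(num_inputs=10, total_input_bytes=100 * 1024 * 1024))
```

The appeal of this shape is that the scheduler only needs to call the function with facts it already knows, while the tuning lives with the toolchain instead of in a local Bazel patch.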
BEP sanitization changes: I brought up that the BEP is incredibly prone to leaking secrets stored in the environment and while folks knew this happens—after all, tools like BuildBuddy and EngFlow have logic to scrub secrets—they were not really aware of the extent of the problem. I have a prototype patch to scrub all environment variables from the BEP (except for a limited allowed list useful for debugging) and I think we agreed that this feature would be good to upstream. So now I need to write a proper bug report and share my audits of the protocol and the proposal for a fix. Stay tuned.
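The scrubbing approach boils down to an allow list applied to the environment before it is recorded. The sketch below is not the prototype patch mentioned above: the `scrub_env` function and the contents of `ALLOWED_VARS` are hypothetical, and dropping variables outright (rather than redacting their values) is an assumption.

```python
# Hypothetical allow list of variables considered safe and useful for debugging.
ALLOWED_VARS = frozenset({"PATH", "LANG", "TZ"})


def scrub_env(env, allowed=ALLOWED_VARS):
    """Return a copy of env that keeps only the allowed variables."""
    return {name: value for name, value in env.items() if name in allowed}


# Made-up environment with a secret in it.
env = {
    "PATH": "/usr/bin:/bin",
    "LANG": "en_US.UTF-8",
    "AWS_SECRET_ACCESS_KEY": "not-a-real-secret",
}
print(scrub_env(env))
```

An allow list fails safe: a new secret added to the environment stays out of the event stream by default, whereas a deny list would leak it until someone notices.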
bb-clientd patch: The current implementation of the patch to support a FUSE-based output tree is going nowhere because of the upcoming Remote Output Service formalization of this same idea. Until that’s available, I don’t think it’s worth trying to maintain this on top of Bazel 6 any longer, particularly due to the BwtB improvements that are coming in Bazel 7. It seems wiser to just wait.
And to conclude, a bunch of disconnected notes about things that were interesting to me:
Buck2: This new build system from Meta looks really cool to me. On the plus side, it fixes some long-standing issues in Bazel, like having a truly language-agnostic core, being built from the ground up with BwtB and FUSE in mind, and super-quick startup time because it’s not Java; hindsight is 20/20 after all. But it also has its drawbacks, like no support for external dependencies yet or no local sandboxing (both of which are fixable). They also have nice features like limited dynamic dependency discovery, which makes tools like Gazelle less necessary. And… they are considering writing a shim to support Bazel rules in Buck2, which would make experimentation on existing projects super-exciting.
Bazel jobs scalability: There are two problems with making Bazel scale to thousands of concurrent jobs: the “1 job = 1 thread” execution model, and memory usage due to Merkle tree computation. Ulf Adams reports that he could run Bazel with hundreds of thousands of jobs back at Google when he implemented async execution by hand—the Google-internal RE protocol doesn’t use Merkle trees, so memory pressure was not an issue—but unfortunately the changes were never productionized and have been backed out of Bazel due to their complexity. Java 21 brings virtual threads and paves the way to fix this in a nicer, non-hacky way.
bb-clientd improvements: Meroton has significant experience with deploying bb-clientd for customers and is interested in the issues we face because they are the main pushers for the Remote Output Service feature. In particular, dynamic execution doesn’t play well with bb-clientd because it causes bb-clientd to backlog downloads when local actions are frequently cancelled. It should be possible to tell bb-clientd to stop downloads and resolve these issues because the FUSE protocol supports it, but it’s not plumbed through. Also, implementing chunked downloads to support debugging of large binaries with minimal network latency sounds awesome, but we’d need to quantify the benefit first: I suspect you need just a tiny portion of a multi-GB C++ debug binary to produce a stacktrace.
Virtual file systems on macOS: Both Ed Schouten from Build Barn and Neil Mitchell from Buck2 report that NFSv4 has been pretty decent on macOS when compared to FUSE to implement virtual file systems, and that it should be the primary mechanism for writing virtual file systems on macOS now. FUSE is unfortunately doomed due to Apple’s desire to kill kernel extensions, which makes it really hard to install FUSE.
Avoiding BUILD files: Salesforce has undergone a Bazel migration with 4000 engineers and claims that 70% have trouble writing BUILD files. Anything that can be done to hide them / automate edits is worthwhile. Luminar has developed a C++ plugin for Gazelle and they use Nix packages to pull in third-party dependencies.
Coverage-guided test selection: It’s a common problem to have integration tests that depend on pretty much all of the codebase, which in turn causes test runs to take too long and nullifies Bazel’s test caching features. Coverage metrics can be useful in implementing a heuristic to identify which subset of the integration tests to run on a given change at the expense of occasional false negatives.
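In its simplest form, the heuristic is a lookup from changed files to the tests whose previous coverage touched them. The sketch below assumes a coverage map collected from an earlier full run; the `select_tests` function and the sample data are made up for illustration.

```python
def select_tests(changed_files, coverage_map):
    """Return the tests whose recorded coverage includes any changed file.

    coverage_map: test name -> set of source files the test covered in a
        previous run. Stale coverage data is where false negatives come from.
    """
    changed = set(changed_files)
    return sorted(
        test for test, covered in coverage_map.items() if covered & changed
    )


# Made-up coverage data from an earlier run.
coverage_map = {
    "//tests:login_test": {"src/auth.cc", "src/session.cc"},
    "//tests:billing_test": {"src/billing.cc", "src/session.cc"},
    "//tests:search_test": {"src/search.cc"},
}

print(select_tests(["src/session.cc"], coverage_map))
print(select_tests(["src/search.cc"], coverage_map))
```

A file that no recorded test has covered selects nothing, so a practical setup would fall back to running everything for new or unmapped files.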
Monorepos and Git: VMware claims that Perforce is much faster than Git at syncing GBs of source code (1min vs. 12min). Meroton has developed a “monorepo emulation” mode on top of many small Git repos leveraging Gerrit’s cross-repo atomic commits. AOSP (Android) has done something similar.
Abusing “exec requirements” to tune remote workers: Stripe has implemented lots of custom features to tune the behavior of remote workers via “exec requirement” tags at the action/target level. They can do things like mock time on the remote containers to exercise timing conditions (e.g. leap year switches) or to request access to certain internal-only network endpoints. They can also emit trace data from actions (by configuring a “listener” that propagates those details) and then merge such traces into the Bazel JSON trace profile to show what exactly the long-running actions are doing. Think visualizing what a nested make -j8 is doing.
Building without the Internet: Salesforce mirrors all external dependencies internally and denies all downloads by default from untrusted sources. They do allow the build farm to talk to internal sources to fetch artifacts though, because their tests need to do that. The “resolved file” feature in Bazel can help create a catalog of dependencies, and the “download config” can be used to deny external access.
RE for arbitrary builds: EngFlow is doing a lot of work to support arbitrary builds using remote execution, not just Bazel builds. They have “revived” Google’s reclient—a compiler wrapper that uses RE—and offer support for CMake. The idea in CMake is to use reclient where possible (compiling and linking) and to run CMake itself on a remote worker to paper over the limitations of actions not supported by reclient.
Memory consumption is a widespread problem: Everybody dislikes how Bazel consumes memory. Google is doing work on this and we should continue to see improvements. One that sounded promising is the addition of support to discard parts of the build graph (e.g. for things not in the critical path).
Blogging is cool: Many people opened with “I read your blog!” upon meeting them which was… flattering, I must confess. I need to write more. Subscribe to this blog if you haven’t yet!
A bunch of papers mentioned during the many discussions that happened: