Just like that, BazelCon 2024 came and went. And just like that, Blog System/5 is about to turn 1 year old as the very first post of this newsletter was the recap of BazelCon 2023. Since then, this newsletter has amassed 1200+ subscribers and I have surpassed 20 years of blogging—so, thank you for your support, everyone!

To celebrate this milestone, it’s obviously time to summarize the two events of last week: BazelCon 2024 and the adjacent Build Meetup. There is A LOT of ground to cover. I could probably write a separate article for each of the sections below, but folks coming for a summary of the conference will want to see everything in one place… so you get this massive piece instead.

Overall, this is a 40-minute read… but let’s face it: getting 3 days’ worth of content in less than an hour sounds like a good deal, doesn’t it? Feel free to pick and choose sections though; each stands on its own and each paragraph represents a thought I captured from some presentation or discussion.

By the way: no LLMs were involved in this work, thus you can only imagine how much effort went into putting this together. So, subscribe to support this and future posts and… enjoy!

Schedule and logistics

BazelCon 2024 was hosted at the Computer History Museum in Mountain View, CA, on October 14th and 15th. The conference had a single track of talks and an overlapping track of BoF sessions. I attended most of the talks but just one BoF—the IDE one. The conference was followed by evening socials hosted by BuildBuddy and JetBrains/EngFlow.

The Build Meetup followed BazelCon 2024 on October 16th, and this time around it was co-hosted by Meta and EngFlow in Menlo Park, CA. The meetup had three tracks: one for Bazel, one for Buck 2, and one for remote execution. I helped facilitate the Bazel track, including taking notes for the afternoon unconference (notes at the very end), and gave a 15-minute talk on Bazel at Snowflake. Facilitating the track was fun, but unfortunately it made me miss the Buck track and I still want to learn more about it.

The notes I gathered from the talks and discussions are grouped into the topics below, one per section:

Community and adoption

The conference opened with the usual briefing and SOTU, which included information on the conference itself and the latest news in the Bazel project. These were followed by talks that touched upon the community and current adoption trends, and these are the thoughts I gathered:

  • Conference ownership: This was the first BazelCon not run by Google. The conference is now owned by the Linux Foundation and Google was a sponsor at the same level as BuildBuddy and EngFlow. The Linux Foundation ran the conference excellently thanks to the dedicated team of professionals that they have for the task.

  • Conference history: Looking back, you should know that the Bazel team was (and still is) distributed across two sites: Google NYC and Google Munich. To mitigate the difficulties that arise from geographical distribution, the team held internal-only summits every 6 months. It wasn’t until 2017 that this internal conference became BazelCon and, after that, the team alternated between an internal event and an external event. Last year, BazelCon had 250 attendees with a waitlist that contained 200 extra folks, and this is the second year that the program committee contains non-Google folks (myself included).

  • Google’s investment in Bazel: Does this shift to the Linux Foundation change Bazel’s governance to make it more community-driven? Not yet, but it’s a step in that direction! Google still owns Bazel and continues to invest in it because they can leverage contributions from strong non-Google engineers and because Google maintains a bunch of open-source projects that rely on Bazel. But Google also wants to have a more open community to guarantee that Bazel continues to thrive even if company-internal priorities change.

  • The bazel-contrib organization: To support these changes, a few folks have created a new GitHub organization and have gotten Google to donate many repos to this new org. The organization creators have a desire to start a new foundation to support and direct these projects and are looking for about 5 companies to join.

  • Bazel adoption stickiness: The REBELs research group at the University of Waterloo conducted a study of 35,000 significant GitHub projects. Of the 1.5% of projects that adopted Bazel, 11% abandoned the migration at roughly the 2-year mark, and the study looked into why.

    • Abandonment reasons: The reasons cited for abandoning Bazel were varied, but can be summarized as encountering technical challenges, issues with team coordination and onboarding, and seeing community trends (like Kubernetes deciding to move off of Bazel).

    • If not Bazel, then what?: These projects primarily decided to move to language-specific tools, especially for languages with strong native tooling like Go and Swift. However, a significant proportion also decided to move back to “inferior” tools like CMake or even GNU Make. These projects acknowledged that these tools are less feature rich and don’t support integration with other languages, but they are conventional and easier to understand by the community.

    • If Bazel, then why?: On the plus side, for the cases where Bazel stuck, we can find the reasons we expect: dependency management, faster builds, and even the influence of other projects shine as reasons for making Bazel a good choice.

  • Adoption suggestions: The CTO of Ergatta spoke on how his small company has been able to successfully leverage Bazel without massive investments in infrastructure. His suggestions were to keep things simple (e.g. by using the GCS-backed cache despite its flaws, or by doing the trivial bazel test ... on CI); to expedite solutions even if not perfect (e.g. by wrapping foreign builds in Bazel); to look for champions and leverage them for adoption; to focus on “good enough” results; to write documentation for your successors; and, surprisingly, to do frequent and incremental changes to workflows instead of big-bang “improvements”.
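
In case you haven’t seen the “wrap foreign builds” trick before, rules_foreign_cc is the usual way to do it: a rule drives an existing CMake (or configure/make) project from a Bazel target without rewriting its build. A minimal sketch, with hypothetical labels:

```python
# Hedged sketch: wrap an existing CMake project with rules_foreign_cc.
# "@libfoo_srcs//:all_srcs" is an illustrative label for the vendored sources.
load("@rules_foreign_cc//foreign_cc:defs.bzl", "cmake")

cmake(
    name = "libfoo",
    lib_source = "@libfoo_srcs//:all_srcs",
    out_static_libs = ["libfoo.a"],
)
```

It’s not pretty nor fully hermetic, but it gets a migration unblocked quickly, which is exactly the point of the advice above.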

Remote execution

As you know, remote execution is Bazel’s raison d’être. Consequently, the SOTU covered various updates on this topic and many talks included related thoughts:

  • SOTU updates: The remote output service is now available and bb-clientd offers an implementation. The execution log is now 100x smaller than before and only has a 3% runtime overhead, which is useful to debug cache hits. Upcoming changes include concurrent uploads to the cache in the background without blocking action completion, GCing of local caches (disk cache, install cache, etc.), and BwoB (Builds without the Bytes) improvements to decrease incremental build times when Skymeld is in use.

  • Remote persistent workers: AirBnB observed 2x slower builds when not using remote persistent workers. They use dedicated pools for Java and Kotlin and route all actions into these pools, and reminded us that tagging targets comes with pitfalls because targets can spawn multiple actions and just one of them needs the worker.

  • Queuing control: Remote execution environments typically have different worker pools to support different build environments or to optimize cost. But note that the goal shouldn’t always be to tolerate all possible load all the time: it is reasonable to put limits on the worker pools, like AirBnB does, dedicating one pool to interactive builds where low latency is the priority and a separate pool to CI builds with hard caps where queuing is tolerated.

  • Target size matters: Dealing with tons of individual small files is costly. In some cases, you may be better off tarring them up and declaring a single input to an action and making the action unpack the tar during execution. This came up during the talk on high performance builds for web monorepos, and I found this interesting because this goes in the opposite direction of tree artifacts which have often been problematic.
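
To make the idea concrete, here is a minimal sketch (names are illustrative) of packing many small files into a single tar input and letting the consuming action unpack it:

```python
# Bundle thousands of small files into one input so the remote cache and the
# workers deal with a single blob instead of many tiny ones.
genrule(
    name = "assets_tar",
    srcs = glob(["assets/**"]),
    outs = ["assets.tar"],
    cmd = "tar -cf $@ $(SRCS)",
)

genrule(
    name = "site",
    srcs = [":assets_tar"],
    outs = ["site.tar.gz"],
    # The consumer unpacks the tar itself as the first step of the action.
    cmd = "mkdir staging && tar -xf $(location :assets_tar) -C staging && tar -czf $@ -C staging .",
)
```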

  • Action deduplication: Any remote execution service worth its money can coalesce in-flight actions to save on cost. This shines in CI with builds that have long-running actions, and Canva reports that they see 500k dedups per day on about 3 webpack builds. One consideration is that flaky test passes/failures are amplified by action coalescing: without this feature, flakes will show up as random failures over time, whereas with this feature, failures will be clustered. Extracting a signal around flakiness becomes harder.

  • Determinism checks: While not necessarily an issue with remote execution, having remote execution exacerbates the problem of non-deterministic actions in a build. Spend the time to set up a CI job to look for non-determinism and alert on it. Specific problems that often arise are timestamps in JARs (rules_antlr is impacted), absolute paths (rules_kotlin is impacted), or diagnostics information (Scala’s sdeps file is problematic).

  • Postmortems and learnings: Ulf Adams gave a talk reflecting on four different incidents that EngFlow’s remote execution service suffered. Personally, seeing companies introspect their failures makes me trust them more than not, but YMMV. In any case, I do not necessarily want to go over the incidents per se, but I do want to go over the recommendations:

    • Avoid RPC cycles and set timeouts: The call graph between services must not have cycles, which is hard to enforce because these may show up organically over time. Also make sure to have timeouts on all RPCs, although the RE protocol makes this difficult because it has long-running RPCs.

    • Make sure flow control is configured: This is a feature of gRPC that allows a server to keep memory consumption in check and is achieved by “slowing down” incoming traffic (thus “controlling the flow”). When proxying traffic in a server, special care must be taken to connect the flow control logic of the input and output streams of the proxy. Otherwise, a slow client can have ripple effects through the system and cause OOMs.

    • Auto-scaling is tricky to get right: From a cost-savings perspective, you want to downsize servers as quickly as possible and increase them slowly. But if you do this and there is queuing, it’s possible that auto-scaling will make queuing worse.

    • Guardrails build trust: You might say that user-induced problems are not service problems: after all, if, for example, they cause compile actions to time out after 20 minutes, it’s “their fault” for doing the wrong thing. However, if this happens, there obviously is something wrong—and it’d be nice for the service provider to notice and have prevention features in place just like your water provider can detect leaks. (And yet… AWS still doesn’t have a way to put a limit on spend.)

IDE support

If you have had to support any developers converting from “legacy” build systems to Bazel, you may have realized that… the IDE features they are used to don’t quite work after the migration. Things have gotten better in the IDE space for Bazel but aren’t great yet. Fortunately, there are a few pieces of news that give hope:

  • New JetBrains plugin for IntelliJ: Released to the marketplace during the conference, this new plugin provides tighter integration between IntelliJ and Bazel through the BSP abstraction layer. This new plugin can represent the Bazel build graph in the IDE, offers syntax highlighting for bazelrc files (and, feature request: could offer docs on hover), supports inserting breakpoints in Starlark evaluation, implements “fast build” correctly, and optimizes the sync process. There are some gaps compared to the old plugin, but I’m super-excited by what’s coming because the old Google-owned plugin is… underwhelming. See the feature list and the roadmap for more details.

  • Build Server Protocol (BSP): The BSP is a new protocol that aims to achieve the same thing that the Language Server Protocol (LSP) did: namely, solve the M:N problem between IDEs and build systems. The idea is to have an intermediate abstraction for the build system so that M IDEs can talk to N build systems by means of an intermediary process that converts the BSP to build actions. The BSP is still young though and, right now, it’s pretty rigid because it has deep knowledge of the languages it supports. In particular, there is no support for custom rules yet, so get involved if you want to see this happen.

  • Old plugin for JetBrains: The old plugin will be fully supported until mid-2025 (although I predict they’ll have to backpedal on this and extend this date). This plugin has many shortcomings because it originates from Android Studio and contains assumptions about Android and about Google’s own infrastructure. That said, there is a new feature called “query sync” that optimizes the sync process massively… at the expense of having to manually “enable analyze” for any files where deep IDE integration is desired. The problem is: you have to enable analysis to get insights on generated files, and generated files tend to be everywhere thanks to protobuf, so… you’ll likely find yourself “enabling analyze” all the time. Shrug.

  • Good IDE support is critical: Having a good IDE experience is crucial for developers, but the irony is that the people that often work on build migrations or the IDE are experts and know how to tolerate imperfect IDE support. This is not the case for most developers, particularly those that must get up to speed… so keep that in mind. Running user interviews, surveys, etc. is a necessity to understand what people truly perceive as problems.

  • Compilation database: JetBrains is not the only contender in the IDE space. There are many more (wink, wink, Emacs) and BSP will make offering a Bazel experience within them possible. Until that’s ready, though, you may need to manually deal with a “compilation database”, especially if you want to integrate with VSCode. A compilation database is, simply put, a JSON file that contains the commands to compile each and every translation unit in a project. There are various options to generate one of these with Bazel, all with different trade-offs:

    • Intercept builds: Use bear as a wrapper to run the full clean build. This works with some build systems, but unfortunately, Bazel’s client/server model doesn’t allow bear to intercept the actions it spawns.

    • Extra actions: Add a “listener” extra action that listens for CppCompile. This also requires a full clean build and has been deprecated in favor of aspects, but it is the approach that Kythe uses.

    • Action graph query: Does not require a full build, just a warmed up analysis cache. However, this does not support tree artifacts. Examples: bazel-compile-commands-extractor, bazel-compile-commands (two forks).

    • Aspect: Does not require a full build, but the generated commands may not be identical to the ones that are actually used. For example: if you have a code generator that produces a tree artifact that is then fed as an input to a cc_library, the tree artifact is represented as a CppCompileActionTemplate that is only expanded into specific command lines at execution time.
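
To illustrate the aspect approach, here is a hedged sketch (file and attribute names are mine, not from any of the projects above) that walks the action graph, keeps CppCompile actions, and dumps their command lines as compile_commands.json fragments:

```python
# compile_commands.bzl -- minimal sketch of the "aspect" approach.
def _compile_commands_aspect_impl(target, ctx):
    entries = []
    for action in target.actions:
        if action.mnemonic != "CppCompile" or not action.argv:
            continue
        # Crude heuristic to find the translation unit among the inputs.
        srcs = [f.path for f in action.inputs.to_list() if f.extension in ["c", "cc", "cpp"]]
        entries.append(struct(
            directory = ".",
            arguments = action.argv,
            file = srcs[0] if srcs else "",
        ))
    out = ctx.actions.declare_file(target.label.name + ".compile_commands.json")
    ctx.actions.write(out, json.encode(entries))
    return [OutputGroupInfo(compile_commands = depset([out]))]

compile_commands_aspect = aspect(
    implementation = _compile_commands_aspect_impl,
    attr_aspects = ["deps"],
)
```

Invoking it would look something like bazel build //... --aspects=//tools:compile_commands.bzl%compile_commands_aspect --output_groups=compile_commands, followed by a small script that merges the per-target fragments. Note how the CppCompileActionTemplate caveat above applies: tree artifact compilations simply won’t show up here.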

  • Non-blocking queries: The Bazel server has a global lock that prevents it from running more than one command at a time. This is a problem because the IDE needs to issue Bazel queries all the time, but those queries conflict with other user actions like builds and tests. There are various solutions to this problem, which we discussed in the unconference:

    • Separate output bases: This works but it’s heavy-handed because you end up with two full separate builds and two separate Bazel server processes.

    • Simpler tools: Many operations on build files do not require the full weight of Bazel. Google has implemented simpler tools that operate directly on those files, precisely to bypass the overhead of calling into Bazel. These tools include buildifier, buildozer, or the “fast builds” feature of the IntelliJ plugin.

    • Parallel Skyframe evaluation: Skyframe’s inherent design is to be purely functional, so in principle it should be possible to perform multiple operations on the graph in parallel. Unfortunately, there is a lot of mutable state outside of Skyframe in Bazel and, while theoretically possible, fixing this is a lot of work.

    • Implementing a depserver: Instead of running a separate Bazel on the same machine to answer queries, you could push this responsibility to an external service. Such a service is easy to build (you can imagine running bazel query ... on each commit) but the problem arises when you want to issue queries against this service from modified source trees.

Inner loop development

Bazel is the tool that glues together the inner loop of the development process. As a reminder, the inner loop refers to the edit, build, and test cycle, and Bazel has tentacles in all three stages. Various talks gave thoughts on this topic:

  • Avoid Bazel in the inner loop: AirBnB reported that they noticed significant overhead and lack of incrementality in certain operations when trying to hook up Bazel into the IDE. They have the equivalent of “fast builds” in their own IDE plugin and have reduced incremental builds from 30 seconds to 1 second. This is a big deal for interactive builds. Personally, I think that having to side-step Bazel is ridiculous: Bazel is designed to be optimal and has perfect knowledge of the state of the world, yet it’s too slow for quick operations.

  • Integrate with native tooling: Bazel works across many ecosystems, but as such, it’s a friend to none. For example: Go developers want to work with the (super-fast) Go tooling so, if it’s feasible, it’s interesting to allow using such tooling directly. The folks at LinkedIn created a rule that generates an .envrc file to expose the native Go tools and relies on direnv to make this work transparently for users.
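
I haven’t seen LinkedIn’s code, but the shape of such a rule is simple enough to sketch (labels and names below are illustrative, and a real implementation would expand paths to absolute ones and drop the file where direnv can find it):

```python
# go_envrc.bzl -- hedged sketch: emit an envrc fragment that puts the hermetic
# Go SDK on PATH so that direnv exposes the same `go` that Bazel uses.
def _go_envrc_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".envrc")
    ctx.actions.write(out, 'export PATH="{}:$PATH"\n'.format(ctx.file.go.dirname))
    return [DefaultInfo(files = depset([out]))]

go_envrc = rule(
    implementation = _go_envrc_impl,
    attrs = {
        # E.g. go = "@go_sdk//:bin/go" with rules_go (illustrative label).
        "go": attr.label(allow_single_file = True, mandatory = True),
    },
)
```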

  • macOS as a client is common: Even for shops where all software runs on Linux, developers tend to have Macs and want to work locally. You may or may not agree, but if you have to support inner-loop development on the Mac, there are various things to consider. One is that cross-platform caching for machine-independent actions like Java may not work, doubling cache requirements; for this, you may consider using “universal binaries” (aka shell scripts that wrap multiple binaries and choose the right one at runtime). And you should really set up non-determinism checks.

  • No build files: To my dismay, tons of people seem to really want to not have build files. This makes me sad because manually-curated build files serve as documentation for the conceptual architecture of a project and let you see, at PR review time, changes to the interactions between components. You might say that #includes or package imports in the source are sufficient for this—but they really aren’t: they are too fine-grained and “innocent-looking”. I think I’ll need to write a follow-up article on this point.

Toolchains

Bazel supports targeting multiple architectures and systems, and at the core of this support lie toolchains. There were various talks that touched on them:

  • Toolchains 101: In Bazel, rules convert providers to action templates, and toolchains help convert those templates to actions by replacing two things in the templates: the paths to the tools required by the action, and the arguments to pass to those tools. Simple, right? But defining toolchains has always been a black art.

  • Simplified toolchains: Google presented a new way to define toolchains that’s much easier than the original one. The new mechanism relies on build rules: a toolchain rule declares a toolchain, and the toolchain depends on a “tool map” rule, which is a flat map of tool names (strategically specified as labels for typo- and type-checking, not flat strings) to binaries (also specified as labels). These tools are also connected to argument rules that define the sequence of arguments to apply to each tool. The support for “features” in core Bazel can potentially be removed in favor of this.
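
I did not write down the exact API, but based on the description in the talk, the shape is roughly the following—every rule and label name here is hypothetical, so treat this as a picture of the concept rather than the real interface:

```python
# Hypothetical shape only: toolchain -> tool map -> tools, plus argument rules.
cc_args(
    name = "default_warnings",
    args = ["-Wall", "-Werror"],
)

cc_tool_map(
    name = "tool_map",
    # Tool names are labels (typo- and type-checked), mapped to binaries.
    tools = {
        "//toolchain/tool_names:c_compile": "@llvm//:clang",
        "//toolchain/tool_names:link": "@llvm//:lld",
    },
)

cc_toolchain(
    name = "linux_clang_toolchain",
    tool_map = ":tool_map",
    args = [":default_warnings"],
)
```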

  • Default toolchain no more: Following on from the previous point, Google said that there is a desire to remove the concept of a default C++ toolchain in favor of explicitly defining a hermetic toolchain like almost all other rules do. This was met with cheering from the audience.

  • Debugging: It’s funny how, despite the complexity of toolchains and how we should make things easier for users… --toolchain_resolution_debug’s help claims that the flag is only for experts. PR 19926 made the debugging messages much clearer, but you should still understand the toolchain resolution algorithm to make sense of issues, and the talk on “Creating C++ toolchains easily” explained that.

  • Universal binaries: Bazel doesn’t like sharing artifacts across different architectures. This is generally the right thing to do but, for platform-agnostic languages like Java, this has the potential of doubling remote execution and caching costs. There is an ongoing discussion upstream on what to do with this issue and various partial implementations in the code (like the --experimental_platform_in_output_dir flag), but no great solution yet. In the meantime, the alternative is to implement “universal binaries”. The idea is to create a shell script that wraps a binary tool built for various systems and selects, at runtime, which one to run.
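
A minimal sketch of what such a wrapper can look like, with illustrative labels (the dispatcher script resolves the bundled tools via its runfiles):

```python
# "Universal binary": one artifact, and thus one cache entry, that works on
# both Linux and macOS clients by picking the right prebuilt tool at runtime.
sh_binary(
    name = "protoc",
    # protoc_dispatch.sh does roughly:
    #   case "$(uname -s)" in
    #     Darwin) exec ".../protoc-macos" "$@" ;;
    #     *)      exec ".../protoc-linux" "$@" ;;
    #   esac
    srcs = ["protoc_dispatch.sh"],
    data = [
        "@protoc_linux_x86_64//:protoc",   # hypothetical prebuilt repos
        "@protoc_macos_arm64//:protoc",
    ],
)
```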

  • C++ shows up everywhere: Even if your build doesn’t seem to require C++ anywhere, the requirement to have this toolchain installed shows up pretty much everywhere because of… protobuf. bzlmod has the potential to fix this because the protobuf bzlmod packages ship with prebuilt binaries.

Sandboxing

To keep builds reproducible, Bazel offers sandboxing features at the action level to ensure that actions can only do what they are supposed to do. I used to work in this space while on the Bazel team, so all talks on this topic were really interesting. Here are the notes:

  • macOS sandboxing is slow: Ah, the issue that never dies and that came up again in the unconference. While it might be true that macOS handles symlinks poorly, I’m still not convinced that there is anything that makes macOS sandboxing inherently slow. Yes, it will be less sandbox-y than what you get on Linux, but performance-wise it should be OK. All of my testing years ago didn’t show sandbox-exec as a performance bottleneck, and the sandbox doesn’t kill Bazel’s own build performance so… something else is at play here. Which brings us to the next point.

  • Persistent state: Some compilers like Objective C’s build a module cache as they run. This cache is intended to be shared across invocations of the compiler. However, when the sandbox is in effect, sharing of this cache becomes impossible, which in turn makes builds incredibly slow. It may seem as if the sandbox itself is slow, but the real problem here is the lack of state sharing across actions and I suspect this is why people remain convinced that macOS sandboxing is slow. And, by the way, this impacts more than just Objective C.

  • Sandboxing scenarios: Is it really necessary to sandbox all rules? genrules, sure, but if we can reason about the behavior of specific rules, we can probably disable sandboxing for them and remain correct, which would decrease build-time overheads. (Remember that the sandbox was never about security.)

  • File monitoring: An alternative to sandboxing is to monitor the activity of build actions and validate, post-build, that they didn’t access things they were not supposed to. This is much faster than sandboxing as it avoids the need to construct on-disk sandboxes… but fails when tools decide to read directories. So… this approach isn’t really feasible. JavaScript tooling loves to expand globs.

  • Hermetic sandboxing: The traditional Bazel sandbox does not restrict reading system-wide paths so it cannot guarantee that an action doesn’t access certain system headers or arbitrary tools in /bin. There is a “new” hermetic sandbox for Linux that fixes these issues—but, obviously, is hard to put in practice. See --experimental_use_hermetic_linux_sandbox.

  • Base POSIX system: With a hermetic sandbox, it becomes very tempting to model the host system’s base libraries and tools (bash and the like) as a toolchain. rules_sh from Tweag can help here.

  • Local remote execution: A different mechanism to achieve sandboxing would be to use remote execution against a local service. This service could run on a VM or similar and offer a mechanism to cross-compile against weird architectures. This wouldn’t be quite the same as the Docker strategy.

  • Separation of responsibilities: Somebody mentioned that it’d be neat to separate the sandboxing logic into its own thing that isn’t in Bazel’s core. I agree. The linux-sandbox binary that Bazel contains is somewhat akin to this, and I previously implemented something similar in sandboxctl (for very different reasons). In any case, this would need to be multi-platform and written in a decent, safe language (cough Rust cough).

Monorepo issues

With scale come problems, and large monorepos tend to exacerbate issues that have always existed in small repos but that go unnoticed otherwise. Here are some of the issues that talks mentioned:

  • Glob handling: Globs are expensive to handle in Bazel, especially when running on networked file systems. Globs are parsed and then expressed as a DAG of functions within Skyframe. Right now, due to limitations in Java’s concurrency model, globs are evaluated twice, but Java 21’s virtual threads may alleviate this. Related, I was chatting about this with one of the Buck 2 authors and he said that they take a very different approach and don’t find the issues that Google described: basically, they traverse the file system once and then apply the glob as an in-memory only operation.

  • Monolithic targets: Any large piece of code that has grown without a build system that enforces boundaries across components will grow at least one component that has thousands of source files in it with cyclic dependencies among them. These are harmful to the organization as they tend to reflect the “broken windows” culture of the team. Tinder explained how they tackle this problem, which basically boils down to: ensuring there are tests, ensuring that leadership is onboard with the cleanup, creating automation to extract files from the monolith (most of the work is repetitive), and then creating visualization tools to graph the problem and identify subsets of files that can be easily extracted.

  • Gradual rollout of library updates: During the unconference, we discussed how to deal with version upgrades of a single library in a world that favors the “One version policy”.

    • Prerequisites: We agreed that, to succeed at upgrading a library, it’s critical to have tests in place and leverage tooling like buildifier to perform the upgrades automatically where possible.

    • Big bang upgrade: One approach to the problem is to do the version upgrade in one go. This is sometimes really hard to do, and can also be risky because any problem anywhere in the repo would imply a rollback for everyone.

    • Treat it as an exception: Another approach is to get an exception from the “One version policy” and temporarily allow two versions of the library in the repo. You can leverage the alias rule to introduce an indirection to the version in use, and then slowly migrate packages. This can work, but conflicting diamond dependencies can be a real problem.

    • Parallel build: Yet another approach is to create a parallel build and test infrastructure for the new version. Maybe use a build flag to select the version to use; maybe duplicate the targets and rename them. The goal with this approach is to keep the “new version” on the side until it all works, and then submit a trivial “3-line PR” that flips the default. For this scenario, “project files” would be useful to group the new set of targets and automatically set the right configuration for them; Buck 2 has something similar in its PACKAGE construct.
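
A minimal sketch of how the flag-based variant could look, reusing the alias indirection from the “Treat it as an exception” bullet (all labels and flag names are illustrative):

```python
# The final "3-line PR" just flips build_setting_default to True.
load("@bazel_skylib//rules:common_settings.bzl", "bool_flag")

bool_flag(
    name = "use_foo_v2",
    build_setting_default = False,
)

config_setting(
    name = "foo_v2_enabled",
    flag_values = {":use_foo_v2": "True"},
)

alias(
    name = "foo",
    actual = select({
        ":foo_v2_enabled": "@foo_v2//:foo",
        "//conditions:default": "@foo_v1//:foo",
    }),
    visibility = ["//visibility:public"],
)
```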

    • Different package names: And another approach is to do what Java tends to do, which is to rename the libraries on major version upgrades so that they do not conflict at all. This is something that the upstream library maintainers have to do, not you. The problem with this approach is that semantic versioning is not always accurate and upstream maintainers tend to make breaking changes in minor version upgrades without realizing it.

  • git pain: Bazel and large monorepos go hand in hand, but Git—the most popular version control system—doesn’t like large repos very much: some operations scale with the size of the history (git clone) while others scale with the size of the repo (git status) instead of scaling with the size of the change.

    • Mitigation tools: There are various options that can mitigate the problem, including the FSMonitor, the “untracked cache”, keeping .gitignore small, and using sparse checkouts.

    • Materializing a portion of the tree: Uber previously tried using a Bazel query to find all files required for a build and then using the result of the query to just check those files out from Git. While this can work, it led to confusing error messages when there were missing files. And also, this poses the question: where do you run the query given that the query needs access to the full repo?

    • git archives: Git is a very chatty protocol. Mark Zeren from Broadcom reports that, while Perforce can easily saturate the network, Git can rarely do so on a similar repo. The slowdown is even visible when using the loopback network interface. Prebuilding git-archives periodically and using those to build the initial state of a Git clone can significantly improve performance. He mentioned that it’d be good to build a service to automate this, and offered help in doing so.

    • Delegating file accesses: There is an open feature request in #16380 asking for a way to delegate file accesses to another tool. This, to me, sounds like FUSE. Someone brought up that FUSE doesn’t work well on macOS anymore and that NFS can be used as an alternative, and Meta’s EdenFS or Buildbarn’s bb-clientd do that just fine. Another option is to implement a custom VFS instance within Bazel to directly talk to a remote file system.

  • Resource consumption: Bazel’s memory usage is… troublesome. First, because Bazel uses a lot of memory, and second because of the JVM’s heap limits. Newer Bazel versions were able to reduce retained heap memory by 95% after a build and 25-30% overall during a build… but that’s not very useful because peak memory consumption is what matters in many cases. The team is looking at various alternate solutions to tackle this problem:

    • Skyfocus: Bazel 8 contains a new feature called Skyfocus, in experimental state. The idea is to perform GC based on a user-defined working set. This is useful in large monorepos where users want to only work on a small portion of the tree, but there are questions about the usability aspects and UX of this solution.

    • Skeletal analysis: The goal of this work is to try to change the evaluation model to reduce peak memory by 20%. The idea is to trade memory usage for wall time, which means this will be useful for CI but could be harmful for interactive builds.

    • (Remote) Analysis caching: The loss of the Bazel analysis cache across reconfigurations or server restarts has always been a problem and a major usability annoyance, so this is finally being looked at. The goal is to be able to cache configured targets, aspects, action execution results, and other Bazel-internal state, possibly storing this information in the “standard” remote cache by using a special key. This could also help with a ~70% heap and runtime reduction for analysis phases from cold start, and it could be used to prune entire subgraphs during the execution phase. This would be a massive improvement for IDE interactions, as IDEs primarily rely on analysis-time operations only.

    • Distributed analysis: The idea is to shard analysis across Bazel workers. This sounds… OK for gigantic monorepos like Google’s but I’m not sure I see the use case outside of that environment. Still in very early stages of discussion.

bzlmod and external dependencies

The WORKSPACE in Bazel has always been a wart: Google did not have this feature in Blaze but the need to support out-of-tree third-party dependencies became glaringly obvious when they open-sourced Bazel. The WORKSPACE was then bolted on and we have been suffering from its deficiencies for years. bzlmod promises to fix them all, and of course there were talks (and a BoF) on it:

  • SOTU dependency updates: bzlmod now provides a vendor mode to download all dependencies of a target or all targets, which is very useful for CI/CD offline builds (e.g. to exercise disaster recovery). MODULE.bazel now has include support and use_repo_rule for easier migration. The lock file format has been revamped to minimize merge conflicts. The WORKSPACE is disabled by default in Bazel 8 and won’t be available in Bazel 9. As for future plans, the repo cache will be shared across workspaces and will include the results of evaluation, not just downloads.

  • SOTU rule updates: The Python, Android, Java, and shell rules have all moved to Starlark. Protobuf and Android support have moved as well, and this comes with new things like “mobile install v3”. The goal for next year is to complete the Starlarkification effort (with possible new extension points to call into Bazel from the Build API).

  • SOTU Starlark and the Build API: All struct providers are gone; aspects can propagate to target toolchains; and there are now dormant dependencies for tests. As for upcoming changes: symbolic macros in Bazel 8 (more later); rule finalizers; and types for Starlark (which are compatible with Python 3 but not with Buck).

  • Handling internal artifacts: It is common for vendors to provide binary blobs that don’t integrate with Bazel. Azul’s JDK is an example. For these, it is interesting to consume them into the build via a workspace repository, but the question then becomes how to do this with bzlmod. The answer is to “layer multiple registries”: you’d have a company-internal registry that exposes these binaries, and then fall back to the BCR for everything else.

  • Disabling the BCR: For companies that want to have tight control on what they access during the build, it’s perfectly possible to bypass the BCR, either by pointing Bazel to an internal registry or by vendoring sources. The bazel vendor subcommand, which I didn’t know about, can come in handy.

  • Missing packages: There are some glaring omissions from the BCR right now, which include googleapis, gRPC, and rules_scala. The maintainers of the former are not very cooperative, but progress is being made. As for gRPC, there is a desire to avoid pulling in support for all languages all the time, which makes this technically harder.

Secure and auditable builds

Maintaining the integrity of the software provenance chain is difficult and “recent” events like the XZ Utils Backdoor have shown the criticality of, at least, having an audit trail of what goes into software builds. This is where SBOM and other techniques come into play, and Bazel-based builds are perfectly positioned to provide these assurances:

  • SBOMs: There are many different formats to produce a Software bill of materials (SBOM), and a few of them like CycloneDX and SPDX are standardized. It’s important that these files are “merge-able” so that, when you import a third-party dependency, you can fold its SBOM into your own.

  • Types of targets: Generating the SBOM is tricky, and not all the rules that generate one did (or do) the right thing. For example: do all deps belong in the SBOM? What about deps that say neverlink=1? What about build-time only deps like compilers?

  • Rules: Bazel has had a licenses package-level function for a long time, but it is insufficient to generate an SBOM. The newer rules_license offers a much more complete solution to tracking licenses and generating SBOMs.
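
For reference, this is roughly what the rules_license declarations look like in a package (the license kind label is illustrative; check the ruleset’s docs for the exact kinds it ships):

```python
# Declare the package's license so SBOM/report generators can pick it up.
load("@rules_license//rules:license.bzl", "license")

package(default_applicable_licenses = [":license"])

license(
    name = "license",
    license_kinds = ["@rules_license//licenses/spdx:Apache-2.0"],  # illustrative
    license_text = "LICENSE",
)
```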

  • Build assurance: Other than declaring licenses, it’s important to be able to verify where builds came from. SLSA defines various levels that capture different requirements. At the very least, you should target SLSA 2, which requires builds to be produced from trusted environments and to prevent arbitrary users from tampering with those builds (e.g. only allowing uploads to the Action Cache by CI). SLSA 3 is even more strict but RE doesn’t yet provide a mechanism to implement it; there are some talks to support it in V3 and maybe backport it to V2, but no concrete plans yet.

Symbolic macros

Macros are problematic: they tend to cause targets to require public visibility; they tend to pollute the list of available targets in a package (the output of bazel query is unusable); and they have the potential of bloating the build graph massively. There was a full talk from Google on a new feature that’ll make these less of an issue:

  • New feature: Bazel 8 ships with symbolic macros, which are first-class citizens of the build graph. Symbolic macros are declared similarly to rules: instead of specifying rule(), you specify macro() and provide a set of attributes and an implementation function. The names of the targets and outputs that these macros produce are constrained to a set of well-defined patterns.
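
Based on the talk, defining one looks roughly like this (a hedged sketch; the attribute and target names are illustrative):

```python
# A symbolic macro wrapping a library + test pair. Note how the targets it
# creates must use names derived from the macro instance's `name`.
load("@rules_cc//cc:defs.bzl", "cc_library", "cc_test")

def _checked_cc_test_impl(name, visibility, srcs, deps):
    cc_library(
        name = name + "_lib",
        srcs = srcs,
        deps = deps,
        testonly = True,
    )
    cc_test(
        name = name,
        deps = [":" + name + "_lib"],
        visibility = visibility,
    )

checked_cc_test = macro(
    implementation = _checked_cc_test_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = True),
        "deps": attr.label_list(),
    },
)
```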

  • Limitations: One known deficiency is that **kwargs doesn’t work as it used to because there is no way to specify this in an attribute list. The way to mitigate this and make it easier to wrap rules will be attribute inheritance. Similarly, while symbolic macros can lead to lazy expansion, this is not yet implemented. Both of these limitations will be addressed later in the 8.x series.

  • Epilogs: Legacy macros used to call native.existing_rules(), which led to the “epilog macro” idiom that can be found in large monorepos. These macros are expected to appear at the end of a build file… but there is no way to enforce this. With symbolic macros, this idiom is now a supported feature via the finalizer=True attribute, which causes these macros to be expanded after the full build file has been processed, irrespective of any other targets.

  • Downsides: Symbolic macros were well-received by conference attendees but have received some pushback inside of Google. The reason is that, because these macros will allow lazy evaluation, it’s possible that they’ll encourage even larger packages that only become problematic in some (critical) scenarios.

  • Legacy macros: There are no plans to remove legacy macros right now—and even if there were, it’d take years to eliminate them due to the need to remain backwards-compatible for multiple releases.

Queries

For people coming from other build systems, the ability to query a build graph may sound like an alien concept. But Bazel has first-class features to do just this, and they can come in handy when writing automation to modify the source tree or to select tests for execution.

There was a full 30-minute talk dedicated to queries. I do not intend to repeat everything that was said because everything is in the query documentation… but I did learn a bunch of stuff:

  • Target provenance: bazel query --output=build ... shows “fictitious” attributes like generator_{name,function,location} which indicate what produced a target. Useful to understand what macros and globs are doing. These attributes can also be used in filters.

  • Variables: Queries can have variables in them to avoid repeating complex parts. For example: let v = foo/... in allpaths($v, //common) intersect $v.

  • Removing implicit dependencies: These often pollute query results with noise and can be removed with --noimplicit_deps. Arguably, this flag should be the default.

  • Graphing: It is possible to generate a Graphviz file by using --output=graph and then graphing it with dot -Tsvg -o graph.svg. It is important to filter the query; otherwise the resulting graph is overwhelming—aka useless.

  • Running code on the graph: The cquery command supports --output=starlark --starlark:expr=... which allows running a piece of Starlark code on every target selected by the query. The current target is bound to the target variable. This is useful to, for example, query transitive runtime JARs. And because Starlark is almost Python, you can explore the API by using the dir function with expressions like dir(target) or dir(target.actions).

  • Action queries: The aquery command is a different beast from the other queries and allows inspecting the action graph. Something like bazel aquery 'mnemonic(CppCompile, //...)' shows all inputs, outputs, and command lines for every action. This can be particularly useful to understand toolchain configuration changes.

  • Filtering by file: The outputs and inputs filters can be used to filter actions by paths: e.g. you can find actions that produce a particular file with outputs(".*path/to/h", deps(//...)). (Beware that the path filter is an anchored regexp.)

  • Performance: Running many queries in a loop doesn’t scale. Consider building a bigger query if possible and then doing post-processing (the opposite of the advice for talking to a database server), or using an aspect to dump all the info in a single run.

Linting

Back at Google, Jin and I hosted an intern circa 2017 to work on “autolint”: an innovative tool that was supposed to automate running arbitrary lint checks on arbitrary codebases. The project never materialized… but Aspect.dev just announced rules_lint which basically does the same thing and integrates with Bazel neatly.

To summarize the talk, rules_lint provides two disjoint features in one:

  • Formatting: The goal of formatting is to change a codebase to adhere to predefined standards and stop the bickering about whitespace. A quote from the talk I liked was “because while your code might be art… it isn’t ASCII art”.

    • Broad support: Supporting formatters is easy, and there are formatters for almost all languages, so a goal here is to try to support as many as possible.

    • Speed: Formatting must be deterministic and fast so that it can be hooked into the IDE and/or in pre-commit hooks.

    • Not part of the build graph: It’s tempting to try to hook formatting as an action, but formatting requires unconstrained access to the workspace to modify source files, and many source files that you’d like to format don’t necessarily have build rules (e.g. Markdown documentation).

    • Implementation: Formatting relies on rules_multitool which automates the process of downloading a bunch of tools and rules_multirun which allows running multiple tools in parallel.

    • Git blame: When applying formatting changes, don’t forget to register the commits that did so in .git-blame-ignore-revs. This will prevent your name from showing up in git blame operations as the author of all files.

  • Linting: The goal of linting is to detect potential problems in the code, but the constraints are different: the problems aren’t always conclusive, the solutions aren’t always unique nor safe to apply automatically, and analyzing a file may require analyzing more than one at once.

    • Speed: Linting is slow because it often requires using tools akin to compilers. As such, it should never be added to pre-commit hooks and should instead be plumbed through in CI.

    • Implementation: The desire is to have a uniform interface for linting all languages without adding extra rules or modifying build files. The obvious answer to this is to use an aspect which then augments existing rules with reports and patches to fix up the source tree. Validation actions were not an option because using them requires modifying the rulesets, and a premise of this work was to not do that.

    • Interface: Invoking an aspect and handling the resulting patch files is complicated, so Aspect.dev has augmented the Bazel client (the CLI) with an additional lint subcommand that automates this process. Interesting, and I suppose this could be turned into a pluggable mechanism by making Bazel look for bazel-<command> binaries at a well-known location (like the git dispatcher does). This reminded me of Twitter’s Pants build system, which is much more than just for building.

Testing

Bazel is often sold as a “build tool” but its secret sauce is that it really is a “test tool”. The true benefits of using Bazel come from realizing that not all tests have to run all the time, and that their results can be safely cached. Various talks touched upon how to do testing effectively:

  • Profiling data: Collecting profiling details of tests by default, and exposing these details as test outputs, is very useful because developers can then use this information to understand how to optimize long tests on their own. AirBnB reported that they do this.

  • Test granularity: We like to think that each test deserves its own test target… but that’s not necessarily how users think or how they interact with tests. Many times, developers end up running certain groups of tests in unison, and if they always do that, combining those tests under a single target will save build and execution time at the expense of a less “pure” dependency graph.

  • Tests in the build: In certain scenarios, users would like to depend on test outputs as part of the build process. One example that Luminar Technologies provided was the need to generate a detailed report of test results and code coverage for industry compliance reasons. The report generation should be an action that depends on the test outputs, but that’s not doable. As a compromise, they made the tests run as part of the build by replacing cc_test with a macro that spawns a non-test action with a custom runner (but only on CI).
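
I have not seen Luminar’s macro, but my reconstruction of the trick would look something like this (a hedged sketch; names and the generate_report knob are made up, and note that a failing test now fails the build):

```python
load("@rules_cc//cc:defs.bzl", "cc_binary", "cc_test")

def reported_cc_test(name, srcs, deps, generate_report = False):
    # The regular test target for developers.
    cc_test(name = name, srcs = srcs, deps = deps)
    if generate_report:
        # On CI only: also run the test as a build action so that a downstream
        # report-generation action can depend on its XML output.
        cc_binary(name = name + "_runner", srcs = srcs, deps = deps, testonly = True)
        native.genrule(
            name = name + "_results",
            outs = [name + "_results.xml"],
            tools = [":" + name + "_runner"],
            # Assumes a googletest-based binary that understands --gtest_output.
            cmd = "$(location :{}_runner) --gtest_output=xml:$@".format(name),
            testonly = True,
        )
```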

  • Smoke tests: Related to the previous and during the unconference, Luminar Technologies also brought up the desire to gate parts of the build and test process on “smoke tests”. In other words: once issuing the build/test operation, if we detect that a simple test fails, we shouldn’t continue the build or run any of the more expensive tests. The audience seemed to agree that this belonged into a separate tool. Ulf reminded us that there should still be a flag called --test_keep_going, which can be set to false to stop execution as soon as one test fails.

  • Test selection: When folks adopt Bazel, their initial approach to CI is to simply do bazel test ... in a job. This works well… until it doesn’t, because even in the presence of test caching, processing the whole monorepo becomes expensive. Uber explains how even an rdeps query becomes expensive and they’ve had to find other solutions. A key insight is that 71% of changes to their monorepo only change source files and only 15% change at least one build file. This means that the majority of changes do not modify the build graph structure significantly, so there is room for reusing state from previous builds (e.g. keeping the Bazel server alive) to compute newly-affected targets.

  • Flaky test granularity: Bazel detects flakes at the test target level… but a test target can contain tens or hundreds of individual test cases. Which means that if a single test case is flaky and we conclude that its target is flaky, subsequent builds could skip lots of valid test cases. By implementing custom flaky test case detection via junit.xml report files, Uber was able to surface 3500 additional tests that would have been skipped by the naive target-based approach and could re-add those to the build.

  • Profile-guided test execution: In large test invocations, it’s common for one test to be the long pole. If, by bad luck, this test ends up running last, it ends up dragging out the critical path of the test run and lengthening the p99 of test executions. It’d be nice if Bazel supported “profile-guided execution” so that such long poles could be scheduled earlier. Jin reported that this has been looked at but one problem is that there hasn’t been a good place to store this information across invocations. Ulf mentioned that when we have async action execution (which Java 21 makes possible), we should be able to improve action scheduling and implement this.

  • Actions per test: One interesting detail that Ulf shared is that test targets generate three actions, possibly more: one to execute the test itself, another to generate the junit.xml file if the test failed to do so, and another to perform coverage post-processing. If retries, sharding, or --runs_per_test are involved, there can be even more actions. So, if it becomes necessary, there is precedent to add more actions to perform other types of post-processing.

  • CI runners: A topic that came up during the unconference was that starting Bazel on CI runners is very painful because of the need to download repos, repopulate the local disk cache, and rebuild the analysis graph. The general advice is to prevent discarding CI runners across builds, but that’s not always possible (e.g. security might not like to share runners across different jobs). For Kubernetes, stateful sets provide a mechanism to preserve the disk cache even during auto-scaling.

Missing tooling around Bazel

A fun topic that arose in the unconference was: “what support tools do you wish Google had open sourced when they published Bazel, or that the community had built?” There were many ideas from the audience:

  • Scripting language for use in genrules: Writing genrules that work across systems is difficult because they invoke the shell, and the tools that are often called do not always support the same syntax (thanks GNU). It’d be nice to have a language that allowed writing genrules in a platform-agnostic way. Ideas that came up included WASM and a Starlark interpreter with file system access.
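
If you have not hit this yourself, here is the classic example of the problem (GNU sed and BSD sed disagree on how -i takes its argument, so this genrule works on Linux and breaks on macOS):

```python
genrule(
    name = "versioned_header",
    srcs = ["version.h.in"],
    outs = ["version.h"],
    # GNU-only: BSD sed (macOS) requires `sed -i ''` for an in-place edit.
    cmd = "cp $< $@ && chmod +w $@ && sed -i 's/@VERSION@/1.2.3/' $@",
)
```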

  • Affected target determination: The goal of this tool would be to answer the question of which targets are impacted by a change to the source tree. bazel-diff more or less does this. An rdeps query sounds like the right solution, but in practice it is tricky because changes to a toolchain aren’t visible in a query; they do show up in cquery though, but then you have to be aware of all possible configurations to do the determination correctly.

  • Build queuing: The goal here is to have a service that can queue entire builds and not just actions. This might exist in some CI systems already.

  • Unifying bazelisk with bazel: Users are confused by the fact that sudo apt install bazel is available but generally does the wrong thing, and it’d be good if bazelisk’s features existed within Bazel. I explained that we built something similar to bazelisk at Snowflake but for arbitrary tools as a way to hijack command line executions and redirect them to checked-in tools; maybe we should open-source it.

  • buildozer for bzl files: Automating edits to bzl files would be handy, although this is a hard problem. Maybe it could be scoped down to e.g. automatically clean up old repository rules that aren’t referenced.

  • Java library to programmatically edit build and bzl files: The current Starlark library isn’t useful because it’s not standalone and it doesn’t preserve the original file structure. A workaround is to extract the AST from Python because these files remain syntactically valid from Python’s perspective.

  • Documentation website for all the rules: The BCR could fulfill this, but it currently does not have links to documentation.

The end

And finally, here are the few leftover thoughts that I did not know how to fit in any of the sections above:

  • Nix: There were many talks about Nix on the original submissions, but only a couple made it to the final schedule. The folks at Modus Create described how they use rules_nixpkgs to build external dependencies and how they stage those in an NFS-shared /nix/store/ tree. Because Bazel invokes Nix and stages the dependencies during its external repository phase, the execution phase and the remote workers can assume that the Nix packages are in the right locations and things just work.

  • RPM integration: In a similar spirit to using Nix to install packages and/or build containers, we also have bazeldnf. This ruleset provides the rpm repository rule to import RPM packages and the rpmtree build rule to combine one or more RPMs into a tarball that can later be fed to other rules like container_layer. In this context, you might also be interested in the patchelf ruleset.

  • Migrations: Migration talks are getting old, but there were some good insights from MongoDB’s talk. One such insight was that, as in any software rewrite, the “old” build system doesn’t stay still during a migration, and the set of features to migrate grows over time. Another such insight was that, if you can couple the new build system to the old one (by making one rely on the other for increasing portions of the code), you can minimize the time in which new features get added to the old build and not the new one. The downside is that you’ll be building throwaway work, but overall it will pay off.

  • My talk about Snowflake: I gave a 15-minute “migration” talk about what we are doing with Bazel at Snowflake. Because of what I said just above, I skipped over all of the migration topics and instead focused on explaining 5 different technical challenges that we faced and how we fixed them. I was met with lots of questions after the talk, which was really nice, but which means that we owe a full migration post when we are done (very soon now!).

I hear you want the raw notes? Are you sure? I really don’t recommend that you look… but if you must, here they are: BazelCon day 1 notes, BazelCon day 2 notes, and the Build Meetup’s notes.