The point of this post is simple and I’ll spoil it from the get go: every time you make an assumption in a piece of code, make such assumption explicit in the form of an assertion or error check. If you cannot do that (are you sure?), then write a detailed comment.

In fact, I’m exceedingly convinced that the amount of assertion-like checks in a piece of code is a good indicator of the programmer’s expertise.

But… why? Story time!

The bug

Bazel has a progress reporting message that says how many build actions are currently runnable and how many are actually running. For example, given a --jobs=16 setting on a machine with 12 cores and assuming your build graph has sufficient parallelism, there can be 16 active actions but only 12 can run because that’s what the machine’s resources allow: the other 4 are waiting on those resources. Under these conditions, the progress reporting message would say: 16 actions, 12 running.

Bazel also has a feature known as dynamic scheduling, which we recently announced. With this feature, all actions allowed by the --jobs concurrency degree are sent to a remote execution service and a subset of those actions, as allowed by the machine’s resources, are also executed on the local machine. In other words: a subset of actions run in two places at the same time.

One of our initial adopters for this dynamic scheduling feature reported that Bazel often said 100 actions, 24 running, with varying counts of 24. They were curious to know why this implied that 100 actions (as specified by our --jobs=100 setting) were runnable but only ~24 were truly running. This made no sense because we have sufficient remote execution capacity to allow those 100 actions to run at the same time. Furthermore, the low running counts had no relation to the number of CPUs on their machine.

I wanted to track this down to understand what these numbers were really counting. To do this, I searched for the progress message with the regexp actions,.*running, which quickly yielded ExperimentalStateTracker.java:563. From there, it didn’t take long to understand what was going on.

Some details on Bazel’s architecture

But before we continue, it is important to understand a couple of concepts of behind Bazel’s design:

Bazel is implemented as a loosely-coupled collection of modules, defined by the BlazeModule interface.
To keep the coupling low, Bazel uses Guava’s event bus to post state change notifications across modules.

This all sounds good on paper, but the problem is that the possible set of messages, the conditions in which they get sent, and their relative ordering is… under-specified (which is a mild way of saying “not specified at all”).

Understanding the bug

Why is any of this related? Well, as it turns out, the progress reporting code in ExperimentalStateTracker extensively relies on processing events to process what is going on in Bazel. In particular, this code receives ActionStatusMessage notifications, which indicate action status changes. These messages are “keyed” by the action identifier and are posted, for example, when an action becomes runnable or when an action is truly running.

You may start to see where the pieces fall together. As I said above, dynamic scheduling causes some actions to run twice concurrently… which means we end up with “conflicting” ActionStatusMessage events for the action. It is possible for an action to start running remotely, at which point it is counted by the tracker as runnable and running… but once we try to run the action locally and realize we don’t have enough spare resources, its local counterpart blocks on the scheduling stage and is reported as runnable, not running. From the tracker’s perspective, the action “regressed” and went from running to runnable, which is something that should never happen… and miscounts actions.

Conclusion

Alright, so how does this relate, at all, to my initial claim? Well, the author of the status tracker mentioned that, yes, the tracker assumes that actions are unique. This is a perfectly reasonable thing to assume. But this assumption was… nowhere. Not in the code, not in comments. Which means that the bug in action counting manifested itself when the assumptions were violated and we only knew about it because of an avid user.

Now imagine: if the tracker had validated this assumption, say with an assertion, Bazel would have crashed the moment it detected this action miscounting. Given that this behavior is triggered by a new feature, the developer of the new feature would have immediately noticed the breakage in progress reporting: CI pre-submit tests would have failed and would not have let them check the code in. But instead, the bug persisted and it took some curiosity and digging after-the-fact to find out the issue. Which luckily gave me enough material for a post 😉

This bug is tracked at https://github.com/bazelbuild/bazel/issues/7345.

Fixing this bug seems like a nice starter project to get into the core of Bazel’s architecture! Let me know if you are interested in helping.