When the dynamic scheduler is active, Bazel runs the same spawn (aka a command line) both remotely and locally at the same time via two separate strategies. These two strategies want to write to the same output files (e.g. object files, archives, or final binaries) on the local disk. And in computing, two writers mutating the same resource require some kind of coördination.
You might think that, because we assume both strategies are equivalent and will write the same contents to disk¹, this is not problematic. But, in fact, it can be, because file creations and writes are not atomic. So we need some form of mutual exclusion in place to avoid races.
As you can imagine, the simplest form of mutual exclusion is a lock. So… drumroll… that’s precisely what the now-legacy dynamic scheduler implements: a lock per spawn on the output tree. Concurrent branches of the same spawn must grab a lock immediately before writing to the output tree. You can see this in practice in
LegacyDynamicSpawnStrategy#lockOutputFiles, which is a lambda that gets passed to the delegate strategies to grab the lock when they are ready to mutate the output tree.
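The lock-per-spawn scheme can be sketched as follows. This is a minimal, hypothetical illustration (the names and structure are mine, not Bazel's actual APIs): each branch of the spawn receives a locker callback and invokes it right before touching the output tree.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockPerSpawn {
    // One lock per spawn guards that spawn's outputs in the output tree.
    static final ReentrantLock outputLock = new ReentrantLock();

    // Stand-in for the lockOutputFiles lambda handed to each delegate
    // strategy: a branch calls it right before mutating the output tree.
    interface OutputFilesLocker {
        void lock() throws InterruptedException;
    }

    public static void main(String[] args) throws Exception {
        OutputFilesLocker locker = outputLock::lockInterruptibly;

        Runnable branch = () -> {
            try {
                // ... the action runs here (remotely, or locally in a
                // sandbox) without touching the output tree ...
                locker.lock();  // grab the lock only to publish outputs
                try {
                    System.out.println("outputs written by "
                        + Thread.currentThread().getName());
                } finally {
                    outputLock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        // Two concurrent branches of the same spawn racing to publish.
        Thread remote = new Thread(branch, "remote-branch");
        Thread local = new Thread(branch, "local-branch");
        remote.start();
        local.start();
        remote.join();
        local.join();
    }
}
```

Note that this only models the mutual exclusion; the real strategy also cancels the losing branch, which this sketch omits.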
Conceptually, this looks like the following (note the red boxes):
Let’s look at each branch separately:
Handling remote actions (top branch in the figure) is easy: whatever the action does remotely cannot possibly affect the contents of the local output tree. It isn’t until Bazel downloads any generated outputs that we have to lock the output tree. An important detail here is that remote cache hits/misses are known before taking the lock.
Handling local sandboxed actions (bottom branch in the figure) is similarly easy: as actions run in a sandbox, they write their outputs to a separate tree. We only need to grab the lock once the action has successfully completed in order to move the files to their destination.
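Under these assumptions, both "easy" branches touch the output tree only at the very end. A rough sketch (hypothetical helper names, not Bazel's real code) of the sandboxed case, where the lock guards just the final move from the sandbox tree into the output tree:

```java
import java.nio.file.*;
import java.util.concurrent.locks.ReentrantLock;

public class SandboxedBranch {
    static final ReentrantLock outputLock = new ReentrantLock();

    // Sandboxed branch: the action writes into a private sandbox tree
    // without holding the lock; only the final move into the real
    // output tree needs mutual exclusion.
    static void sandboxedBranch(Path sandboxOut, Path realOut) throws Exception {
        Files.writeString(sandboxOut, "local result");  // runs lock-free
        outputLock.lock();
        try {
            Files.move(sandboxOut, realOut, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            outputLock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("sandbox-demo");
        Path staged = dir.resolve("staged.o");
        Path real = dir.resolve("real.o");
        sandboxedBranch(staged, real);
        System.out.println(Files.readString(real));
    }
}
```

The remote branch is structurally identical: execution happens elsewhere, and only the download of the fetched outputs would sit in the locked region.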
Ta-da! Problem solved.
But… as you may remember, sandboxing can be very slow, like 3x slower on macOS—and for our Google-internal iOS builds, sandboxing very much exposes such slowdowns. Hence we “temporarily” disabled sandboxing years ago for interactive builds (and hence my personal interest in sandboxfs).
Consequently, we had to make dynamic execution work with local unsandboxed actions. These are trickier: the local strategy runs subprocesses directly against the output tree, and there is essentially nothing you can do to tell when and where a subprocess will end up writing. Therefore, for any local unsandboxed action, Bazel must grab the lock upfront, before executing anything.
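In code, the contrast with the sandboxed case is stark: the whole subprocess execution, not just the output publication, sits inside the critical section. Again a hypothetical sketch, not Bazel's actual implementation:

```java
import java.util.concurrent.locks.ReentrantLock;

public class UnsandboxedBranch {
    static final ReentrantLock outputLock = new ReentrantLock();

    // Unsandboxed local execution: the subprocess writes straight into
    // the output tree at unpredictable times, so the only safe option
    // is to hold the lock for the entire duration of the execution.
    static void localUnsandboxed(Runnable runSubprocess) {
        outputLock.lock();  // grabbed BEFORE spawning the subprocess
        try {
            runSubprocess.run();  // may write anywhere in the output tree
        } finally {
            outputLock.unlock();
        }
    }

    public static void main(String[] args) {
        localUnsandboxed(() ->
            System.out.println("subprocess ran under the lock"));
    }
}
```

This is also why this branch hurts: the remote branch cannot publish anything while a local unsandboxed action holds the lock, even if the remote result arrives first.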
We end up with a model like the following (note the new branch at the bottom compared to the earlier figure):
This works reasonably well, in the sense that dynamic execution delivers good performance… when networking conditions are perfect. Mind you, this is precisely the environment for which the dynamic strategy was originally written: Google-internal builds that have traditionally been remote-only and that run from high-end workstations with super-fast local networking.
But that’s not always the case, and thus tweaking the dynamic strategy to support standalone execution uncovered other deficiencies that we had to address.
We will look at the two major flaws behind this implementation in upcoming posts. To get your brain started: what happens if we get stuck fetching outputs from the remote side? And what happens if we are unusually slow to notice remote cache hits?
Assuming that the two strategies generate the exact same output, bit by bit, for the same inputs and command line is a leap of faith. But if you are willing to use dynamic execution, you must accept this and make every possible effort to ensure it is true, like using the same toolchains and environment on your remote and local machines, and ensuring your build actions are deterministic. ↩︎