Bazel's dynamic strategy

After introducing Bazel’s dynamic execution a couple of posts ago, it’s time to dive into its actual implementation details as promised. Pardon the interruption in the last post: I had to take a little detour to cover a topic (local resources) that today’s article depends on.

Simply put, dynamic execution is implemented as “just” one more strategy called dynamic. The dynamic strategy, however, is different from all others because it does not have a corresponding spawn runner. Instead, the dynamic strategy wraps two different strategies: one for local execution and one for remote execution.

But which strategies does it wrap? Well, that’s configurable via the --dynamic_local_strategy and --dynamic_remote_strategy flags. These behave in the same manner as --strategy, so they can be set for specific action mnemonics, but they select the strategy used by the local and remote code paths within the dynamic scheduler, respectively. Their defaults are sandboxed and remote, respectively.

With that in mind, the conceptual behavior of the dynamic strategy is simple:

Slide 15 of my talk on dynamic execution for BazelCon 2019

As you can see from the figure, the dynamic strategy delegates execution of the spawn given to it at exec() time to the remote and local strategies at once and then waits for their completion. Whichever strategy terminates first wins, and the other is stopped. This all happens inside LegacyDynamicSpawnStrategy#exec by means of Java’s ExecutorService#invokeAny primitive.

invokeAny is an executor method that takes a collection of callables (two, in our case), schedules them to run, and waits until one of them completes successfully. When that happens, invokeAny actively cancels the tasks that have not finished yet (which, as we will see later, is where a lot of the difficulties lie). And as you can imagine, what these callables do is execute the exec() method of the wrapped strategies.
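To make the race concrete, here is a minimal sketch in plain Java. It is not Bazel’s code: the class name, the branch() helper, and the sleep-based “strategies” are all made up for illustration. The only real API it relies on is ExecutorService#invokeAny, which returns the result of the first task to succeed and cancels the rest.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class DynamicRaceSketch {
  /** Set when the losing branch observes its cancellation (an interrupt). */
  static final AtomicBoolean loserInterrupted = new AtomicBoolean(false);

  /** Hypothetical stand-in for a wrapped strategy's exec(): sleeps, then returns its name. */
  static Callable<String> branch(String name, long millis) {
    return () -> {
      try {
        Thread.sleep(millis);
        return name;
      } catch (InterruptedException e) {
        // The cancellation issued by invokeAny arrives here as an interrupt.
        loserInterrupted.set(true);
        throw e;
      }
    };
  }

  /** Runs both branches at once and returns the name of whichever finishes first. */
  static String race(long localMillis, long remoteMillis) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      return executor.invokeAny(
          Arrays.asList(branch("local", localMillis), branch("remote", remoteMillis)));
    } finally {
      executor.shutdownNow();
      // Wait for the cancelled branch to actually unwind before returning.
      executor.awaitTermination(5, TimeUnit.SECONDS);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("winner: " + race(50, 2000));
    System.out.println("loser interrupted: " + loserInterrupted.get());
  }
}
```

Note that the losing branch only stops promptly because Thread.sleep reacts to interrupts. A branch that ignores interrupts, or that has side effects to clean up, would keep running after the race is decided.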

An important and easy thing to notice in the figure above is how the number of jobs and the local resource tracking play together. Because we want to benefit from the massive parallelism of remote execution, we must configure --jobs to a high value (think hundreds) so that we have enough threads entering the dynamic strategy that can delegate to the remote strategy. But allowing all those threads to also delegate to the local strategy would overwhelm the local machine… so those are gated on checking that there are sufficient local resources left.
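The interplay between a high job count and the local resource gate can be sketched as follows. This is an illustration under loud assumptions, not how Bazel tracks resources: a plain Semaphore stands in for the local resource tracking, the sleeps stand in for real work, and every name here is hypothetical. Many job threads enter the race, the remote branch always starts, and the local branch first has to obtain a slot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class LocalGateSketch {
  /** Runs `jobs` races and returns the peak number of local branches running at once. */
  static int maxConcurrentLocal(int jobs, int localSlots) throws Exception {
    Semaphore localResources = new Semaphore(localSlots); // stand-in for resource tracking
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();

    Callable<String> local = () -> {
      localResources.acquire(); // blocks (interruptibly) until a local slot is free
      try {
        peak.accumulateAndGet(running.incrementAndGet(), Math::max);
        Thread.sleep(20); // pretend to run the action locally
        return "local";
      } finally {
        running.decrementAndGet();
        localResources.release();
      }
    };
    Callable<String> remote = () -> {
      Thread.sleep(100); // pretend to run the action remotely
      return "remote";
    };

    ExecutorService jobPool = Executors.newFixedThreadPool(jobs); // plays the role of --jobs
    ExecutorService branchPool = Executors.newCachedThreadPool();
    try {
      // Each job races its own local and remote branch.
      Callable<String> job = () -> branchPool.invokeAny(Arrays.asList(local, remote));
      List<Future<String>> results = new ArrayList<>();
      for (int i = 0; i < jobs; i++) {
        results.add(jobPool.submit(job));
      }
      for (Future<String> result : results) {
        result.get(); // wait for every race to be decided
      }
      return peak.get();
    } finally {
      jobPool.shutdownNow();
      branchPool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("peak local branches: " + maxConcurrentLocal(16, 2));
  }
}
```

However many jobs enter the race, the number of local branches doing work at the same time never exceeds the slot count. Local branches still waiting for a slot are simply cancelled when their remote counterpart wins.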

This design allows the dynamic strategy to provide the best possible clean and incremental build times. For clean builds, we mostly benefit from the breadth of remote workers accessible via remote execution, while maybe also running some actions locally. And for incremental builds with little parallelism, we mostly end up running everything locally and can transparently take advantage of features designed to aid in the edit/build/test cycle (e.g. persistent workers).

But note something: by the time an action’s execution has reached the strategy, it’s already too late to make any smart decisions on which actions to run. We already know we have to send a spawn to remote execution, and we will try our best to run it locally if resources permit… but that’s about it: there is no way to be smart about what parts of the build graph should only run locally or only remotely. This is why I do not like calling this feature “dynamic scheduling”: there is not much scheduling taking place. You can probably imagine some good heuristics to improve this, and we’d like to implement those, so this is an open area of work. But, for now, this simplistic design already yields surprisingly good results.

Now, while this design is simple, the devil is in the details, and there are a lot of details that are not easy to handle. We’ll dive into some of these in the upcoming posts.

This article is part number 2 of 6 of the Bazel dynamic execution series.
