Bazel’s dynamic execution is a feature that makes your builds faster by using remote and local resources, transparently and at the same time. We launched this feature in Bazel 0.21 back in February 2019 along with an introductory blog post and have been hard at work since then to improve it.
The reason dynamic execution makes builds faster is two-fold:
- first, because we can hide hiccups in the connectivity to the remote build service; and,
- second, because we can take advantage of things like persistent workers, which are designed to offer super-fast edit/build/test cycles.
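To make the "at the same time" part concrete, here is a minimal sketch of the idea in Python: start a local and a remote branch of the same action together, take whichever finishes first, and ask the other one to stop. This is an illustration only and not Bazel's actual scheduler code; `run_dynamic`, `simulated_action`, and the sleep-based timings are all hypothetical names and stand-ins.

```python
import concurrent.futures
import threading


def simulated_action(name, seconds, cancelled):
    """Hypothetical stand-in for running one action on one branch.

    Waits `seconds` to simulate work. Returns None if it was asked to
    stop before finishing, otherwise a fake output string.
    """
    if cancelled.wait(timeout=seconds):
        return None  # the other branch won; give up early
    return f"{name} output"


def run_dynamic(local_action, remote_action):
    """Race both branches and return (winning_branch, result)."""
    cancelled = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(local_action, cancelled): "local",
            pool.submit(remote_action, cancelled): "remote",
        }
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        # Tell the slower branch to stop; its result is discarded.
        cancelled.set()
    winner = next(iter(done))
    return futures[winner], winner.result()


if __name__ == "__main__":
    # A remote hiccup (slow remote branch) is hidden by the local branch.
    branch, output = run_dynamic(
        lambda c: simulated_action("local", 0.01, c),
        lambda c: simulated_action("remote", 0.5, c),
    )
    print(branch, output)
```

Swapping the two delays models the opposite case, where the remote service answers first and the local branch is the one that gets cancelled.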
Put in numbers, here is what dynamic execution looks like for a relatively large iOS build I measured at Google a few months ago:
Notice how the Dynamic column has the best numbers in each row. Essentially, dynamic execution has the potential to match (and improve on) the Remote timing for clean builds and the Local timing for incremental builds.
Unfortunately, the table above shows best-case metrics obtained from a controlled benchmark with an idle machine and good network connectivity. In real-world deployments, we observed worse incremental build times with dynamic scheduling, and we traced them back to the way the dynamic scheduler was originally written.
But how did we find these issues, you ask? By reasoning about the inner mechanics of dynamic execution and making informed guesses about where the problems might lie. When the dynamic scheduler was first introduced, it was written with Google-internal builds in mind. That is:
- very (really) large builds,
- always executed remotely,
- from very powerful workstations,
- but which want to take advantage of persistent workers to improve the edit/build/test cycle.
In particular, the dynamic scheduler was written to support languages designed for quick iteration like Dart and TypeScript. Or to put it another way: dynamic execution took builds that had been fully remote and added the ability to execute a subset of them locally for increased incremental build performance.
But the constraints of builds in the wider Bazel ecosystem, as well as of iOS builds within Google, are very different. These builds:
- have traditionally been local-only (and thus are much smaller and have good incremental build performance),
- have had annoyingly long clean build times,
- and want to use remote execution to improve the clean build case.
This different scenario invalidated some of the assumptions in the implementation of the now-legacy spawn scheduler. To correct them, we have pretty much rewritten this feature from the ground up.
In this post series, I will tell you how the dynamic scheduler works internally, where the problems lay, and how we fixed or are fixing them in the new scheduler implementation. If you attended my BazelCon 2019 talk, these posts build upon what I presented there.