Artifact downloads and dynamic execution

In the previous post of this series, we looked at how the now-legacy implementation of the dynamic strategy uses a per-spawn lock to guard accesses to the output tree. This lock is problematic for a variety of reasons and we are going to peek into one of those here.

To recap, the remote strategy does the following:

Send the spawn execution RPC to the remote service.
Wait for successful execution (which can come quickly from a cache hit).
Lock the output tree (only when run within the dynamic strategy).
Download the spawn’s outputs directly into the output tree.

Note that we lock the output tree before we have downloaded any outputs. Once the remote branch holds the lock, the local branch of the same spawn cannot start or complete, even if there are plenty of local resources available to run it.

This is a problem because the cost of the downloads can be much higher than the cost of the actual execution, especially on a remote cache hit, where the execution RPC returns almost instantly. When this happens, our dynamic strategy does not abide by its promise of “picking the fastest branch to complete”: running the spawn locally could well have finished before the downloads complete.
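To put some numbers on this, here is a deliberately simplified model of the legacy behavior. The function name, the timings, and the assumption that the local branch only claims the lock when it finishes are all mine, chosen for illustration; the real scheduler is more involved.

```python
def legacy_completion_time(remote_exec_s, download_s, local_s):
    """Return (winner, seconds) for one spawn under the legacy locking scheme.

    Simplifying assumptions (not taken from Bazel's code):
    - The remote branch locks the output tree as soon as execution succeeds,
      which is *before* it downloads anything.
    - The local branch is modeled as claiming the lock when it finishes.
    """
    if remote_exec_s <= local_s:
        # Remote holds the lock first, so the spawn now has to wait for the
        # downloads even though the local branch would have finished earlier.
        return "remote", remote_exec_s + download_s
    return "local", local_s


# Hypothetical cache hit: the RPC returns in half a second, the downloads
# take 30 seconds, and running locally would have taken 10 seconds.
winner, elapsed = legacy_completion_time(remote_exec_s=0.5, download_s=30.0, local_s=10.0)
print(winner, elapsed)
```

In this made-up scenario the remote branch "wins" even though the local branch would have been three times faster, which is exactly the broken promise described above.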

Slide 26 of my talk on dynamic execution for BazelCon 2019

In the slide above, you can see additional problems: by issuing the downloads after grabbing the lock, the spawn is now network-bound and subject to networking blips. The spawn could take arbitrarily long to complete, and the build may get stuck if the downloads don’t complete for any reason. Remember that these builds used to be fully local and therefore network-agnostic, but the dynamic strategy has made them susceptible to networking problems.

Womp. Womp.

Luckily enough, the solution to this issue is trivial on paper: make the remote path download its outputs to a temporary location and, only once these downloads have finished, grab the lock and move the outputs to their final location. Visually:

Slide 27 of my talk on dynamic execution for BazelCon 2019
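The same idea can be sketched in a few lines of code. This is a minimal illustration and not the actual remote strategy: the function names, the in-memory `outputs` dictionary standing in for network downloads, and the single `threading.Lock` standing in for the per-spawn lock are all assumptions made for the example.

```python
import os
import tempfile
import threading

# Hypothetical stand-in for the per-spawn lock that guards the output tree.
output_lock = threading.Lock()


def stage_outputs(outputs, staging_dir):
    """Write each output into staging_dir without touching the output tree.

    `outputs` maps a path relative to the output tree to the file's bytes.
    In a real remote strategy these bytes would come from slow downloads.
    Returns a mapping from relative path to the staged file's path.
    """
    staged = {}
    for index, (rel_path, data) in enumerate(outputs.items()):
        staged_path = os.path.join(staging_dir, f"{index}.tmp")
        with open(staged_path, "wb") as f:
            f.write(data)
        staged[rel_path] = staged_path
    return staged


def commit_outputs(staged, output_root):
    """Grab the lock and move the staged files to their final location."""
    with output_lock:
        for rel_path, staged_path in staged.items():
            final_path = os.path.join(output_root, rel_path)
            os.makedirs(os.path.dirname(final_path), exist_ok=True)
            # A rename is cheap as long as both paths share a file system.
            os.replace(staged_path, final_path)


with tempfile.TemporaryDirectory() as root:
    staging_dir = os.path.join(root, "staging")
    output_root = os.path.join(root, "out")
    os.makedirs(staging_dir)
    os.makedirs(output_root)

    # Slow part: no lock held, so the local branch is free to run and win.
    staged = stage_outputs({"a/b/c.o": b"object code"}, staging_dir)
    # Fast part: the lock is held only for the duration of the renames.
    commit_outputs(staged, output_root)
    print(os.listdir(os.path.join(output_root, "a", "b")))
```

The point of the split is that the lock is now held only while files are renamed, which is fast and does not depend on the network.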

In practice, however, this turned out to be quite hard to get right. Some of the issues we faced:

The first attempt at downloading outputs as temporary files just added a .tmp suffix to the downloaded files. This works unless tree artifacts (output directories produced by an action, as opposed to individual files) are at play, and even then it only breaks if you pass --nolegacy_spawn_scheduler, which I’ll cover later.
The second attempt moved the temporary files to a separate directory and gave them names derived from their full path relative to the output tree. (E.g. an output file a/b/c.o would end up as a_b_c.o.tmp in the separate temporary directory.) This fixed the issue with tree artifacts, but introduced problems on macOS because the file names became too long.
The third attempt just kept the separate temporary directory but gave all temporary files unique names using a monotonically increasing counter.
Lastly, and more interestingly: up until this change, the remote strategy would rarely be interrupted during the download stage. The only case in which this would happen is if the user hit Ctrl+C to interrupt the build, and at that point we only cared about a best-effort cancellation of ongoing work.
But after this change, the dynamic strategy now cancels remote actions routinely while they are downloading files. This exposed various deficiencies in the process that manifested as both stack overflows due to long chains of future listeners, and as memory thrashing.
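To make the difference between the second and third naming attempts concrete, here is a small sketch. The helper names are hypothetical, the deep path is made up, and the 255-byte figure in the comments is my assumption about a typical per-name limit rather than something measured for this post.

```python
import itertools


def flattened_name(rel_path):
    """Second attempt: derive the temporary name from the full relative path."""
    return rel_path.replace("/", "_") + ".tmp"


class CounterNamer:
    """Third attempt: names come from a monotonically increasing counter."""

    def __init__(self):
        self._counter = itertools.count()

    def next_name(self):
        return f"{next(self._counter)}.tmp"


shallow = "a/b/c.o"
# Made-up deeply nested output path, to show how flattened names grow.
deep = "/".join(["some_long_package_name"] * 15) + "/c.o"

print(flattened_name(shallow))
# The flattened name grows with the depth of the path and can exceed a
# per-name limit (assumed here to be 255 bytes) even though no single
# component of the original path comes close to it.
print(len(flattened_name(deep)))

namer = CounterNamer()
# Counter-based names stay short no matter how deep the output path is.
print(namer.next_name(), namer.next_name())
```

The counter approach gives up human-readable temporary names, so the code doing the final move has to remember which temporary file maps to which output.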

All these issues are now fixed and we have enabled this feature to download outputs as temporary files by default. Keep in mind, however, that the Google-internal remote strategy is different from the open-source one for historical reasons; I know the public implementation received the fixes to download files to a temporary location, but I’m not sure if those have been carefully tuned to not suffer from any of the problems above.

With these fixes in place, the dynamic strategy should live up to its promise of picking the fastest branch to complete. Furthermore, we can also claim that builds using the dynamic strategy should not be affected by networking blips. Taken to the limit, when the network breaks down entirely, this means that the dynamic strategy transparently falls back to local-only builds.

Can we prove it though?

Slide 28 of my talk on dynamic execution for BazelCon 2019

In the graph above, you can see the median build times for a relatively large iOS app at Google. Median build times are meant to capture the behavior of incremental builds, which are the ones where you want the edit/build/test cycle to be as fast as possible.

To the left of the vertical green divider, median build times for dynamic execution (red line) are all over the place: incremental builds using dynamic scheduling can be similar to local-only builds (blue line) but also much worse. To the right of the vertical green divider, we get the opposite: dynamic build times are always better than local-only build times. And what does that vertical divider represent? The point in time when we rolled out the fixes described above, of course! So yes, the fixes paid off.

Unfortunately, fixing this issue exposes another problem. By making downloads happen before taking the lock, we are decreasing the number of cache hits we benefit from: the remote path now takes longer to grab the lock, which means that the local path has more chances of winning than before.

Fixing this is still work in progress and requires the new implementation of the dynamic scheduler, which we’ll touch on next.

This article is part number 4 of 6 of the Bazel dynamic execution series.