This article is part number 6 of the Bazel dynamic execution series.

To conclude the deep dive into Bazel’s dynamic spawn strategy, let’s look at the nightmare that tree artifacts have been with the local lock-free feature. And, yes, I’m double-posting today because I really want to finish this series before the end of the decade[1]!

Tree artifacts are a fancy name for action outputs that are directories, not files. What’s special about them is that Bazel does not know a priori what the directory contents are: the rule behind the action just specifies that there will be a directory with files, and Bazel has to treat that as the unit of output from the action. Other than that, tree artifacts are “just” a different kind of output[2].

Let’s consider a tricky action that spawns a script that goes like this:

  1. Create the tree artifact directory.
  2. Write a bunch of transient or temporary files in the tree artifact directory.
  3. Write a bunch of output files in the tree artifact directory.
  4. Remove the transient files immediately before terminating.
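To make this concrete, here is a minimal Python sketch of such a script. The file names and the `interrupt_before_cleanup` flag are made up for illustration; the flag stands in for the script being killed before it reaches its last step. A real action would be whatever tool the rule runs.

```python
from pathlib import Path
from typing import List


def run_tricky_action(tree_dir: Path, interrupt_before_cleanup: bool = False) -> None:
    """Mimics the four steps above. All file names are hypothetical."""
    # 1. Create the tree artifact directory.
    tree_dir.mkdir(parents=True, exist_ok=True)

    # 2. Write transient files inside the tree artifact directory.
    temp_files = [tree_dir / f"scratch{i}.tmp" for i in range(2)]
    for path in temp_files:
        path.write_text("intermediate data\n")

    # 3. Write the real outputs inside the same directory.
    for i in range(3):
        (tree_dir / f"out{i}.txt").write_text(f"output {i}\n")

    # Simulate the process being killed right before the cleanup step.
    if interrupt_before_cleanup:
        return

    # 4. Remove the transient files immediately before terminating.
    for path in temp_files:
        path.unlink()


def list_tree(tree_dir: Path) -> List[str]:
    """Returns the sorted names of the entries directly under tree_dir."""
    return sorted(p.name for p in tree_dir.iterdir())
```

When the sketch runs to completion, only the `out*.txt` files remain. When it is interrupted, the `scratch*.tmp` files stay in the directory next to the real outputs.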

The following figure depicts these steps as they happen when the action is run remotely and locally by the dynamic strategy:

The important boxes in this figure are those labeled Delete temp files. Their position relative to the other steps is also important.

Now let’s consider what happens when the dynamic scheduler decides to cancel each of these separate paths, and especially what happens if we are unlucky enough to stop the spawned script before it is able to do its final Delete temp files phase.

For the remote branch, the whole action and script run remotely, which means all real and transient outputs are written to a disk on a remote machine. Because we have cancelled the action and we haven’t even started downloading any of its outputs, there is nothing to worry about: it’s impossible for Bazel to ever encounter the transient files even if they were left behind.

For the local branch, things are different… Remember that we are running with the local unsandboxed strategy for performance reasons, so the real outputs as well as the transient outputs were written to the output tree. If we kill the spawned script before it gets a chance to delete the transient files, those will stay behind in the output tree.

And that’s a problem. Because then, when the remote branch continues and fetches the outputs, the transient files will be there. After the downloads, Bazel will sanity-check via ActionMetadataHandler#getTreeArtifactValue, called from SkyframeActionExecutor#checkOutputs, that the on-disk collection of files matches what was downloaded from the remote worker… and encounter an inconsistency, thus crashing.
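As a rough illustration of the kind of check that trips here, the following Python sketch compares the files found on disk against the set of paths that were downloaded. This is not Bazel’s code (the real logic lives in the Java classes named above), and `check_tree_artifact`, `files_under`, and `TreeArtifactMismatch` are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, Set


class TreeArtifactMismatch(Exception):
    """Raised when the on-disk tree differs from the expected file set."""


def files_under(tree_dir: Path) -> Set[str]:
    """Collects all regular files under tree_dir as relative POSIX paths."""
    return {
        p.relative_to(tree_dir).as_posix()
        for p in tree_dir.rglob("*")
        if p.is_file()
    }


def check_tree_artifact(tree_dir: Path, downloaded: Iterable[str]) -> None:
    """Fails if the directory holds anything other than the downloaded files."""
    expected = set(downloaded)
    on_disk = files_under(tree_dir)
    if on_disk != expected:
        unexpected = sorted(on_disk - expected)
        missing = sorted(expected - on_disk)
        raise TreeArtifactMismatch(
            f"unexpected={unexpected} missing={missing}"
        )
```

A leftover transient file from the cancelled local branch shows up in the `unexpected` list, which is the inconsistency described above.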

The solution is easy, though it required a few iterations to get it in the right place. Simply put, we improved the local strategy’s cancellation code path to destroy any tree artifacts as part of its cancellation process. This way, we pretend that the tree artifact directories were never created and the remote strategy can proceed to fetch the outputs. You can see this in revision 0dd9ae90. (And you can also see in the diff some test code to implement a Starlark rule that generates a tree artifact.)
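In spirit, the fix amounts to something like the sketch below. It is Python, not Bazel’s Java, and `cleanup_after_local_cancel` is a hypothetical name, so treat it as an illustration of the idea and not as what the revision actually contains.

```python
import shutil
from pathlib import Path
from typing import Iterable


def cleanup_after_local_cancel(tree_artifact_dirs: Iterable[Path]) -> None:
    """Deletes every tree artifact directory the cancelled local run may have touched.

    After this returns, the output tree looks as if the local branch never
    created the directories, so the remote branch can populate them from scratch.
    """
    for tree_dir in tree_artifact_dirs:
        # The local process may have been killed before creating the directory.
        if tree_dir.exists():
            shutil.rmtree(tree_dir)
```

Removing the whole directory, instead of trying to guess which files are transient, keeps the cleanup independent of what the script writes.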

This concludes my deep dive into the dynamic scheduler. At least for now, until we have finished dealing with a remaining race condition, have tuned the implementation, have proved that it delivers even better performance than the legacy one, and the legacy implementation is gone.

Hope you have a happy New Year’s Eve if you happen to celebrate tonight. Bye bye 2010s. See you on the other side!

  1. Do you also think that the next decade starts in 2021, because there was no year 0? Good try, but dates are hard and some numbers do have special meaning to people. So I’m going to continue believing that the 2010s end today. ↩︎

  2. Except that you cannot declare a tree artifact from a genrule. Well, you can specify a directory as an output and it will work for local actions, but this will break when the genrule is run remotely. Tree artifacts can only be defined from Starlark rule implementations at this point… which is odd and needs fixing. ↩︎
