Here I am, confined to my apartment due to the COVID-19 pandemic and without having posted anything for almost two months. Fortunately, my family and I are still in good health, and I’m even more fortunate to have a job that I can do remotely without problems. Or can I?
For over a year, my team and I have been working on allowing our mobile engineers to work from their laptops (as opposed to from their powerful workstations). And guess what: now, more than ever, this has become super-important. Making our engineering workforce productive when working remotely is a challenge, sure, but it is also an amazing opportunity for the feature we’ve been developing.
Not everything is great though. We have been offering a workflow that can be used from laptops for a while, but we have traditionally not cared much about things like “bad” network connections. We had been focusing on supporting engineers that work on Mac laptops while in the office (maybe from one of our fancy cafes), but not really from home, where the network may be slow.
Except… today, working from home, the possibility of having a “bad” network connection is the common case, not the exception. All these little issues that show up when your network has high latency or low bandwidth? They are not subtle problems any more: they are the key differentiator between having a productive day and a miserable day.
Where am I going with all of this? I just want to tell you the tale of a bug fix, as usual.
One of the issues we encountered revolved around the simple act of starting the Bazel binary. A few users reported that, upon typing `bazel build ...`, the macOS terminal (and sometimes even the browser!) would freeze for so long that they would get annoyed and reboot their computer. Incidentally, that wouldn’t fix the problem, and once you understand all the pieces involved in the process, that’s not surprising.
As you may know, here at Google we store all of our codebase under a monorepo—and that includes all the binary tools that are needed to bootstrap a build. One of these tools is the Bazel binary itself, which we put under a path like `tools/bazel`.
As you may also know, our source tree is so big that you cannot simply “clone” it onto your computer: we have a FUSE file system that exposes a lazy view of this tree. Whenever you access a source file (or a build tool) exposed by this system, the FUSE daemon lazily fetches the file from the network, stores it in a local cache, and then behaves as if the file had always been there.
With these pieces of information, you can start piecing together what was going on:
- The user types `bazel build ...`.
- The shell runs an `exec(2)` system call on `tools/bazel`, which transfers control to the kernel. This is game over.
- The kernel contacts the FUSE daemon to access `tools/bazel`.
- The FUSE daemon does not find the binary in its cache, so the daemon starts fetching the binary from the network.
- The network is very slow, so downloading the multi-MiB binary from the network takes minutes.
- The kernel waits for the FUSE daemon to respond. In the meantime, the terminal application queries the name of the running process to update its title bar and gets stuck because the kernel is still waiting for the `exec(2)` system call to start the process.
- Eventually, the FUSE daemon finishes the download and unblocks the `exec(2)` system call, which in turn unblocks the terminal and whichever other applications got stuck. That is, if the user hadn’t given up earlier and started taking drastic measures.
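The stuck step can be reduced to a tiny sketch. The function name below is made up for illustration; `tools/bazel` stands for the checked-in binary:

```shell
# Hypothetical reduction of the failure to its essence; the function
# name is mine, not part of the real tooling.
run_from_tree() {
    tree="$1"; shift
    # exec(2) transfers control to the kernel.  The kernel cannot start
    # the new process until it has read the binary, so with a cold FUSE
    # cache this single call blocks for the whole network download.
    cd "${tree}" && exec tools/bazel "$@"
}
```

Nothing in this call gives us a chance to print a message or react to the user: once `exec(2)` starts, we are spectators.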
Yikes. This sounds like a nasty problem because, as soon as we do `exec(2)` on a binary that was not cached, we are at the mercy of the network connection. If the network is slow, that operation can take forever and we may leave the system in an unusable state.
How do we fix this? At first glance, it doesn’t seem possible—but we can try to prevent entering this situation in the first place. What if we don’t allow `exec(2)` to get stuck on a network call? If we can ensure that the binary is always in the FUSE daemon’s cache before we attempt the `exec(2)`, the execution should not stall.
Easy in theory, but how do we make this happen? We need an intermediate step between the user’s command invocation and the binary’s execution: a place where we can intercept the call and perform additional operations.
Fortunately, in the sequence of steps above, I omitted one important detail. You may have noticed that I said: the user types `bazel build ...` and then we get stuck running `tools/bazel`. Wait, how did that happen? Do we have `./tools` in the `PATH`? No, of course not!
What we have is a little wrapper that we install under `/usr/local/bin/bazel`, outside of FUSE. This wrapper’s sole purpose is to verify that we are under a Google source tree, to check that `tools/bazel` exists, and to then delegate execution to the checked-in binary. This wrapper is the perfect place to put our fix: we can change it to prefetch the binary’s contents without executing it.
The fixed workflow looks like this:
- The user types `bazel build ...`.
- The shell executes `/usr/local/bin/bazel build ...`.
- The wrapper opens `tools/bazel` for read and scans through the whole file in small chunks. If reading the file takes longer than a couple of seconds, we now have a chance to print a warning to let the user know that we are waiting for a network download.
- Once we finish reading the whole file, we know that the file is in the FUSE daemon’s cache and in the kernel page cache.
- The wrapper runs an `exec(2)` of `tools/bazel`, which now can complete without blocking.
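Putting those steps together, the wrapper might look roughly like this. This is a simplified sketch, not our actual code: the chunk size, timeout, and messages are all made up for illustration.

```shell
# Simplified sketch of the /usr/local/bin/bazel wrapper; all constants
# and messages here are illustrative, not the real implementation.
bazel_wrapper() {
    binary="tools/bazel"

    # Step 1: verify that we are inside a source tree that carries the
    # checked-in binary; otherwise there is nothing to delegate to.
    if [ ! -x "${binary}" ]; then
        echo "bazel: cannot find ${binary}; are you in a source tree?" 1>&2
        return 1
    fi

    # Step 2: prefetch.  Reading the whole file in small chunks forces
    # the FUSE daemon to pull it into its local cache (and the kernel
    # page cache).  Do it in the background so that we can warn the
    # user if the read is taking long, and so that Ctrl+C still works.
    dd if="${binary}" of=/dev/null bs=65536 2>/dev/null &
    reader=$!
    waited=0
    while kill -0 "${reader}" 2>/dev/null; do
        if [ "${waited}" -ge 2 ]; then
            echo "bazel: fetching ${binary} from the network..." 1>&2
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "${reader}"

    # Step 3: the binary is now fully local, so this exec(2) should
    # not stall on the network.
    exec "${binary}" "$@"
}
```

Note that the prefetch loop, unlike the raw `exec(2)`, keeps the wrapper in control: it can time the reads, print progress, and die cleanly on Ctrl+C.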
Voila. By trading the `exec(2)` call for a controlled read of the file, we avoid losing control to the kernel. This prevents the terminal from getting stuck, allows us to notice slow connections, and lets us honor Ctrl+C requests from the user.
On the other hand, these reads might sound like wasted work. In practice, however, they are not a big problem. When the local cache is cold, this prefetching operation is actually doing useful work because it’s bringing the remote data onto the machine and warming up the kernel’s page cache. When the local cache is warm, all reads complete in a very short amount of time: at worst they do the real work of bringing the binary into memory, and once it’s in the page cache, the subsequent execution will start faster.
But yes, it’s somewhat wasted work. In an ideal world, we’d have a side channel to contact the FUSE daemon and ask it to prefetch the file in one go, returning quickly if the file is already in its local cache. If we had that feature, we wouldn’t need to read the whole file through the FUSE layer. This would be a pretty reasonable and simple feature to implement, but the downside is that it would increase the coupling between our trivial wrapper and the FUSE file system. In any case, we may look into this if the prefetching turns out to be a problem, but I doubt it will.
To conclude: remember that a very subtle problem that was never a big deal became a real one when the situation changed. And what sounded like an unfixable bug turned out to be tractable once we understood all the pieces involved.