Blaze, the variant of Bazel used internally at Google, was originally designed to build the Google monorepo. One of the beauties of sticking to a monorepo is code reuse, but this has the unfortunate side effect of dependency bloat. As a result, Bazel and Blaze have evolved to support ever-larger pieces of software.
The growth of the projects built by Bazel and Blaze has had the unsurprising consequence that our engineers all now have high-end workstations with access to massive amounts of distributed resources. And, as you can imagine, this has had an impact on the design of Blaze: many chunks of our codebase can (and do) assume that everyone has powerful hardware. These assumptions break down as soon as you move into Bazel’s open source land: while we cannot know where the product really runs, we can safely assume that it is being used on slower hardware.
Another situation where these assumptions break is when you try to make Blaze run decently on lower-powered machines. Think: laptops, which are not as powerful as workstations but are as powerful as they’ve ever been. This is precisely the work I’m doing these days, and these efforts will spill into Bazel. Mind you, they already have, and in this post I’d like to focus on one of my recent changes.
In order to execute an action (e.g. compiling a single source file or linking the final binary) on macOS, Bazel must first determine where the right Xcode SDKs are. To do so, Bazel runs the xcrun tool and parses its output. To avoid the penalty of an extra fork+exec per action, Bazel caches the SDK locations so that the costly discovery process runs only once per SDK type/version.
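To make the idea concrete, here is a minimal Python sketch of "run xcrun once per SDK type/version and remember the answer". This is not Bazel's actual code (Bazel is written in Java and its implementation differs); the names query_xcrun and SdkLocationCache are hypothetical, and the xcrun invocation shown is just one plausible way to ask for an SDK path.

```python
import subprocess
from typing import Callable, Dict, Tuple


def query_xcrun(sdk: str, version: str) -> str:
    """Ask xcrun for an SDK path, e.g. sdk="macosx", version="10.13".

    Only works on macOS with Xcode installed, and costs one fork+exec.
    """
    result = subprocess.run(
        ["xcrun", "--sdk", f"{sdk}{version}", "--show-sdk-path"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


class SdkLocationCache:
    """Hypothetical cache: runs the costly query once per (sdk, version)."""

    def __init__(self, query: Callable[[str, str], str] = query_xcrun) -> None:
        self._query = query
        self._locations: Dict[Tuple[str, str], str] = {}

    def locate(self, sdk: str, version: str) -> str:
        key = (sdk, version)
        if key not in self._locations:
            # Cache miss: pay for the subprocess exactly once for this key.
            self._locations[key] = self._query(sdk, version)
        return self._locations[key]
```

The query function is injectable so that the caching logic can be exercised on machines without Xcode.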
The problem is that… due to initial assumptions that never materialized, this cache was stored on disk. As a result, each action had to open the cache file, load its contents, parse them, and decide what to do. Not a big deal, you may think… but consider what happens when you have a medium-size build with 20,000 actions: assuming a mere 1ms overhead per action, you get a total of 20 extra seconds for the whole build. Divide by the number of cores and you get a more accurate picture of the penalty: on a powerful 12-core workstation, that’s less than 2 extra seconds of wall time, which is nothing to lose sleep over. But on a 2-core laptop, that’s 10 extra seconds… which, you know, maybe deserves some optimization.
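The back-of-the-envelope estimate above is easy to reproduce. The helper below is only an illustration (the function name is made up), and it assumes the overhead spreads perfectly across all cores, which a real build will not quite achieve.

```python
def extra_wall_seconds(actions: int, overhead_ms: float, cores: int) -> float:
    """Estimate added wall time when every action pays a fixed overhead.

    Assumes perfect parallelism: total overhead divided evenly by core count.
    """
    total_seconds = actions * overhead_ms / 1000.0
    return total_seconds / cores


# The two machines from the text: a 12-core workstation and a 2-core laptop.
for cores in (12, 2):
    seconds = extra_wall_seconds(actions=20_000, overhead_ms=1.0, cores=cores)
    print(f"{cores} cores: {seconds:.1f} extra seconds")
```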
To fix this, I moved the “Xcode locations” cache from disk to memory: there really was no reason to persist it at all. This simple reshuffling cut about 3 system calls (open, read, close) per action. Based on performance tests, this translated to a ~1% reduction of the original 4,000 seconds said build took on my Mac Pro 2013. Doing some more math: that’s about 40 seconds spread over 20,000 actions, or about 2ms less per action.
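The difference between the two designs can be sketched in a few lines of Python. Again, this is an illustration under my own assumptions and not the real change: the class names, the JSON file format, and the SDK path are all hypothetical. The disk-backed version pays for an open, a read, and a close (plus parsing) on every lookup, while the in-memory version is a plain dictionary access.

```python
import json
import os
import tempfile
import time
from typing import Dict


class DiskBackedCache:
    """Re-opens and re-parses the cache file on every single lookup."""

    def __init__(self, path: str) -> None:
        self._path = path

    def lookup(self, key: str) -> str:
        # open + read + close happen here, once per call.
        with open(self._path, "r", encoding="utf-8") as f:
            entries: Dict[str, str] = json.load(f)
        return entries[key]


class InMemoryCache:
    """Keeps the entries in a dictionary; lookups never touch the disk."""

    def __init__(self, entries: Dict[str, str]) -> None:
        self._entries = dict(entries)

    def lookup(self, key: str) -> str:
        return self._entries[key]


def time_lookups(cache, key: str, count: int) -> float:
    """Return the seconds spent doing `count` lookups of `key`."""
    start = time.perf_counter()
    for _ in range(count):
        cache.lookup(key)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical cache contents, written to a throwaway file.
    entries = {"macosx10.13": "/hypothetical/MacOSX10.13.sdk"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sdk_cache.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(entries, f)
        for cache in (DiskBackedCache(path), InMemoryCache(entries)):
            elapsed = time_lookups(cache, "macosx10.13", 20_000)
            print(f"{type(cache).__name__}: {elapsed:.3f}s for 20,000 lookups")
```

On an idle machine the gap between the two will look small; the point of the post is that it stops being small when the file system is under heavy load.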
“But 2ms for just 3 system calls is crazy!” you may say. And you are right: system calls cannot possibly be that costly. But macOS isn’t proving to be super-efficient under extreme load… and Bazel likes to inflict pain on the machine. (Anecdotally, because I don’t yet have good data to back it up, I’ve observed file-system-related activity slow down by 10x under these conditions.)
What am I getting to? Whatever you do in the critical path is important for performance reasons, obviously. The problem is that, sometimes, it’s hard to know what the critical path even is. And when you do know, it’s too easy to disregard counting system calls as “premature optimization” when it actually isn’t.
This all brings to mind Raymond Chen’s “Why wasn’t the Windows 95 shell prototyped on Windows NT?” post. In it, Raymond explains that Windows engineers had to use Windows’ recommended hardware in order to develop the system and that they were generally not allowed access to more powerful machines. Performance dogfooding, basically, which should be applied to oh-so-many products out there…