Since the publication of Bazel a few years ago, users have reported (and I myself have experienced) general slowdowns when Bazel is running on Macs: the window manager stutters, the web browser cannot load new pages, and so on. Similarly, after the introduction of the dynamic spawn scheduler, some users reported slower builds than pure remote or pure local builds, which made no sense.
All along we guessed that these problems were caused by Bazel’s abuse of system threads, as it used to spawn 200 runnable threads during analysis and used to run 200 concurrent compiler subprocesses. We tackled the problem by reducing Bazel’s abuse of system resources (e.g. commit ac88041)… and while we saw an improvement, the issue remained.
So we collectively chalked these slowdowns up to macOS being bad under heavy load. But I refused to believe this: I have witnessed Macs experience very high load (for example via cargo build, via NetBSD’s build.sh, and via Lightroom) without noticing this misbehavior, so I’ve been chasing another root cause on and off since then.
The underlying cause
Kinda by coincidence, a coworker recently mentioned the existence of QoS service classes in Darwin. From his point of view, this was interesting because those would let us reduce battery consumption on laptops (and that was the internal-only bug he filed, which I later paraphrased in issue #7446).
But a loud alarm fired in my head. QoS involves scheduling and scheduling involves choosing some things over others based on a priority. If we can set QoS levels on a thread basis, then surely this also affects performance in some ways we didn’t know. And maybe this explained why nice 20 didn’t make a difference. Right? Right? And indeed it does.
I started investigating by running the following¹ during a build:
sudo powermetrics --show-process-qos --samplers tasks -n 1 | head -n 30
and quickly noticed that Bazel was running at a higher QoS class than the system services it requires to function efficiently. In particular, Bazel was running at the Default class, and services like srcfs (the FUSE file system that serves the Google monorepo) and OpenVPN (which we require to contact remote execution) were running at the (lower) Utility and Background priorities respectively.
Uh oh: Bazel’s overloading of resources prevented the system from scheduling important services, some of which are required by the kernel to respond quickly (e.g. FUSE) or by applications to avoid dropping network connections (e.g. OpenVPN).
This is when I started to believe that, maybe, just maybe, Darwin is actually pretty good at scheduling and we are the ones who set up us the bomb.
Was this theory true? To test it, I quickly hacked some changes into Bazel to make it lower its own priority class to the lowest one, Background, and tried a build. And lo and behold, the build using dynamic scheduling completed quickly enough and the machine remained as responsive as ever. No UI stuttering, no network latency increases: just a perfectly functioning machine.
But this wasn’t a great solution: Background, being the lowest priority, has no scheduling guarantees. In fact, I noticed that the machine remained about 25% idle during a build with this setting so this was no good—builds would be slower overall even if we kept the system snappy, and developers always complain about slow builds no matter how fast you make them (I do too, of course).
A better solution was to fix any system services running as Background to run, at least, at Utility in order to ensure their scheduling. This was easy enough to do via launchd settings. And with this done, I had to lower Bazel to that same priority² to not starve the services it needs.
Making this happen in Bazel was hard, though, and the result is still far from perfect, but it seems good enough for now.
As you may know, Bazel is implemented as a client/server application where the server is a background process that remains running on your machine and the client “just” proxies I/O to the console. As it turns out, this architecture is beneficial in our situation because the only way I’ve found³ to change the QoS service class of an application (and thus change the default class of all of its threads and subprocesses) is via a Darwin-specific extension to posix_spawn.
Given that the client already has to spawn the server as a separate process, it made sense to just use this extension to run the server at the right priority. But this was not easy: the code used the fork + exec combination and went to great lengths to make this work reliably under a multi-threaded scenario, though no matter how careful it was, it was still buggy. So the first step was to convert this piece of logic to posix_spawn (commit 18a0e23), which in itself allowed me to move a lot of the tricky logic into a separate single-threaded subprocess.
With that complex surgery done, setting the QoS service class of the server process under macOS was trivial (commit 0877340) and easy to review.
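To make the shape of that conversion concrete, here is a minimal Python sketch of launching a child with posix_spawn instead of fork + exec. It is an illustration only, not Bazel's code: the spawn_and_wait helper is hypothetical, and a child Python process stands in for the server binary. Python's os.posix_spawn does not expose the Darwin QoS attribute, so that part is not shown; in C, the class would presumably be set on the spawn attributes with posix_spawnattr_set_qos_class_np from macOS's spawn.h, but the post does not name the call, so treat that as an assumption.

```python
import os
import sys


def spawn_and_wait(argv):
    """Start argv[0] via posix_spawn and return the child's exit code.

    Hypothetical helper for illustration; not taken from Bazel.
    """
    # posix_spawn creates the child in a single call, so none of the
    # parent's own code runs in the child between "fork" and "exec".
    # That window is where multi-threaded parents tend to get into trouble.
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))

    # Wait for the child and translate the raw wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # Killed by a signal: report it as a negative number.
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    # Stand-in for the real server binary (an assumption for this sketch).
    code = spawn_and_wait([sys.executable, "-c", "print('server started')"])
    print("child exited with", code)
```

Because the child is described entirely by arguments and attributes passed to one call, per-process settings such as a QoS class fit naturally as one more attribute, which is likely why the follow-up change could be so small.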
Considering the simplified solution we have now, did it make a difference? You bet!
I can finally run Bazel builds on my 2-core laptop without killing it. But that’s not very impressive, right? What is impressive are the improvements we are seeing in builds that use dynamic scheduling: users who reported 30-minute-long builds in the past are now claiming 5-minute-long builds. And all because we don’t starve network-related services.
Pretty cool, huh? Preliminary results look very promising but we still have some issues left to address. For example: Bazel’s own resource abuse can still compete with other services at its same priority, but that’s a more “normal” problem to tackle.
The takeaway is simple: if you develop for macOS, be very aware of QoS service classes.
¹ The amount of instrumentation within Darwin to troubleshoot power-consumption issues is astonishing and is probably the reason why iPhones last so long on a single battery charge. Or at least that’s what I heard from a former Apple engineer, and this seems to confirm it.
² Making Bazel and the services it depends on run at Utility priority works but is probably not the greatest solution. The best solution would be to switch to XPC for inter-process communication because, then, the QoS level of an operation would flow through all components. For example: if Bazel, running at Default, needed to open a file served by srcfs, srcfs would respond at the same level instead of a lower one, ensuring that the end-to-end request was handled properly. But the question is: how do you make this happen in FUSE? And how do you implement this while keeping the code portable?
³ How did I find the posix_spawn extension, you ask? Well, I noticed that launchd can tune the class of the daemons it spawns, so surely there had to exist an API in the system to do so. A ripgrep call away over /Applications/Xcode.app/ quickly yielded the relevant system primitives. Yes yes, don’t depend on undocumented APIs blah blah blah, but the ones I found do have docstrings in them! Ugh.