Bazel’s original raison d’être was to support Google’s monorepo. A consequence of using a monorepo is that some builds will become very large. And large builds can be very resource hungry, especially when using a tool like Bazel that tries to parallelize as many actions as possible for efficiency reasons. There are many resource types in a system, but today I’d like to focus on the number of open files at any given time (nofiles).
macOS’s default soft limit for nofiles is pretty low: on all the systems I’ve checked, this is set to 256. Fortunately, the default hard limit is unlimited, so unprivileged users can bump their own limits to support large builds. Having to tweak limits by hand is ugly, so in July 2017 I made Bazel raise its own soft limit to the hard limit in commit a96369c.
This automatic unlimiting wasn’t trivial because of another cap in macOS: the kern.maxfilesperproc sysctl setting. Neither the soft nor the hard limit can be set to a value higher than this kernel cap… but it’s possible for them to already exceed the cap (which is strange and maybe a bug in Darwin). To account for this, I made Bazel raise the soft limit only as far as the minimum of the hard limit and the kernel cap.
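The clamping rule is easier to see in code. What follows is a minimal Python sketch of the idea, not Bazel’s actual implementation. The function names are made up for illustration, reading the cap by shelling out to sysctl(8) is just one convenient way to do it, and the decision to never lower an already-higher soft limit is my own assumption.

```python
import resource
import subprocess
import sys


def pick_soft_limit(hard, kernel_cap):
    """Return the highest soft limit we may request: min(hard, kernel_cap)."""
    if hard == resource.RLIM_INFINITY:
        # An unlimited hard limit still cannot exceed the kernel cap.
        return kernel_cap
    return min(hard, kernel_cap)


def read_kernel_cap():
    """Read kern.maxfilesperproc with sysctl(8). Only meaningful on macOS."""
    out = subprocess.run(
        ["sysctl", "-n", "kern.maxfilesperproc"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip())


def raise_nofiles(kernel_cap):
    """Raise the soft nofiles limit of this process; never lower it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = pick_soft_limit(hard, kernel_cap)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # Already at or above the target: leave it alone.
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__" and sys.platform == "darwin":
    print(raise_nofiles(read_kernel_cap()))
```

Keeping pick_soft_limit free of system calls makes the rule itself trivial to unit-test on any platform, while the two helpers that touch the system stay small.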
Things seemed fine for a while (more than a year!): common build failures vanished after this fix and I assumed this issue was fixed for real. Except… this came back to bite us again recently. Under some adverse conditions, some subprocesses were failing with errors indicating that they had exceeded their nofiles limit. And this was strange because the kernel cap is a pretty high value.
When I got this bug report, I quickly wrote a genrule (a custom build action in Bazel) to print its own view of ulimit -a -H… and found that the nofiles limit was set to 10240. But the kernel cap was 24576 and the hard limit before starting Bazel was unlimited. Where was this number coming from? There were no matches of it in the Bazel source tree…
A bit of digging (I can’t remember right now how I went about it) soon surfaced something pretty interesting: HotSpot, the JVM implementation, has similar logic in its own startup process. This logic is enabled by default and is controlled via the -XX:+MaxFDLimit option. And guess what: on Darwin only, and because the setrlimit(2) manpage says so, this logic restricts the nofiles soft limit to the OPEN_MAX constant, which, drumroll, happens to be 10240.
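To see why 10240 won over the higher value Bazel had asked for, here is a toy Python model of the startup logic as described above. It is a simplification written for this post, not HotSpot’s code: the function name and the UNLIMITED stand-in are made up, and the model assumes the JVM raises the soft limit to the hard limit and then, on Darwin, clamps it to OPEN_MAX.

```python
OPEN_MAX = 10240  # Value of OPEN_MAX on macOS, as quoted in the text.
UNLIMITED = float("inf")  # Stand-in for RLIM_INFINITY in this toy model.


def hotspot_soft_limit(hard, is_darwin, max_fd_limit=True):
    """Model the soft limit chosen at startup; None means "left alone"."""
    if not max_fd_limit:
        # -XX:-MaxFDLimit: the JVM does not touch the limits at all.
        return None
    if is_darwin:
        # Darwin-only clamp to OPEN_MAX.
        return min(hard, OPEN_MAX)
    return hard


# Hard limit unlimited, as it was before starting Bazel:
print(hotspot_soft_limit(UNLIMITED, is_darwin=True))
```

The important detail is that the result does not depend on the soft limit the process started with, so a parent that had already raised the limit beyond OPEN_MAX sees it replaced.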
With this knowledge in mind, the solution was easy: pass the opposite option, -XX:-MaxFDLimit, to the JVM so that our own higher limit is left untouched. See commit 30dd871. And of course, the hardest part of the solution involved adding an integration test to ensure that what subprocesses see is actually what we wanted.
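The core of such a check is asking a child process what limit it actually inherited. Below is a small Python sketch of that idea. It is not the integration test from the commit: the helper names are hypothetical, and it spawns a plain shell from the current process rather than running an action under Bazel.

```python
import resource
import subprocess


def child_soft_nofiles():
    """Return the soft nofiles limit as reported by a freshly spawned shell."""
    out = subprocess.run(
        ["sh", "-c", "ulimit -S -n"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # The shell prints "unlimited" when there is no limit.
    return resource.RLIM_INFINITY if out == "unlimited" else int(out)


def check_child_sees(expected):
    """Fail loudly if the child's view of the limit is not what we wanted."""
    actual = child_soft_nofiles()
    if actual != expected:
        raise AssertionError(
            f"child sees nofiles={actual}, expected {expected}")


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    check_child_sees(soft)
    print("child inherits soft nofiles limit:", soft)
```

Limits are inherited across fork and exec, so any mismatch between the parent’s intended value and what the child reports points at something in between rewriting them, which is exactly what the JVM was doing here.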