As you might have read elsewhere, I’m leaving the Bazel team and Google in about a week. My plan for these last few weeks was to hand things off as cleanly as possible… but I was also nerd-sniped by a bug that came my way a fortnight ago. Fixing it has been my self-inflicted punishment for leaving, and oh my, it has been painful. Very painful.
Let me tell you the story of this final boss.
The problem
A few weeks ago, an engineer in the Abseil team approached me because they were trying to do an optimization to their C++ library (absl
). Their problem was that all of the tests at Google passed… except for the tests that verify Blaze (not Bazel) on macOS, which crashed in an obscure way. I didn’t have a lot of time to look into the problem myself; but, thankfully, trybka@—one of our C++ experts—did and noticed what was wrong by paying close attention to a failing stack trace:
C [libvfs.dylib+0x1beb24] absl::DeadlockCheck(absl::Mutex*)+0x114
C [libvfs.dylib+0x1b9b22] absl::DebugOnlyDeadlockCheck(absl::Mutex*)+0x32
C [libvfs.dylib+0x1b9a4c] absl::Mutex::Lock()+0x1c
C [libvfs.dylib+0x12e67f] base::internal::VLogSiteManager::UpdateGlobalV(int)+0x1f
C [libvfs.dylib+0x12f4bf] $_0::operator()() const+0x2f
C [libvfs.dylib+0x12f489] $_0::__invoke()+0x9
**** JUMPING INTO ABSL IN A DIFFERENT DYLIB ****
C [network.dylib+0xc285ba] absl::flags_internal::FlagImpl::InvokeCallback() const+0x6a
C [network.dylib+0xc28dcd] absl::flags_internal::FlagImpl::SetCallback(void (*)())+0xad
C [network.dylib+0x54ea74] absl::flags_internal::FlagRegistrar<int, true>::OnUpdate(void (*)()) &&+0x24
**** JUMPING BACK TO ABSL IN THE ORIGINAL DYLIB ****
C [libvfs.dylib+0x136cd0] __cxx_global_var_init+0x30
C [libvfs.dylib+0x136d79] _GLOBAL__sub_I_vlog_is_on.cc+0x9
In the stack trace above, you can see how control flow jumps back and forth between two different shared libraries (dylib
s in this post, as I’m focusing on macOS), and how the calls that cross the boundaries are all within Abseil.
Now, you might think: that should be fine. If the two shared libraries we have above, libvfs.dylib
and libnetwork.dylib
, call into another common one, Abseil, the dynamic linker will not load the common library twice. And you’d be right… except… Bazel and Google love building static binaries. There is no libabsl.dylib
in there: the absl
symbols are duplicated across the two shared libraries. Visually, here is what we had in the executable:
And thereby lies the problem: Blaze on Mac was pulling in three separate dylib
s, two of which bundled Abseil and conflicted at runtime. This, in turn, was blocking the submission of the optimization to Abseil I mentioned, which was supposed to bring significant improvements Google-wide. Time was of essence to fix this issue. (Except it wasn’t because the author of the change found a different way to write their code and avoid the issue, but the issue had already gotten my interest.)
The basics behind JNI
Before getting into how to resolve this issue, let’s pause for a second to understand what’s involved in getting JNI code to work with Bazel.
JNI is nothing magical: when you declare a Java function as native
, the JVM will transform calls to that function as calls to a C function whose name is derived from the class that contains the native
method plus the method’s name. For example, if we have a Java function definition like this:
package com.google.devtools.build.lib.unix;
public class NativePosixFiles {
public static native String readlink(String path) throws IOException;
}
The JVM will look for a matching C symbol with the following definition. Note how the symbol name encodes the Java package name, the class name, and the method name in a straightforward way (unlike C++’s name mangling):
extern "C" JNIEXPORT jstring JNICALL
Java_com_google_devtools_build_lib_unix_NativePosixFiles_readlink(
JNIEnv *env, jclass clazz, jstring path)
{
// Native implementation for the NativePosixFiles#readlink(String) method.
}
The way the native function is invoked is irrelevant, but it probably happens via dlsym(3)
. The important thing, however, is that the JVM must have access to these symbols in one way or another, and that the lookup must happen at run time.
In standard Java, the way you do this is by calling System.loadLibrary()
or System.load()
on the native libraries. The former call searches for a library, using operating system-specific naming schemes, in the directories specified by the java.library.path
property. The latter call simply loads the given file, although, surprisingly, the file’s base name must also follow the operating system’s conventions.
But Bazel offers another mechanism in the Java rules. By using a custom Java launcher, you can handle native libraries in a different way, and that’s what we do inside Google. We have a launcher that bundles all native libraries of the Java target into a single object, then links that object into the JRE binary, and then attaches the JRE to the JAR to form a “self-executable” JAR. This final executable does not need to do any System.loadLibrary()
calls to manually pull in JNI dependencies: they are all already in the JVM’s address space because they were put there at binary link time.
Very nice… except we only do that for Blaze on Linux, which means neither Blaze on Mac nor Bazel—on any platform—benefit from this. And this is where things get nasty: our internal-only “common case” is magical, but everything else has grown workarounds to deal with JNI… and they all have problems.
Initial assessment: SNAFU
Here are the many ways we had in Bazel to load the JNI code before any of my changes took place (that is, at or before commit e36e9da
), and the many little details that got in the way of cleaning things up:
The Google-internal Java launcher for Blaze on Linux. Blaze on Linux needn’t worry about JNI at all as its native code is transparently loaded.
The
UnixJniLoader
class, which was explicitly called in Bazel from non-Windows platforms and in the Blaze on Mac case.The
WindowsJniLoader
class, which was explicitly called in Bazel from the Windows platform.The
JniLoader
abstract class, which was a more recent attempt at making JNI code loading more transparent but was only used in a bunch of new classes.Some arbitrary
System.loadLibrary
calls throughout the code in places where it was known that JNI was unconditionally required. I’m not sure if any of these were in Bazel itself, but the Google-internal modules that required JNI indeed did this—and they did so, especially, to load the non-publiclibnetwork.dylib
andlibvfs.dylib
.Checks for an
io.bazel.enableJni
JVM property to determine whether to load the JNI code or not. This property was added to aid the few use cases that could run without JNI (e.g. the bootstrapping process), although it was not correctly recognized throughout the codebase. In particular, Windows unconditionally needed JNI, whereas Unix platforms could work without it.Weird hacks to transform
.so
libraries into.dylib
s for macOS’s consumption.A futile attempt at simplifying JNI usage throughout the codebase by adding the
target-os-unix-native-lib
and thetarget-os-native-lib
targets, which in retrospect made things even more confusing.Knowledge in the top-level Bazel binary packaging rules of the JNI libraries to bundle them into the final self-extracting executable, with extra logic to tell the JVM where they are via the
java.library.path
property.
I’m sure I’m forgetting some interesting details, but you can see from this list that there are a ton of “corner cases” to consider in order to clean things up. I mean… things were really messed up.
There is one thing that did work reasonably well in this design though, and that was the fact that unit tests didn’t have to worry about whether their dependencies needed JNI or not. This wasn’t bullet-proof though: some tests and build rules did have to know about JNI and had to either disable it with io.bazel.enableJni=0
or put the shared libraries in the right place. A very fragile scenario, indeed.
First cleanup attempt
Armed with this knowledge, which was more or less present in my mind before I started but which wasn’t complete until I finished, I started trying to tackle the problem.
IMPORTANT: The key observation to resolve this problem is that, at build time, there is zero coupling between building the Java code that defines native
methods and building the native C code that supports them. This is bad because neither you nor the compiler can’t know if things will work until you run them. This, however, is good because it gives us the freedom to build the native code separately from the Java code, and allows us to glue them together at run time.
My first thought was: let’s remove all dependencies from the Java rules to the native libraries. Then, let’s add a single cc_library
target that generates a single native library by pulling all individual bits as deps
. And, finally, let’s add a trivial System.loadLibrary
statement to Bazel’s and Blaze’s main
function to load their own single artifact.
I prototyped this solution and was excited by the huge amount of garbage I could remove. The final solution was elegant, very easy to understand, and both Blaze and Bazel both worked… except… tests failed left and right. You see: tests don’t invoke Bazel’s main
entry point so they didn’t get a chance to load the needed JNI library in this design. I could have littered the test build graph with explicit dependencies on the JNI library and explicit calls to System.loadLibrary
, but that would have been awful. I was trying to simplify things, not make them worse!
So I had to scrap this idea and go back to the drawing board.
A better approach
The two constraints driving my search for a solution were simple:
Ensure we load a single shared library at run time to avoid name clashes.
I had to loosen this restriction a little bit because our build ends up pulling another shared library that’s out of our control: Netty’s
kqueue
andepoll
native bindings. Upon careful examination, I saw that this library is trivial and doesn’t have any dependencies, so we can ignore it for our purposes. But this highlights that the solution I came up with is not generic enough to work for arbitrary projects.Ensure that any Java target that depends on another target that needs JNI can do so without having to care, at all, that there is a JNI dependency anywhere.
In my opinion, this was an unnegotiable property of the solution. If you have a piece of code depending on something else, you should give zero effs on how that happens: the dependency should just work for you. If the dependency wants to pull in native code, so be it, but you ought not to know.
Simply put, the design I envisioned is depicted in the following diagram:
Simple, huh? We want a single shared library that bundles all individual native libraries at build time to remove duplicate symbols. We want to load that single library at runtime. And we want the contents of this library to support Google-internal extensions when building Blaze inside of Google, and to remain Google-agnostic when building Bazel outside.
Now, the question was: how do we get here, given all the many corner cases we have to deal with?
Steps towards the solution
The general thought process was: let’s try to isolate all JNI loading logic into a single Java package. Once we have that, we can try to make all other Java code depend on it and not have to worry about native dependencies. And to make that possible, we have to rely on extracting the JNI code from a Java resource instead of relying on the caller correctly preparing the java.library.path
property on entry.
More specifically, here are the steps I took:
Bring clarity into the situation by forcing all JNI load requests to happen via a common code path. The end goal for this was to isolate all JNI code in a single place, making it easier to see how to change it. See commit
50adefb
.Simplify the way the Windows code paths loaded JNI. This wasn’t a necessity as part of this project, but was raised as a suggestion during a code review… and I like simplicity, so I did it. See commit
9ce9088
.Now that we had all callers funneling their JNI requests via the
JniLoader
, there was no reason to keep the separateUnixJniLoader
andWindowsJniLoader
classes, so they could be merged. This highlighted some important differences between the two: namely, that Windows relies on the runfiles library clutch to load the JNI code, and that this was only done for testing purposes. Ugly. This had to go, but not yet. See commit7128580
.With the
WindowsJniLoader
gone, there was no reason to keep the separatewindows/jni
subpackage in the source tree… so I merged it into its parent. Again, unnecessary simplification as part of this project, but it was easy enough, so why not? See commitd69744d
.The key piece of this puzzle was replacing the way we actually look for and find the JNI library. Instead of relying on external actors such as the Bazel client, the bootstrap script, or unit tests to put the JNI library in
java.library.path
, we should be able to put the library inside a JAR resource and extract it at runtime. And it was possible. With this, all madness was gone and, finally, once and for all, theJniLoader
could work by trivially pulling it in as a dependency. See commita2a2f9d
.Modify the Google-internal fork of
src/main/native:unix_jni
to link in the previousnetwork
andvfs
targets so that thelibunix.dylib
becomes the only JNI library we have to load at run time. This change was trivial at this point: just a matter of changingBUILD
dependencies and replacing ad-hocSystem.loadLibrary()
calls with calls to our improvedJniLoader.loadJni()
. I cannot link to this change because it has no correspondence in Bazel though, given that Bazel hasn’t yet suffered from this issue (but it could!).
I like how this sequence highlights the importance of decoupling refactoring and cleanup work from the actual fixes. All of the steps in this list except for the last one were essentially no-ops and only laid the ground work to make the last step simple. And simple it was. Doing the work this way made it much easier for my reviewer to see what was going on and to assess that each step worked correctly. (Mind you, I actually broke the build somewhere in between two steps… and it was “easy” enough to fix it because the delta between changes was small.)
And with that, the original bug is now fixed!
Parting words
Executing this cleanup has been very frustrating as I briefly mentioned in the opening. Part of the reason is because of the many corner cases that had to be dealt with and how these were not clear upfront. But the other and major part of this came from the fact that Blaze and Bazel have divergent BUILD
files in some areas, and ensuring they are all consistent with each other is painful. Throw in long BazelCI turnaround times and the need to get things to work on three different platforms… and you are in for pain.
Anyhow. The bug is now fixed. But is what I outlined above the final solution? No, not really. What I’ve done is still an ad-hoc solution to Bazel’s own build process and use of JNI. Anyone writing and building Java code with Bazel also needs to interact with JNI at some point, and those developers are subject to the same issues I’ve described here. I don’t think we have a great answer for them yet. In other words: help welcome.
So long, and thanks for all the fish!