As you might have read elsewhere, I’m leaving the Bazel team and Google in about a week. My plan for these last few weeks was to hand things off as cleanly as possible… but I was also nerd-sniped by a bug that came my way a fortnight ago. Fixing it has been my self-inflicted punishment for leaving, and oh my, it has been painful. Very painful.

Let me tell you the story of this final boss.

The problem

A few weeks ago, an engineer in the Abseil team approached me because they were trying to do an optimization to their C++ library (absl). Their problem was that all of the tests at Google passed… except for the tests that verify Blaze (not Bazel) on macOS, which crashed in an obscure way. I didn’t have a lot of time to look into the problem myself; but, thankfully, trybka@—one of our C++ experts—did and noticed what was wrong by paying close attention to a failing stack trace:

C  [libvfs.dylib+0x1beb24]  absl::DeadlockCheck(absl::Mutex*)+0x114
C  [libvfs.dylib+0x1b9b22]  absl::DebugOnlyDeadlockCheck(absl::Mutex*)+0x32
C  [libvfs.dylib+0x1b9a4c]  absl::Mutex::Lock()+0x1c
C  [libvfs.dylib+0x12e67f]  base::internal::VLogSiteManager::UpdateGlobalV(int)+0x1f
C  [libvfs.dylib+0x12f4bf]  $_0::operator()() const+0x2f
C  [libvfs.dylib+0x12f489]  $_0::__invoke()+0x9
**** JUMPING INTO ABSL IN A DIFFERENT DYLIB ****
C  [network.dylib+0xc285ba]  absl::flags_internal::FlagImpl::InvokeCallback() const+0x6a
C  [network.dylib+0xc28dcd]  absl::flags_internal::FlagImpl::SetCallback(void (*)())+0xad
C  [network.dylib+0x54ea74]  absl::flags_internal::FlagRegistrar<int, true>::OnUpdate(void (*)()) &&+0x24
**** JUMPING BACK TO ABSL IN THE ORIGINAL DYLIB ****
C  [libvfs.dylib+0x136cd0]  __cxx_global_var_init+0x30
C  [libvfs.dylib+0x136d79]  _GLOBAL__sub_I_vlog_is_on.cc+0x9

In the stack trace above, you can see how control flow jumps back and forth between two different shared libraries (dylibs in this post, as I’m focusing on macOS), and how the calls that cross the boundaries are all within Abseil.

Now, you might think: that should be fine. If the two shared libraries we have above, libvfs.dylib and libnetwork.dylib, call into another common one, Abseil, the dynamic linker will not load the common library twice. And you’d be right… except… Bazel and Google love building static binaries. There is no libabsl.dylib in there: the absl symbols are duplicated across the two shared libraries. Visually, here is what we had in the executable:

Interaction between Blaze's Java code and the JNI libraries it pulls in on a macOS system. Note how the three separate JNI libraries that we load from Java are all able to reuse the foundational libraries provided by macOS because they are dynamically linked in... but Abseil shows up twice because it was statically linked into two separate shared libraries.

And thereby lies the problem: Blaze on Mac was pulling in three separate dylibs, two of which bundled Abseil and conflicted at runtime. This, in turn, was blocking the submission of the optimization to Abseil I mentioned, which was supposed to bring significant improvements Google-wide. Time was of essence to fix this issue. (Except it wasn’t because the author of the change found a different way to write their code and avoid the issue, but the issue had already gotten my interest.)

The basics behind JNI

Before getting into how to resolve this issue, let’s pause for a second to understand what’s involved in getting JNI code to work with Bazel.

JNI is nothing magical: when you declare a Java function as native, the JVM will transform calls to that function as calls to a C function whose name is derived from the class that contains the native method plus the method’s name. For example, if we have a Java function definition like this:

package com.google.devtools.build.lib.unix;

public class NativePosixFiles {
    public static native String readlink(String path) throws IOException;
}

The JVM will look for a matching C symbol with the following definition. Note how the symbol name encodes the Java package name, the class name, and the method name in a straightforward way (unlike C++’s name mangling):

extern "C" JNIEXPORT jstring JNICALL
Java_com_google_devtools_build_lib_unix_NativePosixFiles_readlink(
    JNIEnv *env, jclass clazz, jstring path)
{
    // Native implementation for the NativePosixFiles#readlink(String) method.
}

The way the native function is invoked is irrelevant, but it probably happens via dlsym(3). The important thing, however, is that the JVM must have access to these symbols in one way or another, and that the lookup must happen at run time.

In standard Java, the way you do this is by calling System.loadLibrary() or System.load() on the native libraries. The former call searches for a library, using operating system-specific naming schemes, in the directories specified by the java.library.path property. The latter call simply loads the given file, although, surprisingly, the file’s base name must also follow the operating system’s conventions.

But Bazel offers another mechanism in the Java rules. By using a custom Java launcher, you can handle native libraries in a different way, and that’s what we do inside Google. We have a launcher that bundles all native libraries of the Java target into a single object, then links that object into the JRE binary, and then attaches the JRE to the JAR to form a “self-executable” JAR. This final executable does not need to do any System.loadLibrary() calls to manually pull in JNI dependencies: they are all already in the JVM’s address space because they were put there at binary link time.

Very nice… except we only do that for Blaze on Linux, which means neither Blaze on Mac nor Bazel—on any platform—benefit from this. And this is where things get nasty: our internal-only “common case” is magical, but everything else has grown workarounds to deal with JNI… and they all have problems.

Initial assessment: SNAFU

Here are the many ways we had in Bazel to load the JNI code before any of my changes took place (that is, at or before commit e36e9da), and the many little details that got in the way of cleaning things up:

The Google-internal Java launcher for Blaze on Linux. Blaze on Linux needn’t worry about JNI at all as its native code is transparently loaded.
The UnixJniLoader class, which was explicitly called in Bazel from non-Windows platforms and in the Blaze on Mac case.
The WindowsJniLoader class, which was explicitly called in Bazel from the Windows platform.
The JniLoader abstract class, which was a more recent attempt at making JNI code loading more transparent but was only used in a bunch of new classes.
Some arbitrary System.loadLibrary calls throughout the code in places where it was known that JNI was unconditionally required. I’m not sure if any of these were in Bazel itself, but the Google-internal modules that required JNI indeed did this—and they did so, especially, to load the non-public libnetwork.dylib and libvfs.dylib.
Checks for an io.bazel.enableJni JVM property to determine whether to load the JNI code or not. This property was added to aid the few use cases that could run without JNI (e.g. the bootstrapping process), although it was not correctly recognized throughout the codebase. In particular, Windows unconditionally needed JNI, whereas Unix platforms could work without it.
Weird hacks to transform .so libraries into .dylibs for macOS’s consumption.
A futile attempt at simplifying JNI usage throughout the codebase by adding the target-os-unix-native-lib and the target-os-native-lib targets, which in retrospect made things even more confusing.
Knowledge in the top-level Bazel binary packaging rules of the JNI libraries to bundle them into the final self-extracting executable, with extra logic to tell the JVM where they are via the java.library.path property.

I’m sure I’m forgetting some interesting details, but you can see from this list that there are a ton of “corner cases” to consider in order to clean things up. I mean… things were really messed up.

There is one thing that did work reasonably well in this design though, and that was the fact that unit tests didn’t have to worry about whether their dependencies needed JNI or not. This wasn’t bullet-proof though: some tests and build rules did have to know about JNI and had to either disable it with io.bazel.enableJni=0 or put the shared libraries in the right place. A very fragile scenario, indeed.

First cleanup attempt

Armed with this knowledge, which was more or less present in my mind before I started but which wasn’t complete until I finished, I started trying to tackle the problem.

IMPORTANT: The key observation to resolve this problem is that, at build time, there is zero coupling between building the Java code that defines native methods and building the native C code that supports them. This is bad because neither you nor the compiler can’t know if things will work until you run them. This, however, is good because it gives us the freedom to build the native code separately from the Java code, and allows us to glue them together at run time.

My first thought was: let’s remove all dependencies from the Java rules to the native libraries. Then, let’s add a single cc_library target that generates a single native library by pulling all individual bits as deps. And, finally, let’s add a trivial System.loadLibrary statement to Bazel’s and Blaze’s main function to load their own single artifact.

I prototyped this solution and was excited by the huge amount of garbage I could remove. The final solution was elegant, very easy to understand, and both Blaze and Bazel both worked… except… tests failed left and right. You see: tests don’t invoke Bazel’s main entry point so they didn’t get a chance to load the needed JNI library in this design. I could have littered the test build graph with explicit dependencies on the JNI library and explicit calls to System.loadLibrary, but that would have been awful. I was trying to simplify things, not make them worse!

So I had to scrap this idea and go back to the drawing board.

A better approach

The two constraints driving my search for a solution were simple:

Ensure we load a single shared library at run time to avoid name clashes.
I had to loosen this restriction a little bit because our build ends up pulling another shared library that’s out of our control: Netty’s kqueue and epoll native bindings. Upon careful examination, I saw that this library is trivial and doesn’t have any dependencies, so we can ignore it for our purposes. But this highlights that the solution I came up with is not generic enough to work for arbitrary projects.
Ensure that any Java target that depends on another target that needs JNI can do so without having to care, at all, that there is a JNI dependency anywhere.
In my opinion, this was an unnegotiable property of the solution. If you have a piece of code depending on something else, you should give zero effs on how that happens: the dependency should just work for you. If the dependency wants to pull in native code, so be it, but you ought not to know.

Simply put, the design I envisioned is depicted in the following diagram:

Interaction between Blaze's Java code and the single JNI library after my changes in a macOS system. Note that what we want is a single `dylib` that embeds all separate libraries at build time so that the compiler can weed out duplicate symbols. It is OK if this `dylib` still depends on other shared libraries, as the dynamic linker knows how to handle those.

Simple, huh? We want a single shared library that bundles all individual native libraries at build time to remove duplicate symbols. We want to load that single library at runtime. And we want the contents of this library to support Google-internal extensions when building Blaze inside of Google, and to remain Google-agnostic when building Bazel outside.

Now, the question was: how do we get here, given all the many corner cases we have to deal with?

Steps towards the solution

The general thought process was: let’s try to isolate all JNI loading logic into a single Java package. Once we have that, we can try to make all other Java code depend on it and not have to worry about native dependencies. And to make that possible, we have to rely on extracting the JNI code from a Java resource instead of relying on the caller correctly preparing the java.library.path property on entry.

More specifically, here are the steps I took:

Bring clarity into the situation by forcing all JNI load requests to happen via a common code path. The end goal for this was to isolate all JNI code in a single place, making it easier to see how to change it. See commit 50adefb.
Simplify the way the Windows code paths loaded JNI. This wasn’t a necessity as part of this project, but was raised as a suggestion during a code review… and I like simplicity, so I did it. See commit 9ce9088.
Now that we had all callers funneling their JNI requests via the JniLoader, there was no reason to keep the separate UnixJniLoader and WindowsJniLoader classes, so they could be merged. This highlighted some important differences between the two: namely, that Windows relies on the runfiles library clutch to load the JNI code, and that this was only done for testing purposes. Ugly. This had to go, but not yet. See commit 7128580.
With the WindowsJniLoader gone, there was no reason to keep the separate windows/jni subpackage in the source tree… so I merged it into its parent. Again, unnecessary simplification as part of this project, but it was easy enough, so why not? See commit d69744d.
The key piece of this puzzle was replacing the way we actually look for and find the JNI library. Instead of relying on external actors such as the Bazel client, the bootstrap script, or unit tests to put the JNI library in java.library.path, we should be able to put the library inside a JAR resource and extract it at runtime. And it was possible. With this, all madness was gone and, finally, once and for all, the JniLoader could work by trivially pulling it in as a dependency. See commit a2a2f9d.
Modify the Google-internal fork of src/main/native:unix_jni to link in the previous network and vfs targets so that the libunix.dylib becomes the only JNI library we have to load at run time. This change was trivial at this point: just a matter of changing BUILD dependencies and replacing ad-hoc System.loadLibrary() calls with calls to our improved JniLoader.loadJni(). I cannot link to this change because it has no correspondence in Bazel though, given that Bazel hasn’t yet suffered from this issue (but it could!).

I like how this sequence highlights the importance of decoupling refactoring and cleanup work from the actual fixes. All of the steps in this list except for the last one were essentially no-ops and only laid the ground work to make the last step simple. And simple it was. Doing the work this way made it much easier for my reviewer to see what was going on and to assess that each step worked correctly. (Mind you, I actually broke the build somewhere in between two steps… and it was “easy” enough to fix it because the delta between changes was small.)

And with that, the original bug is now fixed!

Parting words

Executing this cleanup has been very frustrating as I briefly mentioned in the opening. Part of the reason is because of the many corner cases that had to be dealt with and how these were not clear upfront. But the other and major part of this came from the fact that Blaze and Bazel have divergent BUILD files in some areas, and ensuring they are all consistent with each other is painful. Throw in long BazelCI turnaround times and the need to get things to work on three different platforms… and you are in for pain.

Anyhow. The bug is now fixed. But is what I outlined above the final solution? No, not really. What I’ve done is still an ad-hoc solution to Bazel’s own build process and use of JNI. Anyone writing and building Java code with Bazel also needs to interact with JNI at some point, and those developers are subject to the same issues I’ve described here. I don’t think we have a great answer for them yet. In other words: help welcome.

So long, and thanks for all the fish!