About two weeks ago, I found a very interesting bug in Bazel’s test output streaming functionality while writing tests for a new feature related to Ctrl+C interrupts. I fixed the bug, wrote a test for it, and… the test itself came back as flaky, which made me find another very subtle bug in the test that needed a one-line fix. This is the story of both.


Bazel has a feature known as test output streaming: by default, Bazel captures the outputs (stdout and stderr) of the tests it runs, saves those in local log files, and tells the user where they are when a test fails. This is not very ergonomic when you are iterating on a test, so you can make Bazel print the output of the test as it runs by passing --test_output=streamed to the invocation.
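For example (the target label here is made up):

bazel test //my/pkg:some_test                           # failures point you at a test.log file
bazel test --test_output=streamed //my/pkg:some_test    # output is printed live as the test runs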

Test output streaming sounds fancy: when I first encountered this, I envisioned Bazel truly streaming the output from a remote server as it was generated. But things are not like that: enabling test output streaming has the side-effect of running the tests locally. And if you look at the internals, all the magic vanishes: there is a class called FileWatcher that uses a thread to monitor for log file changes every 100ms. Sad face.
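If you want a mental model for that, here is a rough shell rendition of the polling loop (the real thing is a Java thread, log_file is a placeholder, and the real watcher also knows when to stop):

offset=0
while true; do
  size=$(wc -c <"${log_file}")
  if [ "${size}" -gt "${offset}" ]; then
    # Print only the bytes appended since the last check.
    tail -c +"$((offset + 1))" "${log_file}"
    offset="${size}"
  fi
  sleep 0.1   # check for log file changes every 100ms
done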

Bazel’s client/server architecture

To further understand this story, we need to look at how Bazel prints messages to the console. “It’s just a command-line tool, duh, so it can print to stdout as it wishes!” I hear you saying. Yeeees… but no. Things are more complex than that.

Bazel has a client/server architecture where both processes always run on the same machine. The server is the long-lived Java process that keeps state in memory and processes the commands, which means it’s the component that executes the tests. The ephemeral C++ client connects to this server to invoke a command and waits for the server to terminate execution of that command before returning control to the user.

See where I’m going? The server is a daemon process: it has no access to the console. The client is the process that is attached to the console. While the client is actively running a command on the server, it fetches stdout/stderr data from the server and prints it to the console. This communication all happens over gRPC (via a true stream). And once the client is gone, the server cannot print anything else.
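You can observe this split yourself from a terminal:

bazel info server_pid                                # the client asks the server for its pid
ps -p "$(bazel info server_pid)" -o pid,etime,comm   # the server outlives each client invocation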

The Ctrl+C bug during output streaming

So here is the bug: if you happened to be streaming the output of a test and you hit Ctrl+C, the next Bazel command you ran would detect that the server wasn’t there and start a new one.

This is unexpected. The server is intended to be long-lived and reused across different commands. Given that I had not changed any startup flags between the two invocations, the only possible explanation for this behavior was that the server had died while handling Ctrl+C. But this is very subtle behavior: there were no visible error messages. The only thing you could notice was the server starting a second time and thus the second command taking longer than it should have. But if you didn’t pay attention, it was super-easy to miss that something was wrong.
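In terminal terms, the symptom looked like this (the target label is, again, made up):

bazel test --test_output=streamed //my/pkg:sleepy_test
# ... press Ctrl+C while the output is streaming ...

bazel build //my/pkg:sleepy_test
# Any command works here. If the server died, this one greets you with:
#   Starting local Bazel server and connecting to it...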

But I did notice, because I was writing a test precisely for a new feature involving Ctrl+C interrupt handling. And in this test, I saw things not behaving as expected, so I rolled up my sleeves and dug in a little.

Given that iterating on Bazel integration tests is painful due to how slow they can be, I tried to reproduce the problem by hand, which was easy. But I had to look into the server’s java.log to actually find an indication of misbehavior:

java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:508)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:340)
        at com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver.onNext(GrpcServerImpl.java:192)
        at com.google.devtools.build.lib.server.GrpcServerImpl$RpcOutputStream.write(GrpcServerImpl.java:258)
        at com.google.devtools.build.lib.util.io.DelegatingOutErr$DelegatingOutputStream.write(DelegatingOutErr.java:117)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
        at com.google.devtools.build.lib.runtime.UiEventHandler$FullyBufferedOutputStreamMaybeWithCounting.flush(UiEventHandler.java:218)
        at com.google.devtools.build.lib.runtime.UiEventHandler.handleIfCapacityPermits(UiEventHandler.java:402)
        at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:502)
        at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:521)
        at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:142)
        at com.google.devtools.build.lib.events.ReporterStream.write(ReporterStream.java:55)
        at com.google.devtools.build.lib.exec.TestLogHelper$FilterTestHeaderOutputStream.write(TestLogHelper.java:123)
        at com.google.devtools.build.lib.util.io.FileWatcher.run(FileWatcher.java:96)

Yikes. The UI again? Please no. Spoiler alert: it wasn’t.

I won’t get into a lot of detail here, but if we trace through the classes involved, what we find is that Bazel is writing into the gRPC stream used to feed stdout from the server to the client, but it does so after the client has terminated. In other words: we have a resource leak (the gRPC stream) that outlives the client and, thanks to some paranoid coding, we detect an invalid state later on.

This didn’t surprise me. Handling interrupts correctly is very difficult no matter the programming model: signal handlers, Go channels, Java interrupts… all of them are hard. Furthermore, interrupts are rare, so their handlers often see little testing. What surprised me, though, was: if this is such an obvious bug, how come nobody had ever noticed it before? Bazel has a long history and tons of invocations per day, so surely somebody must have hit Ctrl+C and triggered this?

Well, apparently I might have been the first one to notice and pause to think that something was off. The reason is that the misbehavior we observe here depends on the strategy used to run the tests. As we saw in the past, the strategy is the component in charge of running subprocesses and dealing with their outputs, and thus the strategy is the component that is interrupted when you hit Ctrl+C. It was possible that the strategy was leaking resources in some way and making them outlive the client.

And indeed that was the case. Months earlier, I had spent some time ensuring the local strategy handled process interrupts correctly… but I had not touched the sandboxed strategy at all. And, guess what, I was seeing problems because the sandboxed strategy was in use. Which explains why this hadn’t seen a ton of exposure before: within Google, for example, we default to remote test execution, so cancellations happen in very different ways.

Once I had diagnosed the problem, fixing it was relatively straightforward, but it required a code audit. I found the obvious issue: the sandboxed strategy was killing the subprocess but failing to wait for its completion (sending a signal is an asynchronous action). I also found other possible leaks along the way. Writing the integration test for the fix wasn’t that simple, though.
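The actual fix lives in Bazel’s Java strategy code, but the shape of the bug is easy to express in shell terms:

# Buggy: fire and forget. The subprocess can outlive us and keep writing to
# resources (such as the gRPC output stream) that are about to be torn down.
kill "${test_pid}"

# Fixed: the signal is delivered asynchronously, so also wait for the exit.
kill "${test_pid}"
wait "${test_pid}" || true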

The new integration test

Here is how the test for this bug fix works:

  1. Make Bazel run a test program that prints a “cookie” message and then simply sleeps for a long time, waiting to be interrupted.

  2. Run the test with --test_output=streamed in a background invocation of Bazel, redirecting all stdout/stderr to a log file.

  3. Periodically grep the log file until the “cookie” message shows up, which indicates that Bazel has reached the point at which we want to send the interrupt.

  4. Send an interrupt signal to Bazel (the client).

  5. Wait until the background process (the client) finishes to ensure the command has completed.

  6. Check the Bazel server logs to verify that there isn’t a crash (whitebox testing).

  7. Repeat all steps from above to run the test a second time, without explicitly shutting down the Bazel server.

  8. Verify that the output of the second run doesn’t say that the Bazel server had to start again (blackbox testing).
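Condensed into shell, the whole thing looks more or less like this. This is a sketch, not the real test code: the target label and cookie string are made up, and the real test uses the framework’s assertion helpers rather than bare exits.

run_interrupted_test() {
  # Steps 1-2: stream a long-sleeping test's output from a background client.
  bazel test --test_output=streamed //pkg:sleepy_test >"${TEST_log}" 2>&1 &
  local client_pid="${!}"

  # Step 3: poll until the test program prints the cookie.
  until grep -q "Ready for interrupt" "${TEST_log}" 2>/dev/null; do
    sleep 1
  done

  # Steps 4-5: interrupt the client, as Ctrl+C would, and let it finish.
  kill -INT "${client_pid}"
  wait "${client_pid}" || true
}

run_interrupted_test
# Step 6, whitebox: the server must not have crashed.
if grep -q "IllegalStateException" "$(bazel info output_base)/java.log"; then
  echo "FAIL: server crashed while handling the interrupt"; exit 1
fi

run_interrupted_test   # Step 7: run again, without shutting down the server.
# Step 8, blackbox: the second run must not have started a new server.
if grep -q "Starting local Bazel server" "${TEST_log}"; then
  echo "FAIL: the server had died and was restarted"; exit 1
fi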

It took me a while to get all the details right, possibly because of the very subtle bug in the test I will discuss below. But eventually I got the test to pass on all the platforms BazelCI supports, which is always… involved, so I submitted it and finished the work on the original Ctrl+C feature that brought me here.

Flakiness in the new test

A few days later, the dreaded report came: “your new test is flaky”. Yikes, I hate bugs like these. Debugging Bazel integration tests and looking for flakiness is a task that can easily suck up a full day, if not more. Luckily, Bazel’s --runs_per_test=100 combined with remote execution is super-handy here, as you can expose almost any kind of flakiness in minutes.
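Something like this, with a hypothetical test label, produces a hundred data points in one go:

bazel test --runs_per_test=100 //src/test/shell/integration:interrupt_test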

I’m not exactly sure how I went about troubleshooting the problem, but it wasn’t trivial. I chased a bunch of loose ends, and at some point I added timestamps to the various steps the test was taking. This was the key to uncovering the bug: somehow, the second iteration of the test was running almost immediately after the first one. That wasn’t normal, because the test does some pretty slow polling (1-second steps) to wait for Bazel to reach the interrupt point. I was expecting to see, at least, a 1-second delay between the first and the second test. But it wasn’t there.

And then it hit me. Here is the simplified version of how I ran Bazel in the background and waited for it to show the “cookie” message:

bazel "${@}" >"${TEST_log}" 2>&1 &
while ! grep -q "Ready for interrupt" "${TEST_log}"; do
  sleep 1
done

I had implemented this same logic a bunch of times in different tests and never had problems, so I didn’t give it a second thought. But this test is slightly different because it runs the above code snippet twice in a loop, without the test framework’s cleanup intervening.

The most plausible explanation for the lack of any delay was that the second run was picking up the log of the previous invocation by mistake. And that was precisely the problem. But why? How?

The reason is simple: the code above works reliably when TEST_log doesn’t exist, but it is racy when it does. You see: the first line in the snippet above, the one that runs bazel in the background, is problematic. That line makes the shell redirect the output to TEST_log… but the key observation is that the redirection is configured in the subprocess: when that statement completes, there is no guarantee that TEST_log has been truncated and set up as the output of the bazel command yet. “Obviously” you might say, and it is obvious if you think about it… but you need to know quite a bit about how the shell works to reason about this—and it’s hard to notice after you’ve written this same pattern successfully countless times.
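You can watch the race in isolation with a tiny experiment. Whether it actually trips depends on scheduling, which is exactly what makes it flaky:

# Simulate a leftover log from a previous iteration of the test.
echo "Ready for interrupt" >"${TEST_log}"

# The redirection below is performed by the forked child, not by this shell,
# so the truncation may not have happened yet when the next line runs.
sleep 5 >"${TEST_log}" 2>&1 &

if grep -q "Ready for interrupt" "${TEST_log}"; then
  echo "raced: saw the stale cookie before the child truncated the log"
fi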

With this in mind, the fix for the test’s flakiness was trivial: ensure we start without a TEST_log (or with an empty log) to remove the possibility of the race condition.
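In other words, something along these lines, where the deletion happens synchronously in the parent shell before the background process is launched:

rm -f "${TEST_log}"   # or: : >"${TEST_log}" to truncate it in the parent, synchronously
bazel "${@}" >"${TEST_log}" 2>&1 &
# The file may not exist until the child creates it, hence the 2>/dev/null.
while ! grep -q "Ready for interrupt" "${TEST_log}" 2>/dev/null; do
  sleep 1
done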