Showing 3 posts
As you might have read elsewhere, I’m leaving the Bazel team and Google in about a week. My plan for these last few weeks was to hand things off as cleanly as possible… but I was also nerd-sniped by a bug that came my way a fortnight ago. Fixing it has been my self-inflicted punishment for leaving, and oh my, it has been painful. Very painful. Let me tell you the story of this final boss.
About two weeks ago, I found a very interesting bug in Bazel’s test output streaming functionality while writing tests for a new feature related to Ctrl+C interrupts. I fixed the bug, wrote a test for it, and… the test itself came back as flaky, which made me find another very subtle bug in the test that needed a one-line fix. This is the story of both. Bazel has a feature known as test output streaming: by default, Bazel captures the outputs (stdout and stderr) of the tests it runs, saves those in local log files, and tells the user where they are when a test fails.
About a month ago, I was benchmarking the impact of a new Bazel feature and I noticed that a test build that should have taken only a few seconds took almost 10 minutes. My Internet connection was flaking out indeed, but something else didn’t seem right. So I looked and found that Bazel was doing network calls within a critical section, and these were the root cause behind the massive slowdown. But how did we get such an obvious no-no into the codebase? Read on to see how this happened and how gnarly it was to fix!