Essays on software development with a focus on quality and production engineering. Mostly.
After a little over 11 years, it's time for a much longed change: I'm leaving Google and I'm joining Microsoft as a Principal Software Engineer for Azure. These job changes are effective as of this week, but my family and I already moved from New York City to Redmond, WA about three weeks ago. Read on for a recap on my tenure at Google, the whys behind my departure, and how I ended up choosing the position in Microsoft Azure after mulling over offers from Facebook, Twitter, and Microsoft.
As you might have read elsewhere, I’m leaving the Bazel team and Google in about a week. My plan for these last few weeks was to hand things off as cleanly as possible… but I was also nerd-sniped by a bug that came my way a fortnight ago. Fixing it has been my self-inflicted punishment for leaving, and oh my, it has been painful. Very painful. Let me tell you the story of this final boss.
About two weeks ago, I found a very interesting bug in Bazel’s test output streaming functionality while writing tests for a new feature related to Ctrl+C interrupts. I fixed the bug, wrote a test for it, and… the test itself came back as flaky, which made me find another very subtle bug in the test that needed a one-line fix. This is the story of both. Bazel has a feature known as test output streaming: by default, Bazel captures the outputs (stdout and stderr) of the tests it runs, saves those in local log files, and tells the user where they are when a test fails.
About a month ago, I was benchmarking the impact of a new Bazel feature and I noticed that a test build that should have taken only a few seconds took almost 10 minutes. My Internet connection was flaking out indeed, but something else didn’t seem right. So I looked and found that Bazel was doing network calls within a critical section, and these were the root cause behind the massive slowdown. But how did we get such an obvious no-no into the codebase? Read on to see how this happened and how gnarly it was to fix!
The pkgsrc package database, which by default lives in /var/db/pkg/, should not be there. Instead, it should be under /usr/pkg/libdata/pkgdb/. The same applies to FreeBSD's and OpenBSD's ports and also Debian's dpkg, but I'll focus on pkgsrc because it's the system I know best. Let's see why the current default is suboptimal and why libdata is a good alternative.
The scripts that live under /etc/rc.d/ in FreeBSD, NetBSD, and OpenBSD are in the wrong place. They all should live in /libexec/rc.d/ because they are code, not configuration. Let's look at the history of these systems to see how we got here, why this is problematic, and how things would look like in a better world.
In the previous post, we saw how .d directories permit programmatic edits to system-wide configuration with ease. But this same concept can be applied to other kinds of tracking. Let's dive into a few examples ranging from desktop menu entries to the package manager's database itself.