Fair on-call scheduling

Being part of an on-call rotation is a requirement for many software jobs, and fulfilling that requirement should not be a source of stress. One process that can cause friction is how a team schedules its on-call shifts. If scheduling is haphazard, team members will end up with shifts that clash with their personal plans, and they’ll be unhappy. There are ways to prevent this, which is what this post is about.

January 10, 2022 · Tags: sre, wellbeing
Continue reading (about 10 minutes)

Principal engineers should be on-call

A recent tweet that caught my attention read: “principal engineers should be on-call”. Of course they should be! I’m “surprised” they aren’t everywhere, but I can imagine some reasons to justify their situation. Let’s change that in this thread. 🧵 👇

July 14, 2021 · Tags: opinion, sre, twitter-thread
Continue reading (about 4 minutes)

Running a healthy production service

In a previous thread, I covered some techniques to approach on-call shifts and maintain your own well-being. In this thread, I will touch upon the things you can do, as a team, to make your service more sustainable. 🧵 👇

June 18, 2021 · Tags: sre, twitter-thread
Continue reading (about 4 minutes)

Tips on well-being while on-call

Last week, I was on call for the first time for a part of Azure Storage. My previous background as an SRE at Google helped me remain calm despite my inexperience. And as we have more first-time on-callers joining soon, I couldn’t resist writing some advice for them. Let’s start! 🧵

March 10, 2021 · Tags: sre, twitter-thread
Continue reading (about 4 minutes)

Seeding a file server quickly

Say you want to copy a large collection of files to a file server on the same network. What’s the fastest way to do this initial copy? Physically attaching the drive to the server? Maybe, but will the file systems be compatible? What about using the network? If so, which protocol? Read on for more details and to see how tar plus Netcat delivered the best results.

February 5, 2021 · Tags: sre
Continue reading (about 6 minutes)

Ensuring system rewrites are truly necessary

You probably know that software rewrites, while very tempting, are expensive and can be the mistake that kills a project or a company. Yet they are routinely proposed as the solution to all problems. Is there anything you can do to minimize the risk? In this post, I propose that you actively improve the old system to ensure the new system cannot make progress in a haphazard way. This forces the new system to be designed in a way that delivers breakthrough improvements and not just incremental ones.

January 24, 2020 · Tags: bazel, essay, featured, sre
Continue reading (about 7 minutes)

The fallacy of forbidding assertions

There are two ways to handle abnormal conditions in a program: errors and assertions. Errors are a controlled mechanism by which the program propagates details about a faulty condition up the call chain, be it with explicit error return statements or with exceptions. Errors must be used to validate all conditions that might be possible but aren’t valid given the context. Examples include sanitizing any kind of input (as provided by the user or incoming from the network) and handling error codes from system calls or libraries.

July 24, 2018 · Tags: programming, reliability, sre
Continue reading (about 5 minutes)
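
As a rough illustration of the distinction drawn in the excerpt above, here is a minimal Python sketch. The function names `parse_port` and `port_to_bytes` are hypothetical and not taken from the post. The excerpt only spells out the role of errors; the reading of assertions as guards for conditions that should be impossible in a correct program is an assumption based on common usage.

```python
def parse_port(text: str) -> int:
    """Validate untrusted input. Bad values are possible, so report errors."""
    try:
        port = int(text)
    except ValueError:
        raise ValueError(f"not a number: {text!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def port_to_bytes(port: int) -> bytes:
    """Internal helper (hypothetical): callers pass an already validated port."""
    # Assumption: an assertion marks a condition that cannot happen unless
    # this program has a bug. Note that Python skips asserts under -O.
    assert 1 <= port <= 65535, "caller must validate the port first"
    return port.to_bytes(2, "big")
```

With this split, `parse_port("70000")` raises a `ValueError` that a caller can handle and report, while a failing assertion in `port_to_bytes` would point at a bug in the calling code and not at bad input.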