Tips on well-being while on-call

Last week, I was first-time on-call for a part of Azure Storage. My previous background as an SRE at Google helped me remain calm despite my inexperience. And as we have more first-time on-callers joining soon, I couldn’t resist writing some advice for them. Let’s start! 🧵

March 10, 2021 · Tags: sre, twitter-thread
Continue reading (about 4 minutes)

Seeding a file server quickly

Say you want to copy a large collection of files to a file server on your same network. What’s the fastest way to do this initial copy? Physically attaching the drive to the server? Maybe, but will the file systems be compatible? What about using the network? If so, which protocol? Read on for more details and how tar plus Netcat delivered the best results.

February 5, 2021 · Tags: sre
Continue reading (about 6 minutes)

Ensuring system rewrites are truly necessary

You probably know that software rewrites, while very tempting, are expensive and can be the mistake that kills a project or a company. Yet they are routinely proposed as the solution to all problems. Is there anything you can do to minimize the risk? In this post, I propose that you actively improve the old system to ensure the new system cannot make progress in a haphazard way. This forces the new system to be designed in such a way that delivers breakthrough improvements and not just incremental improvements.

January 24, 2020 · Tags: bazel, essay, featured, sre
Continue reading (about 7 minutes)

The fallacy of forbidding assertions

There are two ways to handle abnormal conditions in a program: errors and assertions. Errors are a controlled mechanism by which the program propagates details about a faulty condition up the call chain—be it with explicit error return statements or with exceptions. Errors must be used to validate all conditions that might be possible but aren’t valid given the context. Examples include: sanitizing any kind of input (as provided by the user or incoming from the network), and handling error codes from system calls or libraries.

July 24, 2018 · Tags: programming, reliability, sre
Continue reading (about 5 minutes)