How "new type" helps avoid production outages

My January links recap included the “Phantom Types” article by David Soria Parra. In it, the author briefly touches upon the “new type” idiom, its typical implementation in Rust, and then proceeds to propose a better alternative. But the question arises: why should you care? To answer why this idiom is useful, I want to present you with a real production problem we faced in the Storage Infrastructure team at Google circa 2010.

March 9, 2024 · Tags: blogsystem5, rust, sre, twitter-thread
Continue reading (about 4 minutes)

Good performance is not just big O

Having a fast and responsive app is orthogonal to “knowing your big Os”. Unfortunately, most tech companies over-emphasize algorithms in interviews and downplay systems knowledge, and I believe that’s one reason behind sluggish apps and bloated systems. I’ve seen this play out repeatedly. Interviewers ask a LeetCode-style coding question, which is then followed by the ritual of discussing time and memory complexity. Candidates ace the answers. But then… their “real” code suffers from subtle yet impactful performance problems. Focusing on big O complexity rarely matters in most apps. Sure, it’s important to think about your algorithmic choices, but there are so many more details to worry about that have a direct impact on app performance and responsiveness. Let’s look at a bunch of them!

September 8, 2023 · Tags: programming, sre, twitter-thread
Continue reading (about 6 minutes)

Costs exposed: Frameworks

“Fast machines, slow machines”… ah, the post that spawned these series. As I frantically typed that article while replying to angry tweets, the thought came to mind: software engineering as a whole is hyper-focused on lowering the costs to write new code, yet there is a disregard for the costs that these improvements bring to other disciplines in a company on even to end users. So, in this series finale, I want to compare how some choices that apparently lower development costs actually increase costs elsewhere. I also want to highlight how, if we made different decisions during development, we could possibly expose those extra costs early on. This is beneficial because exposing costs upfront allows us to make tough choices when there is still a chance of changing course. To make things specific, I will look at how the use of modern frameworks that facilitate development can end up hurting performance, reliability, and usability. So let’s start with a three-part rant first (sorry) and then let’s look at what we might do.

August 31, 2023 · Tags: opinion, programming, sre
Continue reading (about 7 minutes)

Costs exposed: On-call ticket handling

In the previous post, I proposed that certain engineering practices expose systemic costs and help with planning while other practices hide those same costs and disturb ongoing plans. The idea I’m trying to convey is hard to communicate in the abstract so, in that post, I used the differences between a monorepo and a multirepo setup as an example. Today, I’ll expore a different scenario to support the same idea. I’m going to talk about how certain ticket assignment practices during on-call operations can expose service support costs vs. how other practices hide them. Keep in mind that, just like in the previous post, I do not want to compare the general merits of one approach vs. the other. The only thing I want to compare is whether one approach centralizes toil and allows management to quantify its cost vs. how another approach hides toil by smearing it over the whole team in hard-to-quantify ways. Whether management actually does something to correct the situation once the costs are exposed is a different story.

August 26, 2023 · Tags: opinion, programming, sre
Continue reading (about 7 minutes)

Fair on-call scheduling

Being part of an on-call rotation is a requirement for many software job positions and fulfilling this requirement should not come with stress. One process that can cause friction is how a team schedules its on-call shifts. If the on-call scheduling process is haphazard, team members will end up with shifts over their personal plans, and they’ll be unhappy. But there are ways to prevent this, which is what this post is about.

January 10, 2022 · Tags: sre, wellbeing
Continue reading (about 10 minutes)

Principal engineers should be on-call

A recent tweet that caught my attention read: “principal engineers should be on-call”. Of course they should be! I’m “surprised” they aren’t everywhere, but I can imagine some reasons to justify their situation. Let’s change that in this thread. 🧵 👇

July 14, 2021 · Tags: opinion, sre, twitter-thread
Continue reading (about 4 minutes)

Running a healthy production service

In a previous thread, I covered some techniques to approach on-call shifts and maintain your own well-being. In this thread, I will touch upon the things you can do, as a team, to make your service more sustainable. 🧵 👇

June 18, 2021 · Tags: sre, twitter-thread
Continue reading (about 4 minutes)

Tips on well-being while on-call

Last week, I was first-time on-call for a part of Azure Storage. My previous background as an SRE at Google helped me remain calm despite my inexperience. And as we have more first-time on-callers joining soon, I couldn’t resist writing some advice for them. Let’s start! 🧵

March 10, 2021 · Tags: sre, twitter-thread
Continue reading (about 4 minutes)

Seeding a file server quickly

Say you want to copy a large collection of files to a file server on your same network. What’s the fastest way to do this initial copy? Physically attaching the drive to the server? Maybe, but will the file systems be compatible? What about using the network? If so, which protocol? Read on for more details and how tar plus Netcat delivered the best results.

February 5, 2021 · Tags: sre
Continue reading (about 6 minutes)

Ensuring system rewrites are truly necessary

You probably know that software rewrites, while very tempting, are expensive and can be the mistake that kills a project or a company. Yet they are routinely proposed as the solution to all problems. Is there anything you can do to minimize the risk? In this post, I propose that you actively improve the old system to ensure the new system cannot make progress in a haphazard way. This forces the new system to be designed in such a way that delivers breakthrough improvements and not just incremental improvements.

January 24, 2020 · Tags: bazel, essay, featured, sre
Continue reading (about 7 minutes)

The fallacy of forbidding assertions

There are two ways to handle abnormal conditions in a program: errors and assertions. Errors are a controlled mechanism by which the program propagates details about a faulty condition up the call chain—be it with explicit error return statements or with exceptions. Errors must be used to validate all conditions that might be possible but aren’t valid given the context. Examples include: sanitizing any kind of input (as provided by the user or incoming from the network), and handling error codes from system calls or libraries.

July 24, 2018 · Tags: programming, reliability, sre
Continue reading (about 5 minutes)