Running a healthy production service

In a previous thread, I covered some techniques to approach on-call shifts and maintain your own well-being. In this thread, I will touch upon the things you can do, as a team, to make your service more sustainable. 🧵 👇

The core theme is: you must care about and fix the little problems to surface the signal from the noise. In doing so, you ensure the team’s collective time can be spent on the problems that matter. As you will see, there aren’t any grandiose ideas here.

If you do want to learn about the grandiose ideas, however, there is plenty of material out there, starting with Google’s popular SRE book and followed by lots more. You may need some of that material to apply the tips here, though.

Lastly, don’t let the simplicity of these tips mislead you. Yes, they are simple. But being on a team that follows them vs. being on a team whose on-call practices have grown “out of control” is like the difference between night and day. So… let’s get started!

📏 Define goals (SLOs) and monitor them (via SLIs). If you don’t know what your service’s offering is and cannot measure it precisely, you can’t tell what’s important and what isn’t. Incidents will be misfiled and you won’t be able to triage them efficiently.
🔧 Make sure mitigation steps exist for common incidents. If you don’t know how to mitigate possible issues, there is no way the service can offer a bound on how quickly outages will be resolved. (That’s right: most SLOs out there are… aspirational.)
📕 All automatic alerts must have troubleshooting guides. The relationship must be 1:1 and the guides must describe the alert’s meaning and steps to assess impact, mitigate, and troubleshoot. When you get a flood of pages, you must be able to weed out the noise.
🤝 Review incidents (pages and tickets) weekly. Run a meeting where you go over all incidents from the previous week, ranging from major outages to a tally of all individual pages. This requires quite a bit of preparation, so someone (a lead, a rotation) must own this meeting.
✒ Have written documentation. People seem to be increasingly consuming learning material via videos. No matter their quality, videos are horrible for on-call response because you cannot quickly extract random information from them.
⌛ Have a “30 seconds” guide. This is a document that lists questions (not answers) that every on-call person should be able to answer, without reference material, in 30 seconds or less. Think of it as a checklist of “the very basics” which on-callers can use to “warm up”.
🦜 Run “walk the plank” exercises. These are periodic meetings where a “volunteer” plays the role of on-call and another person plays the role of the monitoring system or an impatient customer. The team watches to learn new techniques and spot gaps in training material.
💻 Always question “bad machine” as a root cause. If your service is properly designed, it should never fail because a single machine misbehaved (due to a bug or a hardware failure). If the incident is valid, though, understand where the design is faulty and follow up on that.
🤷‍♀️ Always question “transient” root causes. If you get a page and it silently auto-resolves while you are looking into it, the monitoring system is likely measuring the wrong thing (a metric with no visible user impact) or it is too sensitive.
📜 Persist commands in scripts. Whenever you need to share a handy command to mitigate a problem, put that in a script, get it reviewed and check it in. Copy/pasting those commands from emails or shared docs can be dangerous (hint: stale or unsafe operations, smart quotes).
2️⃣ Have a dual set of SLOs. The published ones, which might be backed by an SLA; and some internal ones, which are tighter than the public ones. Then alert on the latter. The idea is to offer a service that’s slightly better than what you promise, and spot problems early on.
📣 But be careful with “offering a better service than promised”. Your customers will come to rely on that behavior and then ask you to maintain it. If you can, build in throttles to offer exactly the service you promise, even if that means slowing things down.
📧 Keep email clean. You need two classes of mailing lists: one exclusively for human-initiated email and one for all automated activity (incident notifications, commit logs, bug reports). Require team members to subscribe to and read the former but make the latter optional.
➗ Clearly separate on-call time vs. project time. Only the people who are on-call should be looking at incoming pages, incidents, and questions. Everyone else has to focus on their assigned projects. This keeps the team focused, and helps uncover if on-call is underfunded.
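
To make the dual-SLO tip a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `DualSlo` and `evaluate` names, the availability-style SLI, and the 99% / 99.5% targets are hypothetical and not taken from any real monitoring system. The idea is simply that you page on the tighter internal target, so you hear about a problem before the published promise is broken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualSlo:
    """Availability targets, as fractions of successful requests."""

    published: float  # what customers are promised (hypothetical value)
    internal: float   # tighter target that the team alerts on (hypothetical value)

    def __post_init__(self) -> None:
        # The internal target must be at least as strict as the published one.
        if not 0.0 < self.published <= self.internal <= 1.0:
            raise ValueError("expected 0 < published <= internal <= 1")


def availability(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    return 1.0 if total == 0 else good / total


def evaluate(slo: DualSlo, good: int, total: int) -> str:
    """Classify a window as 'ok', 'page' (internal target missed), or 'breach'."""
    sli = availability(good, total)
    if sli < slo.published:
        return "breach"  # the published promise is already broken
    if sli < slo.internal:
        return "page"    # early warning: still within the published promise
    return "ok"


slo = DualSlo(published=0.99, internal=0.995)
print(evaluate(slo, good=9_920, total=10_000))  # between the two targets
```

With these made-up numbers, a window at 99.2% availability pages the on-call person even though the published 99% target still holds, which is exactly the early warning the tip is after.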

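The 1:1 rule between alerts and troubleshooting guides is also easy to check mechanically. The sketch below assumes you can export the list of alert names and a list of (guide path, alert name) pairs from wherever you keep them; the `audit_guides` function, the alert names, and the file paths are all hypothetical.

```python
from collections import Counter


def audit_guides(alerts, guides):
    """Report where the alert <-> guide relationship is not 1:1.

    alerts: iterable of alert names.
    guides: iterable of (guide_path, alert_name) pairs.
    """
    alert_names = set(alerts)
    guides = list(guides)  # we iterate over the pairs more than once
    documented = Counter(alert for _, alert in guides)
    return {
        # Alerts that would page someone with no guide to follow.
        "missing_guide": sorted(alert_names - set(documented)),
        # Alerts with several guides: which one is right at 3am?
        "duplicate_guides": sorted(a for a, n in documented.items() if n > 1),
        # Guides that describe alerts that no longer exist.
        "orphan_guides": sorted(p for p, a in guides if a not in alert_names),
    }


# Made-up inventory for illustration.
report = audit_guides(
    alerts=["HighLatency", "DiskFull", "ErrorRate"],
    guides=[
        ("guides/latency.md", "HighLatency"),
        ("guides/latency-old.md", "HighLatency"),
        ("guides/queue.md", "QueueDepth"),
    ],
)
print(report)
```

A check like this could run as part of code review for alert definitions, so a new alert cannot land without its guide.
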
And that’s about it for today. If you choose to adopt any of these tips for your team, be aware that putting them into practice will take dedicated, non-trivial effort. So… good luck! 😊
