Last week, I was on-call for the first time for a part of Azure Storage. My previous background as an SRE at Google helped me remain calm despite my inexperience. And as we have more first-time on-callers joining soon, I couldn’t resist writing some advice for them. Let’s start!
First, a warning: the only thing you must know how and when to do, and the only thing you can be faulted for not doing, is asking for help. No matter how good you are, you will eventually need help (e.g. concurrent outages) and you must be comfortable requesting it.
At the end of the day, it doesn’t matter who fixed the service. The only thing that matters is fixing the problem for your customers as quickly as possible. (Know about “no heroics”? Good. I’ll get to that soon.)
With that out of the way, let’s look at the tips to maintain your well-being. I’ll assume a split rotation here where you are on-call only for part of the day. These tips are super-important because you do not want to burn out or fear being on-call the next time!
Keep calm. I know this is easy to say, and I realize that no matter how I say it, you might not be able to stay calm. It takes time to build an “attitude” towards on-call. The rest of the tips are intended to help you achieve this.
Avoid weekends at first. Your first shift will be stressful. You will not feel comfortable calling your backup during their “free” time. Change your shift. This is a good thing for you but, more importantly, it is also the responsible thing to do for the service.
No heroics. Do not assume that you must fix everything no matter what and end up working 18-hour days. The service is what it is, and you have to take care of it in its current form. You are not going to fix structural problems during your shift, and that’s OK.
Rest. By the end of every on-call day, you will likely have unresolved loose ends. You may want to reply to one more email. Avoid the urge. Hand off ongoing incidents to the next on-call person, who is well-rested and has a fresh mind. Then recharge until the next day.
No project work. Assume that on-call will take your full attention during your shift and the next few days. Do not count on having time to work on anything else. If you must deliver work items during your shift, find someone else to do those or… exchange your shift.
You are on-call, not on-duty. In other words: no call, no response. During business hours you may have to deal with low-priority incidents. But outside of business hours, do not perform production duties unless you get paged. Use that extra time to rest.
I’m sure I’m forgetting something about well-being, but I’ll leave you to the expansive literature on the topic. I was going to cherry-pick parts of Conference Report: SRECon Americas 2019 by @whereistanya, but heck, the whole report is amazing. Go read it.
Anyhow, now that you know that you must be well-rested and comfortable in order to perform well at on-call, let’s dive into a few tips to handle your first on-call shift:
Prioritize. Not all pages are created equal. When you get 5 pages within 30 minutes (it will happen), you need to know what can and cannot be postponed. For example, if an issue has been repeatedly paging for days, it likely can wait a bit longer.
Mitigate; do not fix. Solve the user-visible problem first; make the problem go away. If you find yourself root-causing or reaching for a debugger, you are probably not mitigating. Mitigation will buy you time to address other incoming incidents. Fix the root causes later.
Prepare. Even though formal training may be limited, you can prepare to be a better first-time on-caller. Familiarize yourself with existing playbooks. Ask to handle a few incidents in the days before your shift. Shadow experienced on-callers.
Know how to escalate. And, to conclude, let’s rewind: learn how to summon your backups, other people from your team, other teams, and the incident management team (if you have one). You have to be comfortable doing so and know how to do it. That’s how you get help.
Of course, all of the above items require that the culture around you is supportive of a healthy on-call rotation and of improving operational practices. If that’s not the case… you’ll have to fix that first, or run.