In the previous post, I proposed that certain engineering practices expose systemic costs and help with planning, while other practices hide those same costs and disrupt ongoing plans.
The idea I’m trying to convey is hard to communicate in the abstract so, in that post, I used the differences between a monorepo and a multirepo setup as an example. Today, I’ll explore a different scenario to support the same idea. I’m going to talk about how certain ticket assignment practices during on-call operations can expose service support costs vs. how other practices hide them.
Keep in mind that, just like in the previous post, I do not want to compare the general merits of one approach vs. the other. The only thing I want to compare is whether an approach centralizes toil and lets management quantify its cost, or hides toil by smearing it over the whole team in hard-to-quantify ways. Whether management actually does something to correct the situation once the costs are exposed is a different story.
Let’s begin.
On-call shifts are common in software development teams, yet I think it’s fair to say that most developers hate and fear them. That’s a natural reaction because, in general, on-call rotations are run haphazardly.
In a well-functioning on-call rotation, individuals are either “on” or “off”. When they are “on”, their only responsibility is to attend to incoming tickets: they are not expected to work on their deliverables. When they are “off”, their only responsibility is to attend to their deliverables: they can rest assured that the current on-call will shield them from production fires.
Now, it is possible that the team’s deliverables will have to change because of production incidents, but those changes in the plan will be factored into the schedule, say, in the following sprint or quarterly planning meeting. Until that happens, the “off” people don’t have to be distracted by ongoing fires. I know, I know, this is an extremely hard-to-achieve ideal, but I’ve seen it work in mature SRE teams and being on-call for those was a reasonable experience. I highly recommend watching the “Bad Machinery - Managing Interrupts Under Load” presentation by Dave O’Connor on this topic.
Unfortunately, while most on-call rotations are well-intentioned, they are not well-functioning. This is particularly true of large teams (30+ people) divided into subteams (4-10 people each). In such a large organization, no single individual can realistically be on-call for the whole service due to its scope—yet teams insist on putting a single person on-call. But is that person the only one that’s “on”? Nope. In reality, many more team members are busy with past incidents.
Here is how that happens: incoming tickets are assigned to the current on-call individual in a sticky manner, meaning that said individual “owns” those tickets until they are either resolved or transferred to a Subject Matter Expert (SME)—no matter whether their shift has ended. This fire-hose approach seems to work… but it drains individual team members and hurts overall team productivity, leading to delayed deliverables and production issues that are never fixed.
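To make the sticky policy concrete, here is a minimal sketch in Rust; the types, function, and names (`Ticket`, `assign_sticky`, the engineers) are hypothetical and do not correspond to any real ticketing system. The property to notice is that ownership is decided once, at arrival time, and never revisited when shifts change.

```rust
// Hypothetical sketch of the sticky ("fire-hose") assignment policy.

struct Ticket {
    id: u32,
    owner: String, // set once, when the ticket arrives, and never revisited
    resolved: bool,
}

/// Whoever happens to be on-call when a ticket arrives owns it for good,
/// regardless of when their shift ends.
fn assign_sticky(tickets: &mut Vec<Ticket>, id: u32, current_oncall: &str) {
    tickets.push(Ticket {
        id,
        owner: current_oncall.to_string(),
        resolved: false,
    });
}

fn main() {
    let mut tickets = Vec::new();

    // Week 1: Alice is on-call and catches two incidents.
    assign_sticky(&mut tickets, 1, "alice");
    assign_sticky(&mut tickets, 2, "alice");

    // Week 2: Bob is on-call now, but Alice still owns tickets 1 and 2 even
    // though she is nominally "off" and back on project work.
    assign_sticky(&mut tickets, 3, "bob");

    for ticket in &tickets {
        println!("#{}: owner={}, resolved={}", ticket.id, ticket.owner, ticket.resolved);
    }
}
```

Nothing in this model ever clears Alice’s queue when her shift ends; that only happens through resolution or an explicit transfer to an SME.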
These are the reasons that are often cited to justify this fire-hose approach… along with why they aren’t a great idea:
“The process is fair from a time-sharing perspective!”
Well, no, it only looks fair on paper. What this practice does is sweep operational work under the rug. Incoming issues are randomly spread across team members, and it is very likely that those issues will not be resolved during the on-call’s scheduled shift. As a result, every individual continues to own incidents well past their “on” time, and because those are production incidents, the assignees are obviously expected to work on and resolve “their” incidents “with priority”, preempting their project work.
Now, the counter-argument is that these individuals should transfer incidents to the SME as soon as their “on” shift ends. But there are two problems with this. The first is that an SME may not even exist in the team anymore. The second is that this assignment process is adversarial: junior folks will almost never transfer an incident to a senior person because… of the next point.
“Everyone should be able to be on-call for the whole service!”
Yeah, that’s a great ideal: it’d be awesome if everyone in the team were able to handle any incoming incident, but that is not what happens. Large teams tend to grow to cover too many responsibilities too fast, so it ends up being impossible for any one individual to know everything about the system.
There are select exceptions though: the leads that created the product. These folks—if they are still around—have grown with the product and have sufficient knowledge to troubleshoot most problems under the pressure of a production outage. Unfortunately, these outlier folks also tend to be the ones with the power to change the status quo, yet because they can handle the load themselves, they don’t realize that there is a problem, which makes change harder.
As a result of these two attitudes, on-call operations seem cheap from management’s perspective: there is only one person on-call at any given time for the whole team who, somehow, manages to resolve all issues assigned to them. There are often delays in resolving incidents and repeats of past outages, sure, but they aren’t large enough or frequent enough to be concerning. (Narrator voice: until they are.)
What’s not so clear is that every other team member that is not on-call is also sinking time into incidents because they are keeping an eye on past and ongoing issues. First, previous on-call individuals carry past issues well into their “off” periods, disturbing their deliverables. Remember that context switches are harmful to productivity, and those switches are even worse if they require talking to customers. And, second, past and new incidents pull random engineers from the team to help, preempting their assigned work.
Under this model, it is really difficult to quantify how much time is spent caring for production, so it is impossible to properly budget such time against feature work. “Just put one person on-call” should not be a convincing argument without data. It is also hard to identify recurring issues and follow up on them with long-term fixes, so the service’s quality degrades over time.
What’s the alternative, assuming there is one? It turns out that there is, and it is what I mentioned at the beginning of the post:
- a clear “on” / “off” split for operational support;
- a clear process to transfer ongoing incidents to the following on-call person, along with an expectation that this is the normal thing to do; and
- a clear distinction between mitigation practices and resolution efforts.
Simply put: whenever a person is on-call, all incidents come to this person and this person owns them until they are mitigated or until their shift ends, whichever happens first. Once mitigated, the on-call files follow-up repair tasks, and these tasks are assessed against other priorities in the following planning meeting. The main difference from the fire-hose approach is that, once a person’s “on” shift ends, they can get back to their previously assigned, pre-planned work.
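For contrast with the sticky sketch from earlier, here is a rough sketch of that handoff, again with made-up types and names (`Incident`, `RepairTask`, `end_of_shift`); the only thing it is meant to show is that the shift boundary reassigns open incidents and converts mitigated ones into planned repair tasks.

```rust
// Hypothetical sketch of the handoff policy.

enum Status {
    Open,      // still being mitigated
    Mitigated, // production impact stopped; long-term fix pending
}

struct Incident {
    id: u32,
    status: Status,
    owner: String,
}

/// Follow-up work filed after mitigation; it is prioritized at the next
/// planning meeting instead of being worked on right away.
struct RepairTask {
    incident_id: u32,
}

/// At the shift boundary, open incidents move to the next on-call person
/// (the normal, expected thing to do) and mitigated incidents leave the
/// on-call queue as repair tasks for planning.
fn end_of_shift(
    active: Vec<Incident>,
    next_oncall: &str,
    planning_backlog: &mut Vec<RepairTask>,
) -> Vec<Incident> {
    let mut still_active = Vec::new();
    for mut incident in active {
        match incident.status {
            Status::Open => {
                incident.owner = next_oncall.to_string();
                still_active.push(incident);
            }
            Status::Mitigated => {
                planning_backlog.push(RepairTask { incident_id: incident.id });
            }
        }
    }
    still_active
}

fn main() {
    let week1 = vec![
        Incident { id: 1, status: Status::Mitigated, owner: "alice".to_string() },
        Incident { id: 2, status: Status::Open, owner: "alice".to_string() },
    ];
    let mut planning_backlog = Vec::new();

    // Alice's shift ends: incident 2 transfers to Bob, incident 1 becomes a
    // repair task to assess at the next planning meeting, and Alice is done.
    let week2 = end_of_shift(week1, "bob", &mut planning_backlog);

    println!("incident {} is now owned by {}", week2[0].id, week2[0].owner);
    println!("repair task filed for incident {}", planning_backlog[0].incident_id);
}
```

Whether this is encoded in a ticketing tool or just followed by convention matters less than the shared expectation that handing off at the boundary is normal, not a failure.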
If these practices are strictly followed, it quickly becomes clear whether a single on-call person can sustain the health of the service or not. If a single person cannot, incidents will pile up or will not be mitigated correctly. At that point, management will have to allocate additional people to handle incidents. And if they assign extra people to on-call operations, they can also choose to invest in root-cause resolution to reduce overall service load, reducing the total number of people assigned to operations at any given time. (But yes, whether this extra investment happens or not is orthogonal to exposing costs and I don’t want to get into that. Sometimes management will simply not want to invest in extra reliability and on-call health will be terrible. Run if you can?)
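As a back-of-the-envelope illustration of that signal, consider the arithmetic below; every number is invented for the example. The point is only that, once all incidents flow through the on-call person, the cost of operations collapses into a single, visible staffing figure instead of being smeared over everyone’s “off” time.

```rust
// Invented numbers; only the shape of the calculation matters.

fn main() {
    let shift_hours: f64 = 40.0;         // one person "on" for a one-week shift
    let incidents_per_shift: f64 = 25.0; // hypothetical incoming rate
    let hours_per_mitigation: f64 = 2.5; // hypothetical average mitigation effort

    let demand = incidents_per_shift * hours_per_mitigation; // 62.5 hours
    let people_needed = (demand / shift_hours).ceil();       // 2 people

    if demand > shift_hours {
        println!(
            "{demand} hours of mitigation per {shift_hours}-hour shift: \
             about {people_needed} people must be on at any given time."
        );
    } else {
        println!("A single on-call person can absorb the load.");
    }
}
```

Under the fire-hose model, those same hours still get spent; they just show up as nobody’s line item.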
To summarize: this is yet another case where doing “the right thing” (in SRE terms) by enforcing a strict “on” / “off” division exposes the reality that a system may be more expensive to operate than previously believed, whereas the common practice of assigning tickets in a sticky manner hides that cost even though it is equally expensive, if not more so, while also causing dissatisfaction.
In the next post, I’ll conclude this series by looking at how fancy software frameworks hide non-programming costs and how those costs are often eaten by people that may not have a choice.