Like it or not, being part of an on-call rotation is a requirement for many software job positions. Fulfilling this requirement should not come with stress but, unfortunately, it often does because most dev teams don’t assign dedicated individuals to mend and improve the on-call logistics. Consequently, many of the on-call policies and processes appear organically as a result of well-intentioned… patchwork, so they aren’t as good as they could be.
One process that can cause friction is how a team schedules its on-call shifts. If the scheduling process is haphazard, team members will end up with shifts that overlap their pre-planned vacations, medical appointments, school events, and so on. Those individuals will then be in an anxiety-inducing position: they’ll have to ask others for swaps, not knowing whether they’ll find a replacement (those swaps may involve a chain of transitive trades), or escalate to management for a solution.
The thing is that this problem is completely avoidable. There are ways to create an on-call schedule that is fair to all team members (including those “high-level” engineers) and that minimizes disruptions to personal plans. As I see it, a team that cares about its people should be open to adopting these improvements. And this is why I’m bringing the ideas below to my current team, because I’m trying to make us all more comfortable when performing our on-call duties 🙂.
Let’s get started. Our task is simple: given the following schedule of already-scheduled shifts from weeks 1 through 7 (where A through G are the names of our team members), how do we staff weeks 8 through 14 with a primary on-call?
Naïve approach: literal rotation
The simplest approach to scheduling on-call shifts is to use a literal rotation: take the list of candidates as they were scheduled in the past and re-schedule them in roughly the same order (maybe you need to remove or add people from the rotation). The schedule could look like this:
I’ve seen this approach in some dev teams at Google, and it is what my current dev team at Microsoft does. It’s a common approach because it’s easy to implement either manually or with a tool, and, for the most part, it seems to work. And I say seems because this approach brings up all of the anxiety-inducing problems described earlier: the schedule will step over people’s non-work plans and will put every individual in the bad position of having to figure out shift exchanges.
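To make the problem concrete, here is a minimal Python sketch of a literal rotation. The names (`literal_rotation`, `conflicts`) and the sample conflict are made up for illustration; this is not the tool any of these teams use.

```python
from itertools import cycle

def literal_rotation(past_order, weeks):
    """Repeat the past order of on-callers over the given weeks."""
    return dict(zip(weeks, cycle(past_order)))

# Team members in the order they were on-call during weeks 1 through 7.
past_order = ["A", "B", "C", "D", "E", "F", "G"]
schedule = literal_rotation(past_order, range(8, 15))

# Hypothetical conflict: C has personal plans during week 10.
conflicts = {"C": {10}}

# The rotation never looks at conflicts, so clashes only show up afterwards.
clashes = [(week, person) for week, person in schedule.items()
           if week in conflicts.get(person, set())]
```

With this sample data, `clashes` ends up containing `(10, "C")`: the rotation happily puts C on-call over their plans, and it is now up to C to find a swap.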
To make this approach less problematic, it is possible to tweak the scheduling process. An option is to generate a rotation that spans several months and announce it well before it goes live. Something like this:
The rationale goes: if we announce the schedule long before it takes effect, people will have sufficient time to plan their time off around it and to ask for trades. In the extreme, the lead time can be as long as 6 months, which is what we currently do in my team to make this process more palatable.
Unfortunately, this tweaked process still doesn’t work well. I don’t know about you, but I cannot plan time off with such a long horizon. Sure, I know some things will happen at specific times, like the Christmas holidays… but not everything. Remember that on-call conflicts are not only vacation-related: there are also medical appointments, school events with the kids, etc. and those are rarely planned months before. Oh, and what if you end up being scheduled over a common holiday? How do you ask for a swap? Almost everyone else will want to be off as well, so it’s hard to ask for a swap with a straight face.
Thinking about scheduling in this way is backwards because it is putting the on-call duties before people’s lives—and that’s a recipe for burnout and attrition. So, what do we do?
The first step in creating a schedule that is sane for its members is to look at the reported availability of those people. If a candidate has expressed that they have a conflict with on-call at some point in time, they should not be put on-call during that time if at all possible.
The process goes like this. First of all, on-call candidates express their constraints somewhere. This “somewhere” could be their personal calendars, a shared calendar, a spreadsheet, even a whiteboard; it doesn’t really matter where, but this information should be visible to everyone in a well-known location. For our example’s sake, the board could look like this:
With this information in place, it is now possible to generate a schedule that avoids all of those conflicts and that gives every individual the same number of shifts:
As you can imagine, doing this by hand is tedious, especially when there are more than 4–5 people involved. A little bit of automation goes a long way, so a key detail here is that the information about conflicts must be programmatically accessible (which rules out the whiteboard).
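As a rough idea of what that automation could look like, here is a small greedy sketch in Python. It assumes conflicts are available as a mapping from each person to the set of weeks they cannot cover; the function name and the sample data are hypothetical. The greedy choice tends to even out the number of shifts, but it is a heuristic and not a guarantee of perfect balance.

```python
def schedule_shifts(people, weeks, conflicts):
    """Assign each week to the available person with the fewest shifts so far.

    Ties are broken by the order of `people`. A week for which nobody is
    available is left as None.
    """
    counts = {person: 0 for person in people}
    schedule = {}
    for week in weeks:
        candidates = [p for p in people if week not in conflicts.get(p, set())]
        chosen = min(candidates, key=lambda p: counts[p], default=None)
        schedule[week] = chosen
        if chosen is not None:
            counts[chosen] += 1
    return schedule

people = ["A", "B", "C", "D", "E", "F", "G"]
# Hypothetical conflicts: A is out in weeks 8 and 9, B is out in week 8.
conflicts = {"A": {8, 9}, "B": {8}}
schedule = schedule_shifts(people, range(8, 15), conflicts)
```

In this example everyone still gets exactly one shift, but A and B are pushed past the weeks they reported as unavailable.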
The obvious question that arises when proposing this kind of process is: what happens if there are no candidates for a shift? Well, that problem existed before too; it’s just that it was invisible to the team. Solving this situation is a management problem. If you are using automation to prepare the schedule, the tool should flag this condition for manual attention and leave the shift unstaffed.
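Flagging that condition can be as simple as the following Python sketch, which lists the weeks in which every candidate has reported a conflict. The function name and the sample data are assumptions made for illustration.

```python
def unstaffable_weeks(people, weeks, conflicts):
    """Return the weeks in which every candidate has reported a conflict."""
    return [week for week in weeks
            if all(week in conflicts.get(p, set()) for p in people)]

# Hypothetical two-person team around the end-of-year holidays.
people = ["A", "B"]
conflicts = {"A": {51, 52}, "B": {52}}
needs_attention = unstaffable_weeks(people, range(50, 53), conflicts)
```

Here `needs_attention` contains only week 52. The tool would leave that shift empty and report it, and a human decides what to do about it.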
With this, we have solved the problem of avoiding personal conflicts. But in this example, we are still planning about 6 months’ worth of on-call shifts at a time, so people must make personal plans with really long lead times, and reacting to unforeseen conflicts and events (such as team attrition) is difficult.
The next tweak to the scheduling process is to shorten the planning horizon: instead of generating schedule batches that include one or more shifts for every person in the team (which is what those empty dividers in the tables above represented, by the way), the shifts are scheduled in small increments on a rolling basis.
In the limit, new shifts could be announced only days before they go live, but it is better to give people about a month of notice. This provides a good balance between knowing when a person will be on-call to juggle their other work responsibilities and the ability for them to make personal plans with shorter lead times.
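The rolling step itself is easy to express in code. The Python sketch below decides which weeks should be scheduled right now, assuming weekly shifts, a notice period of 4 weeks, and batches of 2 shifts; the function name and those default values are my own picks for illustration.

```python
def next_batch(current_week, last_scheduled_week, notice_weeks=4, batch_size=2):
    """Return the weeks to schedule now, or an empty list if it is too early.

    A new batch is scheduled once its first week is at most `notice_weeks`
    away from the current week.
    """
    first = last_scheduled_week + 1
    if first - current_week > notice_weeks:
        return []
    return list(range(first, first + batch_size))

# Weeks 1 through 7 are already staffed.
too_early = next_batch(current_week=3, last_scheduled_week=7)
due_now = next_batch(current_week=4, last_scheduled_week=7)
```

With these defaults nothing is scheduled in week 3, and in week 4 the tool staffs weeks 8 and 9. Changing `notice_weeks` or `batch_size` is how you would trade predictability for flexibility.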
Going back to our example, when we are in week 3 or 4, we could compute the candidates for weeks 8 and 9, and only schedule those, leaving the rest unscheduled until week 5 or 6 rolls in:
Which brings us to a new problem: how do we ensure that the schedule is balanced so that every team member is on-call about the same amount of time as everyone else? If we are only scheduling a few shifts at a time, someone might be unlucky and be scheduled too often, or we might “forget” to schedule someone often enough.
To fix this, the scheduling process must account for historical data and keep a tally of how many times every candidate has been on-call. The length of this lookback horizon depends on how long your on-call shifts are and how stable your team is. You may need to play with different values to see how they impact the generated rotation.
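Here is a minimal Python sketch of such a tally, assuming the history is just a list of past on-callers with the oldest shift first. The function name, the parameters, and the sample history are hypothetical.

```python
from collections import Counter

def pick_next(people, history, lookback):
    """Pick the person with the fewest shifts in the last `lookback` shifts.

    Ties are broken by the order of `people`.
    """
    recent = history[max(0, len(history) - lookback):]
    tally = Counter(recent)
    return min(people, key=lambda p: tally[p], default=None)

people = ["A", "B", "C", "D", "E", "F", "G"]
# Hypothetical history for weeks 1 through 9, oldest first.
history = ["A", "B", "C", "D", "E", "F", "G", "A", "B"]

short_horizon = pick_next(people, history, lookback=4)
long_horizon = pick_next(people, history, lookback=7)
```

With a lookback of 4 shifts the pick is C, because C does not appear in the recent window. With a lookback of 7 everyone appears exactly once, so the tie falls back to list order and the pick is A. This is the kind of difference you will see when playing with the horizon.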
Something else you’ll have to deal with is new team members. They won’t have any on-call history, so a naïve attempt at implementing an incremental scheduling algorithm could put them on-call in consecutive shifts to re-balance the schedule. This is obviously undesirable. One way to address this is to take minimum rest times between shifts into account and accept that new on-callers will be scheduled relatively frequently at the beginning of their role (which isn’t necessarily a bad thing, as that can help with training). Another way is to handle these shifts out of band, scheduling them by hand until the schedule is balanced and the new candidates are sufficiently trained.
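To illustrate the rest-time idea, here is a Python sketch that extends a tally-based picker with a minimum number of shifts between two shifts of the same person. Everything in it (names, parameters, the sample team with a newcomer `N`) is an assumption made for the example.

```python
from collections import Counter

def pick_with_rest(people, history, min_rest):
    """Pick the least-scheduled person who was not on-call in the last
    `min_rest` shifts. Ties are broken by the order of `people`."""
    tally = Counter(history)
    resting = set(history[max(0, len(history) - min_rest):])
    candidates = [p for p in people if p not in resting]
    return min(candidates, key=lambda p: tally[p], default=None)

def extend(people, history, shifts, min_rest):
    """Schedule `shifts` more shifts and return only the new picks."""
    history = list(history)
    picks = []
    for _ in range(shifts):
        chosen = pick_with_rest(people, history, min_rest)
        picks.append(chosen)
        history.append(chosen)
    return picks

# Hypothetical team: N just joined and has no on-call history.
people = ["A", "B", "C", "N"]
history = ["A", "B", "C", "A", "B", "C"]

without_rest = extend(people, history, shifts=4, min_rest=0)
with_rest = extend(people, history, shifts=4, min_rest=2)
```

Without rest times the newcomer is picked for the first two shifts in a row. With a rest of 2 shifts the picks become N, A, B, N: the newcomer is still scheduled often, but never back to back.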
Establishing the ideas presented in this post requires some tooling. We had a rather advanced tool at Google that many teams used, and I’ve been building my own simplified version at Microsoft to help my team first and maybe grow it later.
Yes, such a tool has to implement a constraint-solving algorithm that accounts for unavailability periods, historical data, and rest times, which can be very difficult to develop in general terms. But keep in mind that the output of such a tool does not have to be perfect. The most important thing to do is to respect people’s reported unavailability. Any deficiencies in the generated schedule can be amended by hand, and the tool can then be improved iteratively.
Also note that the larger the team and the shorter the shifts, the easier it is for a simple tool to work. And it’s in those scenarios that automation will help the most anyway, because in small teams, you can plan on-call using a whiteboard.
The way a team schedules its on-call shifts speaks volumes about how much it cares about its people. By accounting for people’s personal life events and by scheduling shifts closer to when they happen, on-call duties will interfere less with people’s lives and they’ll be more satisfied with their job responsibilities.
Note that all of the examples above talked about a schedule with just a primary on-call. That’s… not common given that on-call rotations typically account for a backup as well. Factoring that into the boards and any automation is left as an exercise to the reader 🙃.
And finally, because this is the first post of 2022, have a happy new year!
I’m currently looking for a Senior Software Engineer to help with problems like this and many others in Azure Storage. Please reach out if you want to learn more.