Companies grow, and so do the software projects that support them. It should be no surprise that larger programs require longer build times. And, if I had to guess, you have seen how those build times eventually grow to unbearable levels, reducing productivity and degrading quality. Once that happens, and if you are lucky, some engineers might be tasked with speeding things up.
There are many techniques to reduce build times. Some involve infrastructure improvements such as using tooling that prevents clean builds and adding caching to reuse artifacts—though these only provide one-time payoffs. Others involve carefully analyzing and untangling the dependencies of the project to reduce bloat—a very expensive process. And others can be as mundane as refreshing the hardware of all developers—an option that is only possible every few years.
No matter which path(s) you take, however, the reductions in build time tend to be transient. You see: developers seem to have an upper bound on how much pain they will put up with, and build times will slowly creep up until they hover around this limit. And there is nothing you can do because software bloat is a fact of life and the product must grow continuously. Or is there?
We must realize that there are two kinds of build time growth:
Intentional growth: Intentional changes to the software will require the build time to grow, but because they are intentional, we can evaluate whether they are worth their cost. For example: customers may request a new feature, requiring a bunch of new code; or production issues may dictate that a piece of code be rewritten in a compiled language, introducing extra code to build. Yet these are all explicit choices that we make.
Accidental growth: Poorly executed code changes can also result in increased build times, and these changes creep in very subtly: a misplaced `#include` of a gigantic file inside a common `.h` file that now impacts all C++ compilations in the project; a new dependency on a third-party library that mistakenly pulls in the whole library instead of just the specific chunk that is necessary; you get the idea.
Of these two, it’s the accidental growth that is truly problematic because it’s incredibly easy for it to slip into the project over time. It is the kind of growth that goes unnoticed because it slowly builds up until things are so bad that something must be done. And at that point, there is so much to do that justifying the cost of fixing the issues is nearly impossible.
So the question is: once we have achieved fast build times, how do we keep them fast? Alternatively, if we don’t have fast builds yet but want to stop the bleeding, how do we prevent regressions?
The answer lies in measuring. We must treat build times as yet another health metric for our product along with, for example, customer-observed request latencies or rate of failed requests. Only with metrics (SLIs) and targets (SLOs) around the build cycle can we spot build time growth as it accumulates. Once we spot it, we can dig down and see if it was intentional or accidental. And if it was accidental, we can take corrective action soon after the defect was introduced.
In this post, I’ll describe how we can measure build times and how we can establish SLIs for them. Brace up because it’s not as simple as measuring wall time on a dedicated machine and calculating an average.
There are two key dimensions that tie directly into the efficiency of the build cycle and are the two aspects that will drive the definition of our SLIs:
Build interactivity: Whether a human is actively waiting for the build to complete or not.
Interactive builds are those that have a person waiting for their results. These are typically the longest component of the edit/build/test development cycle. These builds must be fast or else the developer will routinely lose focus and switch to a different task. These context switches are frustrating and expensive, especially when a quick feedback loop is necessary (think troubleshooting a bug).
Non-interactive builds are those that happen “asynchronously” and don’t have anyone actively waiting for them. Nightly builds, for example, have to be published daily, but a day is pretty long and we can afford some variability. More interestingly, though, CI builds that run as part of PR validation are also non-interactive: yes, someone may be waiting for their results to merge the PR, but these workflows are already expected to be slow so people purposely choose to context switch and do something else while they wait.
Build incrementality: Whether the build does the minimal amount of work for a code change or not.
Incremental builds are those that can reuse build artifacts from a previous run. In general, the scope of incremental builds should be minimal. For example, if you touch a source file, you would want a rebuild of that single file, a relink of the intermediate library it belongs to, and a relink/rebundle of the final application that consumes that library. No other source file should be rebuilt.
Clean builds are builds that do much more than that. In the obvious case, if there are no previous artifacts (such as after a `make clean`), the build will have to compile everything. But there are more subtle cases that should be considered clean builds too. Think about what happens when you do a `git pull`: after synchronizing your source tree with other people’s changes, you will have to rebuild much more than just what you are working on. Or think about what happens when you change from debug mode to release mode: you will likely have to rebuild almost everything as well.
These two dimensions are critical in defining our build metrics, but they may not be sufficient. The following are other dimensions you may want to consider. Whether they are important or not will depend on how varied the behavior of your developer population is:
Build configuration: The speed of a build depends on its configuration. Building debug artifacts, for example, is often cheaper CPU-wise than building release artifacts.
Build hardware: The characteristics of the machines your developers use. If, for example, you intentionally have developers building on MacBook Airs (not the M1 kind) and high-performance Linux workstations, you will observe very different build times and may want to track them separately.
I’d suggest you ignore these extra dimensions when you start defining your SLIs though. You may not need them to implement good metrics, and the less complexity involved in them, the better.
Now that we know that interactivity and incrementality are the primary dimensions to back our SLIs, we can start to define them. The goal is to end up with graphs that show us these numbers as a trend line, and some automated alerts that tell us when the goals are exceeded. Here are some examples:
| SLI | SLO |
|-----|-----|
| Interactive clean build time (debug) | ≤ 10 min. |
| Interactive incremental build time (debug) | ≤ 30 sec. |
| CI run build time (debug+release) | ≤ 15 min. |
The specific numbers in the SLOs are irrelevant: they will depend directly on how your builds currently behave or how you’d like them to behave. Choose numbers that make sense in your context.
Note that I have assumed that the developers only use the debug configuration when they do interactive work. This may or may not be true in your case, or you may choose to not care about the distinction. I have also assumed that CI runs always execute clean builds (the common case due to typical limitations in the tooling), but your infrastructure may already be smarter than that.
Anyhow. At first sight, these proposed metrics sound very simple. In reality, though, implementing them in a way that provides stable and useful signal can be very difficult. So let’s dig in.
The first question that invariably arises when proposing metrics like “interactive build time” is: how do we gather such data?
For CI builds, this is trivial. The CI system already publishes artifacts and logs to a central server, so you can easily generate extra information about build statistics from within it. Not very interesting.
The more interesting problem is how to obtain metrics from the interactive builds. These happen on the developers’ machines and, because that’s how things have always been, you likely have very little visibility into them. Yet this must change if you truly want to make a difference for the important use case of speedy build interactions.
What we need is a mechanism to push build-related metrics from the developers’ machines to a central location for further analysis: aka, we need some form of telemetry—as dreaded as it may be. This is not something that build tools provide in the common case. But some do! Bazel, for example, implements the Build Event Protocol (BEP) and can be configured (it is off by default) to publish per-build information to a remote server. You can enable this feature at a repository level via a custom, checked-in `.bazelrc` file.
But, of course, you don’t need Bazel nor the BEP to gather something as simple as build times. In the ugliest form, you could imagine wrapping the build command that your developers use with a call to `Measure-Command` (why, yes, I’ve been learning PowerShell), extracting the printed metrics, and issuing a simple asynchronous HTTP `POST` request to a server to publish the data.
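If PowerShell is not your thing, a rough Python sketch of the same idea might look like this. The telemetry endpoint and the payload fields are made up for illustration; substitute whatever your infrastructure provides:

```python
import json
import subprocess
import sys
import threading
import time
import urllib.request

# Hypothetical collection endpoint; replace with your own service.
TELEMETRY_URL = "https://metrics.example.com/builds"

def post_metrics(payload):
    """Best-effort publish; telemetry must never break or slow the build."""
    try:
        req = urllib.request.Request(
            TELEMETRY_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5)
    except Exception:
        pass  # Spotty network, on a plane... dropping samples is fine.

def main():
    # Run the real build command given as arguments, timing it.
    start = time.monotonic()
    result = subprocess.run(sys.argv[1:])
    payload = {
        "timestamp": time.time(),
        "duration_secs": time.monotonic() - start,
        "command": sys.argv[1:],
        "exit_code": result.returncode,
    }
    # Fire-and-forget so the developer is not kept waiting on the upload.
    threading.Thread(target=post_metrics, args=(payload,), daemon=True).start()
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```

Developers would invoke their builds as, say, `buildwrap make -j8` instead of `make -j8`; the wrapper is transparent except for the background upload.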
The way you go about obtaining build metrics from your developer population is up to you, but you must find out how to do it before continuing. Whatever you do, keep in mind that you don’t need 100% reliability. If you fail to capture some builds from some developers because their network connection was spotty or they were on a plane, that’s likely fine. We will be using aggregates to implement the SLIs and we don’t need perfection.
Build time representation
Once you have a mechanism to collect details on every build that happens in the field and to store those in a central timeseries database, we can finally start computing the metrics we care about.
If you are unsure about what to collect, think about the following when defining your schema:
| Field | Why collect it |
|-------|----------------|
| Timestamp | Needed to generate a timeseries. |
| Duration | Uh, we are tracking build times. |
| User | Needed to remove noise and (possibly) to split CI from interactive builds. |
| Command line | Determine the nature of the invocation (build, test, clean). |
| Machine details | Break down the SLI if we need to. Also useful for troubleshooting. Includes hostname, number of CPUs, etc. |
| Source tree details | Troubleshooting. Includes commit ID, modified files, etc. |
Be careful about the “every build” part though. For example: you may not want “clean” operations (such as `make clean`) to disturb your metrics and so you may decide to weed them out early on in the data collection process. I’d suggest that you collect these invocations too, but then also collect sufficient details (such as the command line) to filter them out later. The reason is that, once you have these data, you can perform other interesting analyses: “how frequently do developers clean their output trees?” or “how many build attempts do they make before running tests?”.
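As a hypothetical illustration of that filter-at-analysis-time approach (the record shape and command patterns are made up; use whatever your build tool actually emits):

```python
def is_clean_operation(command):
    """Heuristic: invocations like `make clean` or `bazel clean` are
    housekeeping, not builds that belong in the latency SLIs."""
    return "clean" in command

# Hypothetical records as collected from developer machines.
builds = [
    {"command": ["make", "-j8", "all"], "duration_secs": 312.0},
    {"command": ["make", "clean"], "duration_secs": 2.1},
]

# Keep everything in storage; exclude clean operations only when
# computing the SLI, so the raw data remains available for other analyses.
sli_builds = [b for b in builds if not is_clean_operation(b["command"])]
```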
And that’s pretty much it. Note the glaring omission of incrementality: apparently, there is nothing in the list of details above to capture it, and that’s on purpose. Avoiding it is the key trick behind this whole post.
So, how do we measure “incremental” and “clean” build times then?
We have to look at aggregates and percentiles. Think about what the development workflow of any given person looks like. In the general case, we can expect developers to do one or two clean builds a day (remember that pulling new sources implies a clean build) and many, many more incremental builds as they go about changing the source code and trying out their changes.
This assumption allows us to model build incrementality via percentiles: the p50 of the build times represents incremental builds, and the p90 or p95 represent clean builds.
I know, I know: this sounds simplistic and unrealistic. I thought so too when I was introduced to this concept. How can you track something with such variability as build times with aggregates alone? In a large codebase, developers work on different parts of the tree, after all, and what one thinks of as an incremental build could be seen as a clean build by others. But these metrics work, and they work well… with some tweaks.
Modeling user behavior
We have a problem though. If we blindly take the p50 or p95 of build times, we might end up with very noisy data: a developer working on base libraries will likely have much longer build times than a developer working on the frontend due to the nature of how much of the project has to be rebuilt in each case. We need to account for these differences in behavior in some way.
This is where you have to be good at statistics. I’m not. Luckily though, I was surrounded by good statisticians in the past who looked into this topic. Their suggestion was to model build times as “percentiles of percentiles”.
Assuming we want to compute our metrics on a daily basis, we start by computing the p95 of the build times of every individual user. This gives us a bunch of data points for every day, and each data point models the clean build time that different users experienced. Once we have these, we then compute the p95 of these p95s for each day, yielding our trend line. In this way, we first model the behavior of each user and then we track the aggregate behavior.
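A minimal sketch of this computation, assuming we already have each user’s build durations for a given day grouped in a dictionary (the data layout is made up for this example):

```python
from statistics import quantiles

def p95(values):
    # Inclusive 95th percentile; needs at least two data points.
    return quantiles(values, n=100, method="inclusive")[94]

def clean_build_sli(builds_by_user):
    """p95 of per-user p95s: model each user first, then aggregate.

    builds_by_user maps a user to that user's build durations (in
    seconds) for the day being computed.
    """
    per_user_p95s = [p95(durations) for durations in builds_by_user.values()]
    return p95(per_user_p95s)

# Two hypothetical users: mostly incremental builds plus one clean build.
day = {
    "alice": [30.0, 31.0, 32.0, 600.0],
    "bob": [20.0, 22.0, 25.0, 400.0],
}
data_point = clean_build_sli(day)
```

The incremental-build SLI would be computed identically but with p50s in place of p95s.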
Because we are modeling users individually, this approach also accounts for different build configurations and different hardware characteristics with no additional effort. But… there might still be a lot of noise in the trend lines, and we have to sort that out.
Weeding out the noise
The noise in the SLIs over build times as defined so far will come from two sources. Without addressing them, we may end up with nonsensical trend lines that do nothing to help us.
The first source of noise, as alluded to in the previous section, comes from major differences in user characteristics. The most obvious one would be building on a laptop vs. building on a workstation. If your developer population shows these differences, it may be a good idea to compute different trend lines for those different categories. After all, the vast difference in performance may imply that you need to set different SLOs for them.
The second source of noise comes from abrupt changes in behavior. Obviously, weekends and weekdays will show varying behavior. But so will Mondays and Fridays because, in a normal workweek, Mondays will include a lot more clean builds than other days of the week: the first thing people do when they come into work is a `git pull` or its equivalent.
To address this second source of noise, we can compute our timeseries using a rolling window. Instead of basing each day’s data point on that day’s builds alone, we use those builds plus all of the builds that happened in the previous days. More specifically: for each day in the timeseries, we calculate the p95 build time of all the builds a user did on that day plus the prior 6 days, and then compute the p95s of those to obtain the single data point for the day. This will give us pretty stable numbers and smooth lines.
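Sketching the rolling computation, with builds stored as `(date, user, duration)` tuples (a made-up schema; in practice this would be a query against your timeseries database):

```python
from datetime import date, timedelta
from statistics import quantiles

def p95(values):
    # Inclusive 95th percentile; needs at least two data points.
    return quantiles(values, n=100, method="inclusive")[94]

def rolling_sli(builds, day, window_days=7):
    """Data point for `day` using a trailing window of builds.

    `builds` is a list of (build_date, user, duration_secs) tuples
    as collected from developer machines.
    """
    cutoff = day - timedelta(days=window_days - 1)
    per_user = {}
    for build_date, user, duration in builds:
        if cutoff <= build_date <= day:
            per_user.setdefault(user, []).append(duration)
    # p95 of the per-user p95s over the whole window.
    return p95([p95(durations) for durations in per_user.values()])

builds = [
    (date(2024, 5, 6), "alice", 30.0),
    (date(2024, 5, 7), "alice", 32.0),
    (date(2024, 5, 8), "alice", 600.0),
    (date(2024, 5, 6), "bob", 20.0),
    (date(2024, 5, 8), "bob", 400.0),
    (date(2024, 4, 1), "carol", 9999.0),  # Outside the window: ignored.
]
data_point = rolling_sli(builds, date(2024, 5, 8))
```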
The downside of a multi-day rolling window is that the SLIs will take longer to pick up regressions. The longer it takes to spot a regression, the harder it will be to pinpoint what caused it because there will be more changes to analyze.
Accounting for all of these details so far, we could further refine the sample metrics we started with as:
| SLI | SLO |
|-----|-----|
| 7-day interactive workstation clean build time (debug) as the p95 of per-user p95s | ≤ 10 min. |
| 7-day interactive workstation incremental build time (debug) as the p50 of per-user p50s | ≤ 30 sec. |
| 7-day interactive laptop clean build time (debug) as the p95 of per-user p95s | ≤ 15 min. |
| 7-day interactive laptop incremental build time (debug) as the p50 of per-user p50s | ≤ 45 sec. |
| 7-day CI run build time (debug+release) as the p95 of CI builds | ≤ 15 min. |
Once again, remember that the numbers here are made up and they are not targets for your project. They may be way too large or too small; only you will know.
Getting to stable lines that are useful will be difficult though, so expect to spend some time experimenting with the data. And once you have the lines, validate them: if you are lucky enough to already have old build time data, go back in time and see if your new SLIs would have spotted previously-known regressions. And if you don’t have this luxury, keep a close eye and see if the lines spot future regressions. (Shhht, maybe you could introduce one intentionally and see what happens? Will the lines pick it up? Will your developers even notice?)
Let’s conclude by looking at what an ideal world might look like. In such a world, you would like to catch build time regressions before they are checked in. You can imagine some kind of PR merge validation check that measures the build time and blocks the merge if it is greater than what it used to be—unless the PR author justified the increase somehow (e.g. with a specially-tagged note in the PR description).
Unfortunately, this can be very hard to achieve because build times are not super-deterministic. Running build time diagnostics on a PR basis may be too costly resource-wise as you might need to run the same build multiple times to get stable measurements. Or you might need dedicated machines for these builds to avoid neighbor interference, and special-cased hardware is expensive and a pain to maintain. But it is doable.
The trickier problem here is that the kinds of validation you would do in CI will represent one build behavior only: a clean build of the whole project, on one kind of hardware, and on one configuration. And while ensuring this specific combination remains fast is great, it does not represent what your developers are experiencing “on the field”. And that’s what matters.
So, at the end of the day, there is no escape from collecting data from interactive builds and measuring how they do.
Phew. That was much longer than anticipated, but we have finally reached the end.
In this post, I have covered metrics to keep tabs on build times and explored how to go about defining a robust SLI that can help you maintain fast builds.
But don’t let this stop your imagination. Build times are just one component of the edit/build/test cycle. Test times (which you can also break down as interactive vs. CI) are something else that should be monitored and tracked. And even more subtle things should too! What about the time it takes to fetch new sources from the server? What about getting your IDE up-to-speed after you have updated your source tree?
Make sure to look at the end-to-end development cycle as if it were a product. Once you see it as a product, you can then define use cases (journeys?) through it and you can measure their value and their success.