The scenario of the day: a binary you deployed to production two months ago has been running fine and dandy since then… until today, when the report of a strange and concerning crash arrived. Nothing really unusual: these things happen all the time and need to be dealt with.

But you, as the proud developer of the software, attempt to reproduce the problem on your own system with the most recent sources and… cannot do so: the issue escapes all your tests (again, not unusual). Is the problem really gone, or have recent changes to the source hidden it by modifying the triggering conditions? Either way, you have to find out, because the second case is usually quite scary.

One way to go about this is to roll the code back to the revision used to build that particular binary and try to reproduce the problem with that build; with some luck, you may be able to. But hang on… do you know exactly what revision to roll back to? Do you know if you should be rolling back anything else?

The binary

These questions highlight that you need a mechanism to map, univocally, your production system to the source code used to build it. An obvious way of doing so is to bundle the revision number (from your version control system) against which the binary was built into the binary itself. Having this number inside the binary ensures you never lose the mapping and gives you easy access to the information you need. (Bonus points if you dump those details into the log as part of the program startup process so that they are available right inside any crash report.)
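As a rough illustration, a build step could generate a small header that the program compiles in and prints at startup. The sketch below is only one possible approach and makes several assumptions that are not part of the scenario above: the sources live in a Git checkout, the binary is written in C, and the names `current_revision`, `render_build_info` and `BUILD_REVISION` are made up for this example.

```python
import subprocess


def current_revision(repo_dir="."):
    """Return (commit hash, dirty flag) for the Git checkout at repo_dir.

    Assumption: the project uses Git. Other version control systems
    need different commands.
    """
    rev = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Any output here means the tree has modified or untracked files,
    # so the commit hash alone does not describe what was built.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return rev, bool(status.strip())


def render_build_info(revision, dirty):
    """Render a C header that embeds the revision as a string constant.

    BUILD_REVISION is a hypothetical macro name chosen for this sketch.
    """
    suffix = "-dirty" if dirty else ""
    return (
        "/* Generated at build time; do not edit. */\n"
        f'#define BUILD_REVISION "{revision}{suffix}"\n'
    )
```

The build would call `render_build_info(*current_revision())`, write the result to a header, and the program would log `BUILD_REVISION` on startup. Marking builds from a modified tree as "dirty" matters because such a binary cannot be mapped back to any single revision.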

The environment

Alright. So you have recorded the revision number used to build your binary; great. Is that enough? What about all those shared libraries your program depends on? What about any external programs you may be invoking? What about the compiler? All these external dependencies can cause your program to misbehave or can hide the issue being looked at if they change.

The answers to these questions are tricky and the way you go about answering them really depends on the structure of your production environment.

One way to capture these dependencies is to record the full list of installed packages, both at the time the binary was built (so you capture details like the compiler) and at the time the binary was run. Note that compilation and execution will usually happen on different systems, so recording both lists is important. If you have these lists at hand, you will be able to roll back any offending components to the right version and have a higher chance of reproducing the issue.
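Once such lists exist, comparing a recorded list against the current state of a system shows which components need to be rolled back. The following is a minimal sketch that assumes each list was saved as plain text with one whitespace-separated `name version` pair per line; that format and the function names `parse_manifest` and `diff_manifests` are assumptions for illustration, not the output of any particular packaging tool.

```python
def parse_manifest(text):
    """Parse 'name version' lines into a {name: version} dictionary.

    Lines with fewer than two fields are skipped.
    """
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def diff_manifests(recorded, current):
    """Return {name: (recorded_version, current_version)} for differences.

    A version of None means the package is absent on that side.
    """
    changes = {}
    for name in sorted(set(recorded) | set(current)):
        before, after = recorded.get(name), current.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

Running `diff_manifests` once for the build-time list and once for the run-time list gives the set of packages whose versions have drifted since the binary was produced and deployed.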

Unfortunately, while this may work reasonably well on the majority of Linux distributions, it may not on BSD systems. The reason is that most packaging systems for Linux build packages once and only once for any specific package version. In BSD packaging systems (e.g. ports and pkgsrc), however, it is common to rebuild all packages from scratch for various reasons, and doing so will not change the packages’ version numbers; those numbers only change when a developer explicitly modifies the package. In general, such rebuilds will be similar enough to each other, but this is not guaranteed, because details of the build host can influence them significantly.

TL;DR

This is all tricky, as you can see. However, if you don’t have the means to record all the relevant details, start easy! Just record the revision number of the source tree used to build the binary in a durable place and ignore the rest. This alone will save you a lot of time.

This article is part number 6 of 7 of the Production software series.