Quoting the Wikipedia as is becoming customary: logging is the cutting, skidding, on-site processing, and loading of trees or logs onto trucks or skeleton cars.
Huh, wait… I think I got that definition wrong. Let me try my own:
Logging is the recording of application activity on-the-fly onto a reliable storage system. Having this functionality in your software is something you don’t think about until you need to debug an issue and, well, then it is too late. When your application crashes, especially in production, the more information you have access to, the better. A detailed record of the program activity that led to the crash is often incredibly insightful. (Hint: compare to flight data recorders.)
You’d think that recording “everything” is the best course of action: after all, having all details available can really point you to the problem! However, what does “everything” mean? And, also, can you tolerate the penalty incurred by the logging operations when you are capturing that much data? Keep in mind that, usually, the performance-critical sections of your program are the most interesting ones to monitor… but adding logging calls to those pieces of code can kill their performance.
So let’s see what to do. The sections below cover various details involved in the logging “infrastructure” that your application should have: how to organize the log, how to present it and how to record it.
Every item logged by your application should have a severity level attached to it. The most common severity levels are, in order of importance:
- ERROR: The program has reached an invalid state and cannot continue. This kind of message is usually followed by the immediate termination of the program.
- WARNING: The program has detected an invalid state but can continue without having to exit (either by recovering from the problem or by ignoring it). For example: in the case of a server, the server may encounter an erroneous situation while serving the request of a specific user. This condition is an error for that specific request and does not necessarily have to cause the server to abort: only that specific user request should to be cancelled.
- INFO: The program is performing just fine and is just recording informational messages to help the user understand what is going on. Usually, the longer a part of the program takes, the more information it should dump to let the user know that the process is not stuck.
- DEBUG: The program is still performing just fine and is dumping more information than usual that may help the user (developer, sysadmin, etc.) troubleshoot a problem that the process may be encountering.
The classification of messages in these groups is often easy. Also, pretty much all logging modules out there implement the levels detailed above and more. As an example, take a look at the Python logging module, the Google logging library for C++, Boost.Log or the logging library provided by Kyua.
But why do log levels exist after all? I can imagine two reasons:
The most obvious reason is the classification of the messages based on their criticality: an INFO-level message is nothing to worry about while a WARNING-level message might be. Every class of user of your program may be interested in different aspects of the log: e.g. a programmer may want to get down to the DEBUG levels while an end user may only want to see actual ERRORs.
The other, not-so-obvious reason is that logging is an expensive thing to do: formatting log messages and persisting them has a cost. A high rate of logging operations can incur a significant performance penalty in your program.
Both of these reasons brings us to the next point, which is providing a mechanism to filter out undesired log entries.
Log level selection
Alright, so you have sprinkled log messages throughout your codebase and have carefully classified them according to their severity level. Your user won’t be happy if you dump all those messages unconditionally: again, recording all these log messages has a cost and that cost can be significant. How do you improve the user experience?
Easy: allow the user to select the maximum severity level they wish to see. For example, a setting like
--loglevel=warning would tell your program to only print ERROR-level and WARNING-level messages while discarding all others, hopefully minimizing the run-time impact of the calls to the logging library for the latter.
All levels filtered out should be ignored, and the cheaper you can make this message discarding operation at run-time, the better. As an example of this, I’d like to refer you to the Python logging library again. A common mistake when invoking its functionality is by doing the following:
logging.info('Received request from client %s, port %d' % (client, port))
This call will work but it is inefficient. Before the call to
logging.info can be made, Python has to construct the single argument given to it, which involves formatting a string — an expensive operation. Instead, the way to call such functions in Python is:
logging.info('Received request from client %s, port %d', client, port)
which delays the string formatting until the
logging library chooses to perform it. If logging for this particular severity level is disabled, then the formatting never happens and the cost of the call is reduced to hopefully insignificant levels.
Do do you persist the log entries? Of course yes: if the messages exist but it is not possible to read them, they are worse than useless. So how do you persist them? A few ideas:
- Console: print the log to the
stderrof your program. In some cases, like in a server process, this may be a reasonable idea. In other cases, like purely user-facing tools, it might not be because log messages are not really targetted to the user and they may pollute your interface.
- File: send the log to a specific file on disk. Note that this does not mean the same file for every single execution: you could (and should) timestamp the files. Why? Because you don’t want a crash-looping process from overriding the log messages over and over again. Maybe the log of the first crash is the only one that holds enough details to understand what led to the problem!
- syslog: There are many variants of this, including the radically-different systemd journal (for which I have been unable to find any up-to-date “web page” as a reference). Using a logger service gives the system administrator quite a detailed mechanism to control where logs go. This is a pretty good alternative for server processes but not such a good idea for interactive command-line applications. (Users are not able to customize these services for their own needs.)
The whole point of recording the log is to have it available when you need it, and that means you have to record it before you even know it will be needed. In other words: if you are interested in seeing the log of the program when it crashes, the log has to be enabled by default on every invocation so that you can get such details post-facto.
And even more: if you record the log by default behind the user’s back, you have to tell the user where such logs exist whenever they need them. For example, if the program crashes, you could tell the user that a log has been recorded and ask him to send such log to your bug tracker. Kyua does precisely that.
Everything described in this post applies pretty much to any application. However, logging mechanisms are especially important for programs deployed to production. Production programs run remotely and “out of your sight” so, when they fail, you really need to have access to their logs to understand what happened. Make sure you have access to those precious logs.