One of the pending to-do entries for ATF 0.4 is (was, mostly) the ability to define a timeout for a test case, after which it is forcibly terminated. The idea behind this feature is to prevent broken tests from stalling the whole test suite run, something that is already needed by the factor(6) tests in NetBSD. Given that I want to release this version this weekend, I decided to work on this instead of delaying it because... you know, this sounds pretty simple, right? Hah!

What I did first was to implement this feature for C++ test programs and add tests for it. So far, so good. It was indeed easy to do: just set an alarm in the test program driver and, when it fires, kill the subprocess that is executing the current test case. Then log an appropriate error message.

The tests for this feature deserve some explanation. Each one sets a timeout and then makes the test case's body sleep for a period of time. I try different values for the two timers; if the timeout is shorter than the sleeping period, the test case must fail, and if it does not, there is a problem.
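To make the two paragraphs above concrete, here is a minimal C++ sketch of an alarm-based driver together with the sleep-based check. It is not ATF's actual code: Outcome, run_with_timeout and sleep_under_timeout are names made up for this illustration, and the sketch assumes a POSIX system where a pid_t fits in a sig_atomic_t.

```cpp
#include <cerrno>
#include <functional>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class Outcome { Passed, Failed, TimedOut };

// PID of the running test case, read by the SIGALRM handler.
// Assumption: pid_t fits in sig_atomic_t (true on common platforms).
static volatile sig_atomic_t g_child = 0;
static volatile sig_atomic_t g_fired = 0;

static void on_alarm(int) {
    g_fired = 1;
    if (g_child > 0)
        kill(static_cast<pid_t>(g_child), SIGKILL);  // kill() is async-signal-safe
}

// Hypothetical driver: runs `body` in a subprocess and kills that subprocess
// if it is still running after `timeout_secs` seconds (0 disables the alarm).
Outcome run_with_timeout(const std::function<int()>& body, unsigned timeout_secs) {
    struct sigaction sa = {}, old_sa = {};
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, &old_sa);

    g_fired = 0;
    const pid_t pid = fork();
    if (pid == -1) {
        sigaction(SIGALRM, &old_sa, nullptr);
        return Outcome::Failed;
    }
    if (pid == 0)
        _exit(body());  // child: execute the test case body

    g_child = pid;
    alarm(timeout_secs);

    int status = 0;
    pid_t r;
    do {
        // If SIGALRM interrupts the wait, the handler has already sent
        // SIGKILL, so simply retry until the child is reaped.
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    alarm(0);
    g_child = 0;
    sigaction(SIGALRM, &old_sa, nullptr);

    if (r == -1)
        return Outcome::Failed;
    if (g_fired && WIFSIGNALED(status))
        return Outcome::TimedOut;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return Outcome::Passed;
    return Outcome::Failed;
}

// The check described above: a body that sleeps `sleep_secs` seconds,
// executed under a timeout of `timeout_secs` seconds.
Outcome sleep_under_timeout(unsigned sleep_secs, unsigned timeout_secs) {
    return run_with_timeout([sleep_secs] { sleep(sleep_secs); return 0; },
                            timeout_secs);
}
```

With these made-up names, sleep_under_timeout(3, 1) should report a timeout, while sleep_under_timeout(0, 5) should pass. Note that this sketch only kills the one subprocess it forked, which is all the C++ case needs.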

The next step was to implement this in the shell interface, and this is where things got tricky. I did a quick and dirty implementation, and it seemed to make the same tests I added for the C++ interface pass. However, when running the bootstrap test suite, it got stalled at the cleanup part. Upon further investigation, I noticed that there were quite a lot of sleep(1) processes running when the test suite was stalled, and killing them explicitly let the process continue. You have probably already noticed where the problem was.

When writing a shell program, you are forking and executing external utilities constantly, and sleep(1) is one of them. It turns out that in my specific test case, the shell interpreter is just waiting for the sleep subprocess to finish (whereas in the C++ version everything happens in a single process). And killing a process does not kill its children. There you go. My driver was just killing the main process of the test case, but not everything else that was running; hence, it did not die as expected, and things got stalled until the subprocesses also died.
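The claim that killing a process leaves its children running is easy to check with a small experiment. The following is a minimal C++ sketch, not code from ATF: the function name is made up, the intermediate process stands in for the shell interpreter, and the innermost one stands in for sleep(1).

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if a grandchild is still alive after its parent was killed.
bool grandchild_survives_parent_kill() {
    int fds[2];
    if (pipe(fds) == -1)
        return false;

    const pid_t child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (child == 0) {
        // Plays the role of the shell interpreter running the test case.
        close(fds[0]);
        const pid_t grandchild = fork();
        if (grandchild == 0) {
            // Plays the role of sleep(1).
            close(fds[1]);
            sleep(30);
            _exit(0);
        }
        // Report the grandchild's PID, then block on it as a shell would.
        if (write(fds[1], &grandchild, sizeof(grandchild)) !=
            static_cast<ssize_t>(sizeof(grandchild)))
            _exit(1);
        close(fds[1]);
        waitpid(grandchild, nullptr, 0);
        _exit(0);
    }

    close(fds[1]);
    pid_t grandchild = -1;
    const ssize_t n = read(fds[0], &grandchild, sizeof(grandchild));
    close(fds[0]);

    // Kill only the test case's main process, as the naive driver did.
    kill(child, SIGKILL);
    waitpid(child, nullptr, 0);

    // Signal 0 sends nothing; it only checks that the process still exists.
    const bool alive = n == static_cast<ssize_t>(sizeof(grandchild)) &&
                       grandchild > 0 && kill(grandchild, 0) == 0;
    if (grandchild > 0)
        kill(grandchild, SIGKILL);  // clean up the leftover process
    return alive;
}
```

The surviving grandchild here is the analogue of the stray sleep(1) processes seen while the test suite was stalled.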

Solving this was the fun part. The only effective way to make this work is to kill the test case's main process and, recursively, all of its children. But killing a tree of processes is not an easy thing to do: there is no system interface to do it, there is no portable interface to get a list of children, and I'm still unsure whether this can be done without race conditions. I'll save the explanation of the recursive-kill algorithm I'm using for a future post.
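For comparison only, here is a different and more limited technique than the recursive kill mentioned above: run the test case in its own process group and signal the whole group at once. This is a minimal C++ sketch with a made-up function name, not what ATF does, and it only reaches descendants that stay in that group; any process that calls setpgid or setsid escapes it, so it does not really kill a tree.

```cpp
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if signalling the process group also terminated a grandchild.
bool group_kill_reaches_grandchild() {
    int fds[2];
    if (pipe(fds) == -1)
        return false;

    const pid_t child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (child == 0) {
        setpgid(0, 0);  // become the leader of a new process group
        close(fds[0]);
        if (fork() == 0) {
            // The grandchild inherits the group and the pipe's write end.
            const char ready = 'r';
            if (write(fds[1], &ready, 1) != 1)
                _exit(1);
            sleep(30);
            _exit(0);
        }
        sleep(30);
        _exit(0);
    }
    setpgid(child, child);  // repeated in the parent to close the startup race
    close(fds[1]);

    // Wait until the grandchild reports that it is running.
    char byte = 0;
    const bool started = read(fds[0], &byte, 1) == 1;

    kill(-child, SIGKILL);  // negative PID: signal every member of the group
    waitpid(child, nullptr, 0);

    // The write end closes only once every holder has died, so seeing
    // end-of-file within five seconds means the grandchild is gone too.
    struct pollfd pfd = {fds[0], POLLIN, 0};
    const bool hung_up = started && poll(&pfd, 1, 5000) == 1 &&
                         read(fds[0], &byte, 1) == 0;
    close(fds[0]);
    return hung_up;
}
```

The pipe is used to detect the grandchild's death because it has been reparented by then, so the caller can no longer wait on it directly.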

After some days of work, I've got this working under Mac OS X and also have automated tests to ensure that it really works (these were the hardest part by far). But as I foresaw, it fails miserably under NetBSD: the build was broken, which was easy to fix, but now it also fails at runtime, something that I have not diagnosed yet. Aah, the joys of Unix...