I'm learning Python these days while writing a script to automate the testing of ATF under multiple virtual machines. I had this code in a shell script, but it is so ugly and clumsy that I don't even dare to add it to the repository. Hopefully, the new version in Python will be more robust and versatile enough to be published.

One of the things I've been impressed by is the subprocess module and, in particular, its Popen class. By using this class, it is trivial to spawn subprocesses and perform some IPC with them. Unfortunately, Popen does not provide any way to silence the output of the children. As I see it, it'd be nice if you could pass an IGNORE flag as the stdout/stderr behavior, much like you can currently set those to PIPE or set stderr to STDOUT.

The following trivial module implements this idea. It extends Popen so that callers can pass the IGNORE value to the stdout/stderr arguments. (Yes, it is trivial, but it is also some of the first Python code I have written, so it may contain obviously non-Pythonic, ugly things.) The idea is that it exposes the same interface so that it can be used as a drop-in replacement. OK, OK, it lacks some methods and the constructor does not match the original signature, but this is enough for my current use cases!

    import subprocess

    IGNORE = -3
    STDOUT = subprocess.STDOUT
    assert IGNORE != STDOUT, "IGNORE constant is invalid"

    class Popen(subprocess.Popen):
        """Extension of subprocess.Popen with built-in support for silencing
        the output channels of a child process."""

        __null = None

        def __init__(self, args, stdout = None, stderr = None):
            subprocess.Popen.__init__(self, args = args,
                                      stdout = self._channel(stdout),
                                      stderr = self._channel(stderr))

        def __del__(self):
            self._close_null()

        def wait(self):
            r = subprocess.Popen.wait(self)
            self._close_null()
            return r

        def _null_instance(self):
            # Lazily open /dev/null and reuse it for both output channels.
            if self.__null is None:
                self.__null = open("/dev/null", "w")
            return self.__null

        def _close_null(self):
            if self.__null is not None:
                self.__null.close()
                self.__null = None
            assert self.__null is None, "Inconsistent internal state"

        def _channel(self, behavior):
            # Map IGNORE to the /dev/null file; pass any other value through.
            if behavior == IGNORE:
                return self._null_instance()
            else:
                return behavior

By the way, somebody else suggested this same thing a while ago. Don't know why it hasn't been implemented in the mainstream subprocess module.
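For completeness, here is a minimal usage sketch of the wrapper above. It is not from the original post: the command being run and the failure handling are arbitrary and only serve to show the IGNORE flag in action, assuming the class above is in scope.

    # Spawn a noisy command but discard everything it prints; only the exit
    # status is of interest here.
    child = Popen(["make", "check"], stdout = IGNORE, stderr = IGNORE)
    if child.wait() != 0:
        raise RuntimeError("child process failed")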
January 5, 2009 · Tags: process, python
Now that you know the procedure to kill a process tree, I can explain how the automated tests for this feature work. In fact, writing the tests was the hardest part, due to all the race conditions that popped up and to my rusty knowledge of tree algorithms. Basically, the testing procedure works like this:

1. Spawn a complete tree of processes based on a configurable degree D and height H.
2. Make each child tell the root process its PID, so that the root process can keep a list of all its children, whether direct or indirect, for control purposes.
3. Wait until all children have reported their PID and are ready to be killed.
4. Execute the kill-tree algorithm on the root process.
5. Wait until the children have died.
6. Check that none of the PIDs gathered in point 2 are still alive (they could be, reparented to init(8), if they were not properly killed). If some are, the recursive kill failed.

The tricky parts were 3 and 5.

In point 3, we have to wait until all children have been spawned. Doing so for the direct children is easy because we spawned them, but the indirect ones are a bit more difficult. What I do is create a pipe for each of the children that will be spawned (because given D and H I can know how many nodes there will be), and then each child uses the appropriate pipe to report its PID to the parent when it has finished initialization and is thus ready to be safely killed. The parent then just reads from all the pipes and gets all the PIDs.

But what do I mean by safely killed? Preliminary versions of the code just ran through the children's code and then exited, leaving them in zombie status. This worked in some situations but broke in others. I had to change this to block all children in a wait loop and then, when killed, take care to do a correct wait for all of their respective children, if any. This made sure that all children remained valid until the attempt to kill them.

In point 5, we have to wait until the direct children have returned, so that we can be sure that the signals were delivered and processed before attempting to see if any process is left. (Yes, if the algorithm fails to kill them, we will be stalled at that point.) Given that each child can be safely killed as explained above, this wait recurses along the whole process tree, making sure that everything is cleaned up before we do the final checks for non-killed PIDs.

This all sounds very simple and, in fact, looking at the final code, it is. But it certainly was not easy to write, basically because the code grew in ugly ways and the algorithms were much more complex than they ought to be.
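To make the pipe trick in point 3 more concrete, here is a rough sketch in Python; it is not the actual ATF test code, and the degree and height values are arbitrary. The root allocates one pipe per future descendant before forking; because every descendant inherits all of the pipes, even indirect children can report their PID straight back to the root.

    import os, signal

    DEGREE, HEIGHT = 2, 2                      # D and H; tiny values for the demo

    def descendants(height):
        # Number of processes below a node that still has 'height' levels to spawn.
        return sum(DEGREE ** i for i in range(1, height + 1))

    # One pipe per future descendant, created before any fork() so that every
    # process in the tree inherits all of them.
    pipes = [os.pipe() for _ in range(descendants(HEIGHT))]

    def spawn(base, height):
        # Fork DEGREE children; each reports its PID on its assigned pipe,
        # spawns its own subtree and then blocks, waiting to be killed.
        for k in range(DEGREE):
            slot = base + k * (1 + descendants(height - 1))
            if os.fork() == 0:                 # child process
                os.write(pipes[slot][1], ("%d\n" % os.getpid()).encode())
                if height > 1:
                    spawn(slot + 1, height - 1)
                signal.pause()                 # stay alive until a signal arrives
                os._exit(0)

    spawn(0, HEIGHT)
    ready = [int(os.read(r, 32).strip()) for r, w in pipes]   # step 3: collect PIDs
    # ...run the kill-tree algorithm on our own PID here, wait for the direct
    # children, and finally check that none of the PIDs in 'ready' is still alive.

The sketch omits the correct wait loop described above; the real children also have to reap their own descendants when killed.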
January 17, 2008 · Tags: atf, process
Yesterday I mentioned the need for a way to kill a tree of processes in order to effectively implement timeouts for test cases. Let's see how the current algorithm in ATF works:

1. Stop the root process by sending it a SIGSTOP, so that it cannot spawn any new children while being processed.
2. Get the whole list of active processes and filter it to keep only those that are direct children of the root process.
3. Iterate over all the direct children and repeat from 1, recursively.
4. Send the real desired signal (typically SIGTERM) to the root process.

There are two major caveats in the above algorithm. The first is in point 2: there is no standard way to get the list of processes of a Unix system, so I have had to code three different implementations so far for this trivial requirement: one for NetBSD's KVM, one for Mac OS X's sysctl kern.proc node and one for Linux's procfs.

The second, and the worst one, comes in point 4. Some systems (Linux and Mac OS X so far) do not seem to allow one to send a signal to a stopped process. Well, strictly speaking they allow it, but the second signal seems to be simply ignored, whereas under NetBSD the process' execution is resumed and the signal is delivered. I do not know which behavior is right. If we cannot send the signal to the stopped process, we can run into a race condition: we have to wake it up by sending a SIGCONT and then deliver the signal, but in between these two events the process may have spawned new children that we are not aware of.

Still, being able to send a signal to a stopped process does not completely resolve the race condition. If we are sending a signal that the user can reprogram (such as SIGTERM), the process may fork another one before exiting, and thus we would not kill that one. But... well... this is impossible to resolve with the existing kernel APIs as far as I can tell.

One solution to this problem is to kill a timed-out test with SIGKILL instead of SIGTERM. SIGKILL works in any case because it means "die immediately", without giving the process a chance to mess with it. Therefore SIGCONT would not be needed at all (you can simply kill a stopped process and it will die immediately, as expected) and the process would not have a chance to spawn any more children after it had been stopped. Blah, after writing this I wonder why I went with all the complexity of dealing with signals that are not SIGKILL... call it over-engineering if you want...
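To illustrate the four steps, here is a rough Python sketch of the same idea. It is not the actual ATF implementation: child discovery is done by parsing ps(1) output (assuming a ps that accepts "-A -o pid=,ppid="), whereas ATF uses NetBSD's KVM, Mac OS X's sysctl or Linux's procfs, and the trailing SIGCONT is the workaround discussed above for systems that do not act on signals sent to a stopped process.

    import os, signal, subprocess

    def direct_children(pid):
        # List the PIDs whose parent is 'pid' by parsing ps(1) output.
        out = subprocess.check_output(["ps", "-A", "-o", "pid=,ppid="])
        children = []
        for line in out.decode().splitlines():
            child, parent = map(int, line.split())
            if parent == pid:
                children.append(child)
        return children

    def kill_tree(pid, sig = signal.SIGTERM):
        os.kill(pid, signal.SIGSTOP)            # 1: freeze the (sub)tree root
        for child in direct_children(pid):      # 2-3: recurse into the children
            kill_tree(child, sig)
        os.kill(pid, sig)                       # 4: send the signal we really want
        os.kill(pid, signal.SIGCONT)            # resume it so the pending signal
                                                # is acted upon

Sending the real signal before the SIGCONT keeps it pending while the process is still stopped, which narrows (but, as noted above, does not eliminate) the window in which it could fork again.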
January 16, 2008 · Tags: atf, portability, process
One of the pending to-do entries for ATF 0.4 is (was, mostly) the ability to define a timeout for a test case, after which it is forcibly terminated. The idea behind this feature is to prevent broken tests from stalling the whole test suite run, something that is already needed by the factor(6) tests in NetBSD. Given that I want to release this version this coming weekend, I decided to work on it instead of delaying it because... you know, this sounds pretty simple, right? Hah!

What I did first was to implement this feature for C++ test programs and add tests for it. So far, so good. It effectively was easy to do: just program an alarm in the test program driver and, when it fires, kill the subprocess that is executing the current test case. Then log an appropriate error message.

The tests for this feature deserve some explanation. What I do is program a timeout and then make the test case's body sleep for a period of time. I try different values for the two timers and, if the timeout is smaller than the sleeping period, the test case must fail due to the timeout; if it does not, there is a problem.

The next step was to implement this in the shell interface, and this is where things got tricky. I did a quick and dirty implementation, and it seemed to make the same tests I added for the C++ interface pass. However, when running the bootstrap test suite, it got stalled at the cleanup part. Upon further investigation, I noticed that there were quite a lot of sleep(1) processes running when the test suite was stalled, and killing them explicitly let the process continue.

You probably noticed where the problem was already. When writing a shell program, you are forking and executing external utilities constantly, and sleep(1) is one of them. It turns out that in my specific test case, the shell interpreter was just waiting for the sleep subprocess to finish (whereas in the C++ version everything happens in a single process). And killing a process does not kill its children. There you go: my driver was just killing the main process of the test case, but not everything else that was running; hence, it did not die as expected, and things got stalled until the subprocesses also died.

Solving this was the fun part. The only effective way to make this work is to kill the test case's main process and, recursively, all of its children. But killing a tree of processes is not an easy thing to do: there is no system interface for it, there is no portable interface to get the list of children of a process, and I am still unsure whether this can be done without race conditions. I reserve the explanation of the recursive-kill algorithm I'm using for a future post.

After some days of work, I've got this working under Mac OS X and have also got automated tests to ensure that it effectively works (which were the hardest part by far). But, as I foresaw, it fails miserably under NetBSD: the build was broken, which was easy to fix, but now it also fails at runtime, something that I have not diagnosed yet. Aah, the joys of Unix...
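The driver code is not shown in the post, but the alarm-and-kill idea translates roughly into the following Python sketch; the timeout value and the command are made up for illustration. It also shows why the naive approach breaks for the shell interface: killing the main process does not touch the sleep(1) it spawned.

    import signal, subprocess

    TIMEOUT = 5                               # seconds; arbitrary for the demo

    def run_with_timeout(argv):
        # Run argv and kill it if it does not finish within TIMEOUT seconds.
        child = subprocess.Popen(argv)

        def on_alarm(signum, frame):
            child.kill()                      # reaches the main process only!

        signal.signal(signal.SIGALRM, on_alarm)
        signal.alarm(TIMEOUT)                 # arm the timer
        status = child.wait()
        signal.alarm(0)                       # disarm it if the child was quicker
        return status

    # A shell "test case" hiding behind an external sleep(1): the shell gets
    # killed after TIMEOUT seconds, but the sleep keeps running, reparented
    # to init(8).
    run_with_timeout(["sh", "-c", "sleep 60; echo done"])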
January 15, 2008 · Tags: atf, netbsd, process