In the previous post, we saw why waiting for a process group to terminate is important (at least in the context of Bazel), and we also saw why this is a difficult thing to do in a portable manner. So today, let’s dive into how to do this properly on a Linux system.

On Linux, we have two routes: using the child subreaper feature or using PID namespaces. We’ll focus on the former because that’s what we’ll use to fix the process wrapper (#10245) [1], and because it is sufficient to fully address our problem.

But before going into what the child subreaper feature does, let’s recap what happens to orphaned processes.

When a process terminates, the kernel must keep some of its information around—the exit code, among other details—until its parent collects such details. The process is said to be in the zombie state at this point because the process still exists on the system tables but is dead. When the parent collects the details of the process via a call to wait(2), the kernel doesn’t need to keep the state of the process any longer and the zombie disappears.
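To make the zombie state concrete, here is a minimal sketch that forks a child, lets it exit, and peeks at its state before and after collecting it. It assumes a Linux system with /proc mounted; the helper names proc_state and observe_zombie are made up for this illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>

#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

// Returns the state character that /proc/<pid>/stat reports for pid ('Z'
// means zombie), or '?' if the entry cannot be read or parsed.
static char proc_state(pid_t pid) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    FILE* f = fopen(path, "r");
    if (f == NULL) return '?';
    char state = '?';
    // The line looks like "pid (comm) state ...".  This simple pattern
    // assumes the command name contains no ')' character.
    if (fscanf(f, "%*d (%*[^)]) %c", &state) != 1) state = '?';
    fclose(f);
    return state;
}

// Forks a child that exits right away with exit_code, records the state the
// kernel reports before we collect it, and then reaps it with waitpid(2).
// Returns the collected exit code, or -1 on error.
static int observe_zombie(int exit_code, char* state_before_wait) {
    assert(state_before_wait != NULL);

    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) _exit(exit_code);

    // Poll (for up to ~2 seconds) until the child is dead but uncollected.
    char state = '?';
    for (int i = 0; i < 200 && state != 'Z'; i++) {
        struct timespec ts = {0, 10 * 1000 * 1000};  // 10ms.
        nanosleep(&ts, NULL);
        state = proc_state(pid);
    }
    *state_before_wait = state;

    // Collecting the exit status is what makes the zombie disappear.
    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling observe_zombie(7, &state) from a main function should hand back the child's exit code, with state holding whatever the kernel reported while the child was still waiting to be collected.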

Processes also have a pointer to their parent (the parent PID, or PPID). So… what happens if that parent becomes a zombie and then disappears? Its children become orphans, and they must be assigned to a new parent to keep the PPID valid. The traditional solution in Unix is to give those processes to PID 1, or init(8), which in turn takes care of clearing their state when they terminate. The way the init(8) daemon does this is with a “reaper” loop that repeatedly calls wait(2) to discard the state of these adopted processes when they terminate.
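A reaper loop of this kind can be sketched in a few lines. This is only an illustration, not how any particular init(8) is implemented: a real init never stops looping, whereas this version returns once no children are left. The names reap_all_children and spawn_exiting_children are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>

#include <assert.h>
#include <errno.h>
#include <unistd.h>

// Starts up to n children that exit immediately, so that there is something
// to reap.  Returns how many children were actually started.
static int spawn_exiting_children(int n) {
    assert(n >= 0);
    int started = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1) break;
        if (pid == 0) _exit(0);
        started++;
    }
    return started;
}

// Collects children as they terminate until none remain.  Returns the number
// of children reaped, or -1 if wait(2) fails for an unexpected reason.
static int reap_all_children(void) {
    int reaped = 0;
    for (;;) {
        if (wait(NULL) == -1) {
            if (errno == EINTR) continue;  // Interrupted by a signal; retry.
            // ECHILD means there is nothing left to wait for.
            return errno == ECHILD ? reaped : -1;
        }
        reaped++;
    }
}
```

After spawn_exiting_children(3), a call to reap_all_children() should report three collected children, and a second call should report none because nothing is left to collect.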

The child subreaper feature allows us to change the process that becomes the new parent. By using this feature on a given process, we tell the kernel that, when a descendant of that process (a grandchild, for example) is orphaned, we want to take ownership of it. We have essentially replaced init(8)’s responsibilities for a subset of the process tree.
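To see the reparenting in action, here is a small sketch in which a grandchild reports who its parent is after the process in the middle exits. It assumes Linux 3.4 or later, where PR_SET_CHILD_SUBREAPER is available; the function name orphan_parent_with_subreaper is invented for this example.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <assert.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>

// Becomes a child subreaper, spawns a child that spawns a grandchild, lets
// the child exit, and returns the PPID that the orphaned grandchild observes
// afterwards.  Returns -1 on error.
static pid_t orphan_parent_with_subreaper(void) {
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == -1) return -1;

    int fds[2];
    if (pipe(fds) == -1) return -1;

    pid_t child = fork();
    if (child == -1) return -1;
    if (child == 0) {
        pid_t middle = getpid();
        pid_t grandchild = fork();
        if (grandchild == -1) _exit(1);
        if (grandchild == 0) {
            // Poll until the middle process is gone and we have a new parent.
            while (getppid() == middle) {
                struct timespec ts = {0, 1000 * 1000};  // 1ms.
                nanosleep(&ts, NULL);
            }
            pid_t new_parent = getppid();
            if (write(fds[1], &new_parent, sizeof(new_parent)) == -1) _exit(1);
            _exit(0);
        }
        _exit(0);  // Exit right away to orphan the grandchild.
    }

    close(fds[1]);
    pid_t reported = -1;
    ssize_t n = read(fds[0], &reported, sizeof(reported));
    close(fds[0]);

    // Reap the direct child and the grandchild we have just adopted.
    while (wait(NULL) != -1 || errno == EINTR) {
        // Got a child or was interrupted.  Keep going.
    }
    assert(errno == ECHILD);

    // Stop acting as a subreaper so that the demo has no lasting effect.
    prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);

    return n == (ssize_t)sizeof(reported) ? reported : -1;
}
```

If the subreaper setting works as described, the value returned matches the caller's own getpid() rather than 1.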

So let’s go back to our sample code to fix it:

#include <sys/wait.h>
#include <sys/prctl.h>

#include <assert.h>
#include <err.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

// Convenience macro to abort quickly if a syscall fails with -1.
#define CHECK_OK(call) do { if ((call) == -1) err(EXIT_FAILURE, #call); } while (0)

int main(int argc, char** argv) {
    if (argc < 2) {
        errx(EXIT_FAILURE, "Must provide a program name and arguments");
    }

    // Configure ourselves to act as the child subreaper, in essence replacing
    // the functionality of init(8).  We should do this before forking to avoid
    // a race between our child spawning subprocesses and us doing this
    // operation.
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == -1) {
        err(EXIT_FAILURE, "prctl");
    }

    int fds[2];
    CHECK_OK(pipe(fds));
    pid_t pid;
    CHECK_OK((pid = fork()));

    if (pid == 0) {
        // Enter a new process group for all of our descendants.
        CHECK_OK(setpgid(getpid(), getpid()));

        // Tell the parent that we have successfully created the group.
        CHECK_OK(close(fds[0]));
        CHECK_OK(write(fds[1], "\0", sizeof(char)));
        CHECK_OK(close(fds[1]));

        // Execute the given program now that the environment is ready.
        execv(argv[1], argv + 1);
        err(EXIT_FAILURE, "execv");
    }

    // Wait until the child has created its own process group.
    //
    // This is a must to prevent a race between the parent waiting for the
    // group and the group not existing yet, and is the only safe way to do so.
    CHECK_OK(close(fds[1]));
    char dummy;
    CHECK_OK(read(fds[0], &dummy, sizeof(char)));
    CHECK_OK(close(fds[0]));

    // Wait for the direct child to finish.  We do this separately to collect
    // and propagate its exit status.
    int status;
    CHECK_OK(waitpid(pid, &status, 0));

    // And now wait for any other process to terminate.  We don't care about
    // them being in our process group any longer.
    while (wait(NULL) != -1) {
        // Got a child.  Wait for more.
    }
    assert(errno == ECHILD);

    return WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE;
}

There are two changes to notice compared to the previous post:

  • The call to prctl(2) [2] with PR_SET_CHILD_SUBREAPER as its first argument and 1 as its second, which configures our tool to become the child subreaper before it spawns any subprocess. (This API is not cryptic at all.)

  • More subtly, we now use wait(2) instead of waitpid(2) to wait for all stray subprocesses. This works because, as the acting child subreaper, we inherit every orphaned descendant and can collect it when it terminates, even if said process decided to escape the process group. Which is nice.

If we build our sample program and run the same command as in the previous post:

$ ./wait-all-linux /bin/sh -c '/bin/sh -c "sleep 5; echo 2" & echo 1'
1
2
$

You’ll see that wait-all-linux did not terminate until the nested subshell did, even though the outer shell exited quickly.


  1. PID namespaces are already in use by the linux-sandbox tool in Bazel, but note that using them can cause other kinds of subtle problems due to PID virtualization.

  2. FreeBSD also provides the child subreaper feature via procctl(PROC_REAP_ACQUIRE) since the 10.2 release, so this post could apply to that system as well. I haven’t found any other system with this, though.