Waiting for process groups, introduction

Process groups are a feature of Unix systems to group related processes under a common identifier, known as the PGID. Using the PGID, one can look for these related processes and send signals to all of them at once. This is typically used by shell interpreters to manage processes.

For example, let’s launch a shell command that puts two sleep invocations in the background (those with the 10- and 20-second delays) and then sleeps the direct child (with a 5-second delay)—while also putting the whole invocation in the background so that we can inspect what’s going on:

$ /bin/sh -c 'sleep 10 & sleep 20 & sleep 5' &
[1] 799
$ ps -o pid,pgid,command
  PID  PGID COMMAND
  612   612 -zsh
  799   799 /bin/sh -c sleep 10 & sleep 20 & sleep 5
  800   799 sleep 10
  801   799 sleep 20
  802   799 sleep 5
  803   803 ps -o pid,pgid,command

In the output of ps, we can observe two interesting things: first, the PGID column has the value 799 for all processes that belong to that invocation; and second, there is one process with the same PID as PGID. That process is known as the process group leader and corresponds to the command we typed.
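To make the PID/PGID relationship concrete, here is a minimal sketch in C. The helper names is_group_leader and forked_child_is_leader are made up for this example; the only system calls involved are fork(2), setpgid(2), getpgid(2) and waitpid(2).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: returns 1 if `pid` is the leader of its process group
// (PID == PGID), 0 if it is not, and -1 if getpgid(2) fails.
int is_group_leader(pid_t pid) {
    pid_t pgid = getpgid(pid);
    if (pgid == -1) return -1;
    return pgid == pid ? 1 : 0;
}

// Hypothetical helper: forks a child, optionally moves it into a new process
// group, and reports whether the child saw itself as a group leader.
// Returns 1 or 0 like is_group_leader(), or -1 on any failure.
int forked_child_is_leader(int new_group) {
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        // setpgid(0, 0) means "put me in a group whose PGID is my own PID".
        if (new_group && setpgid(0, 0) == -1) _exit(2);
        int leader = is_group_leader(getpid());
        _exit(leader == 1 ? 1 : (leader == 0 ? 0 : 2));
    }
    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) > 1) return -1;
    return WEXITSTATUS(status);
}
```

With these helpers, forked_child_is_leader(1) reports 1 because the child created its own group, while forked_child_is_leader(0) reports 0 because the child stays in the group it inherited from its parent, just like the sleep processes in the ps output.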

As we covered in the previous post, Bazel’s process-wrapper helper tool uses process groups as a mechanism to terminate the process it directly spawns (e.g. a test program) and any other subprocesses that this first process might have spawned (e.g. helper tools run by the test program). In essence, the process wrapper does setpgid(getpid(), getpid()) immediately before the call to exec(3) to place its direct child in a new process group, just as the shell example above did, and then uses this handle to terminate the whole group at once with kill(-PGID, SIGKILL).
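The following is a rough sketch of that pattern, not the process wrapper’s actual code. The names spawn_in_new_group, kill_group and sleep_with_helper are hypothetical, the child runs a C function instead of calling exec(3) to keep the example short, and the extra parent-side setpgid call is an assumption of this sketch that makes the group exist by the time the function returns.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

// Hypothetical child body: forks one helper and then both processes block
// until a signal terminates them.
void sleep_with_helper(void) {
    if (fork() == -1) _exit(127);
    for (;;) pause();
}

// Hypothetical helper: forks a child that becomes the leader of a new process
// group and then runs `body`. Returns the child's PID, which is also the new
// PGID, or -1 on failure.
pid_t spawn_in_new_group(void (*body)(void)) {
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        if (setpgid(0, 0) == -1) _exit(127);
        body();
        _exit(0);
    }
    // Assumption of this sketch: also set the group from the parent so that
    // it exists no matter which process runs first. EACCES would mean that
    // the child has already called exec, after doing its own setpgid().
    if (setpgid(pid, pid) == -1 && errno != EACCES) return -1;
    return pid;
}

// Sends SIGKILL to every process in the group `pgid`. The negative first
// argument is what makes kill(2) target a group instead of one process.
int kill_group(pid_t pgid) {
    return kill(-pgid, SIGKILL);
}
```

A caller would do something like pid_t pgid = spawn_in_new_group(sleep_with_helper); followed by kill_group(pgid); and would then still need waitpid(pgid, &status, 0) to reap the direct child.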

But you must remember that kill(2) just posts the signal to the given process(es). There is no guarantee that the signal has been delivered and handled by the time kill(2) returns. The signal is processed at a later time (or maybe not at all, if the signal can be and is ignored), and if the process is blocked in the kernel, the operation it is in the middle of may be allowed to complete before the signal takes effect.
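A small experiment can illustrate the “maybe not at all” case. This is only a sketch: the function name child_survives_sigterm is made up, and the 100 ms pause is an arbitrary assumption that gives the child time to react if it were going to.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

// Hypothetical demo: returns 1 if a child that ignores SIGTERM is still
// running after kill(2) reported success, 0 if it is gone, -1 on error.
int child_survives_sigterm(void) {
    int fds[2];
    if (pipe(fds) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        signal(SIGTERM, SIG_IGN);
        // Tell the parent that SIGTERM is now ignored, then idle.
        if (write(fds[1], "", 1) != 1) _exit(1);
        close(fds[1]);
        for (;;) pause();
    }

    close(fds[1]);
    char dummy;
    ssize_t n = read(fds[0], &dummy, 1);
    close(fds[0]);

    int survived = -1;
    int reaped = 0;
    if (n == 1 && kill(pid, SIGTERM) == 0) {
        // Success only means that the signal was posted. Give the child a
        // moment to react and then poll it without blocking.
        struct timespec delay = {0, 100 * 1000 * 1000};
        nanosleep(&delay, NULL);
        pid_t r = waitpid(pid, NULL, WNOHANG);
        if (r == 0) survived = 1;
        if (r == pid) { survived = 0; reaped = 1; }
    }

    // Clean up. SIGKILL cannot be ignored, so the final wait returns.
    if (!reaped) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
    return survived;
}
```

Here kill(pid, SIGTERM) returns 0 even though the child never acts on the signal, which is why a successful return from kill(2) says nothing about whether the target has terminated.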

Which poses a problem: the Bazel process wrapper is supposed to abide by the contract that, once it terminates, the command given to it has also terminated. If the process wrapper did not wait for all of its descendent processes to fully terminate, we would violate that contract and we would experience very difficult-to-diagnose race conditions. All hypotheticals… right? Well, no, because the process wrapper does not actually wait for the process group to complete, so we have a bug (#10245). Stay calm though, because the bug is extremely hard to hit (we use SIGKILL, not SIGTERM), and it’s only a correctness issue if you are playing with my new --experimental_local_lockfree_output feature combined with dynamic execution.

So… how do we actually wait for all processes in the same process group to terminate? Let’s start by looking at the documentation of waitpid(2). Quoting the macOS 10.15 manual page, we’ll see:

The pid parameter specifies the set of child processes for which to wait. If pid is -1, the call waits for any child process. If pid is 0, the call waits for any child process in the process group of the caller. If pid is greater than zero, the call waits for the process with process id pid. If pid is less than -1, the call waits for any process whose process group id equals the absolute value of pid.

The last sentence is the interesting one, and it looks promising! It sounds like, if we just do waitpid(-PGID, NULL, 0) repeatedly until we get ECHILD, we’ll wait for all subprocesses in the group after we have sent them the termination signal.

Let’s try! Build this sample program, which is intended to spawn the command given as arguments:

#include <sys/wait.h>

#include <assert.h>
#include <err.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

// Convenience macro to abort quickly if a syscall fails with -1.
//
// Not great error handling, but better have some than none given that you, the
// reader, might be copy/pasting this into real production code.
#define CHECK_OK(call) \
    do { if ((call) == -1) err(EXIT_FAILURE, #call); } while (0)

int main(int argc, char** argv) {
    if (argc < 2) {
        errx(EXIT_FAILURE, "Must provide a program name and arguments");
    }

    int fds[2];
    CHECK_OK(pipe(fds));
    pid_t pid;
    CHECK_OK((pid = fork()));

    if (pid == 0) {
        // Enter a new process group for all of our descendents.
        CHECK_OK(setpgid(getpid(), getpid()));

        // Tell the parent that we have successfully created the group.
        CHECK_OK(close(fds[0]));
        CHECK_OK(write(fds[1], "\0", sizeof(char)));
        CHECK_OK(close(fds[1]));

        // Execute the given program now that the environment is ready.
        execv(argv[1], argv + 1);
        err(EXIT_FAILURE, "execv");
    }

    // Wait until the child has created its own process group.
    //
    // This is a must to prevent a race between the parent waiting for the
    // group and the group not existing yet.
    CHECK_OK(close(fds[1]));
    char dummy;
    CHECK_OK(read(fds[0], &dummy, sizeof(char)));
    CHECK_OK(close(fds[0]));

    // Wait for the direct child to finish.  We do this separately to collect
    // and propagate its exit status.
    int status;
    CHECK_OK(waitpid(pid, &status, 0));

    // And now wait for any other process in the group to terminate, as the
    // documentation claims.
    while (waitpid(-pid, NULL, 0) != -1) {
        // Got a child.  Wait for more.
    }
    assert(errno == ECHILD);

    return WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE;
}

and then run it against the same sample command we used in the previous post:

$ ./wait-all /bin/sh -c '/bin/sh -c "sleep 5; echo 2" & echo 1'
1
$ 2

Uh oh… Notice that wait-all exits quickly and that the subprocess that prints 2 remains, printing this a few seconds later.

Not good. What the documentation for waitpid(2) quoted above fails to mention (and I haven’t been able to find this documented anywhere) is that this call only waits for direct child processes with the given PGID. It will do nothing for grandchild processes.

And this makes sense if you think about how the wait(2) family of system calls works internally. These calls block until one of the caller’s own children changes state, which is the same event that causes a SIGCHLD to be sent to the parent. That notification only goes from a child to its direct parent; there is no transitive forwarding, which is why waitpid(2) cannot wait for grandchildren.
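The limitation can be reproduced in isolation with a sketch like the following. The function name grandchild_outlives_wait is hypothetical, the 10-second sleep is an arbitrary choice, and the example assumes that the calling process is not the one that orphaned processes get reparented to (for example, that it is not PID 1 in a container).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

// Hypothetical demo: returns 1 if waitpid(-pgid) reports ECHILD while a
// grandchild in that same group is still alive, 0 if not, -1 on error.
int grandchild_outlives_wait(void) {
    int fds[2];
    if (pipe(fds) == -1) return -1;

    pid_t child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (child == 0) {
        close(fds[0]);
        if (setpgid(0, 0) == -1) _exit(1);
        pid_t grandchild = fork();
        if (grandchild == -1) _exit(1);
        if (grandchild == 0) {
            // Grandchild: stays in the child's group and lingers.
            close(fds[1]);
            sleep(10);
            _exit(0);
        }
        // Child: report the grandchild's PID and exit right away.
        ssize_t w = write(fds[1], &grandchild, sizeof(grandchild));
        _exit(w == (ssize_t)sizeof(grandchild) ? 0 : 1);
    }

    close(fds[1]);
    pid_t grandchild;
    ssize_t n = read(fds[0], &grandchild, sizeof(grandchild));
    close(fds[0]);
    if (waitpid(child, NULL, 0) == -1) return -1;
    if (n != (ssize_t)sizeof(grandchild)) return -1;

    // The direct child is reaped, so the group holds no children of ours.
    while (waitpid(-child, NULL, 0) != -1) {
        // Got a child.  Wait for more.
    }
    int no_children = (errno == ECHILD);

    // Signal 0 performs only the existence and permission checks.
    int alive = (kill(grandchild, 0) == 0);

    kill(-child, SIGKILL);  // Clean up whatever is left in the group.
    return no_children && alive;
}
```

Under those assumptions the function returns 1: waitpid(2) claims there is nothing left to wait for in the group while kill(2) can still find the grandchild in it.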

So how do we fix this? Well… there is no good answer to this question as the solution varies across platforms. And, in fact, it may not be possible to implement at all (correctly) in some, though we can obtain a good approximation.

We’ll dive into the possible alternatives for Linux and macOS in the next posts, which will help us fix the bug that currently exists in Bazel’s process wrapper.

See the followup posts with Linux-specific and macOS-specific details.
