Portability - Julio Merino (jmmv.dev)

Open files limit, macOS, and the JVM

Bazel’s original raison d’etre was to support Google’s monorepo. A consequence of using a monorepo is that some builds will become very large. And large builds can be very resource hungry, especially when using a tool like Bazel that tries to parallelize as many actions as possible for efficiency reasons. There are many resource types in a system, but today I’d like to focus on the number of open files at any given time (nofiles).

January 29, 2019 · Tags: bazel, jvm, macos, monorepo, portability
Continue reading (about 3 minutes)

#! /usr/bin/env considered harmful

Many programming guides recommend to begin scripts with the #! /usr/bin/env shebang in order to to automatically locate the necessary interpreter. For example, for a Python script you would use #! /usr/bin/env python, and then the saying goes, the script would “just work” on any machine with Python installed.

The reason for this recommendation is that /usr/bin/env python will search the PATH for a program called python and execute the first one found… and that usually works fine on one’s own machine.

September 14, 2016 · Tags: featured, portability, programming, scripts, unix
Continue reading (about 5 minutes)

Using va_copy to safely pass ap arguments around

Update (2014-12-19): The advice provided in this blog post is questionable and, in fact, probably incorrect. The bug described below must have happened for some unrelated reason (like, maybe, reuse of ap), but at this point (three years later!) I do not really remember what was going on here nor have much interest in retrying.

A long time ago, while I was preparing an ATF release, I faced many failing tests and crashes in one of the platforms under test. My memory told me this was a problem in OpenSolaris, but the repository logs say that the problem really happened in Fedora 8 x86_64.
The problem manifested itself as segmentation faults pretty much everywhere, and I could trace such crashes down to pieces of code like the following, of which the C code of ATF is full of:

void
foo_fmt(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    foo_ap(fmt, ap);
    va_end(ap);
}

void
foo_ap(const char *fmt, va_list ap)
{
    char buf[128];

    vsnprintf(buf, sizeof(buf), fmt, ap);

    ... now, do something with buf ...
}

The codebase of ATF provides _fmt and _ap variants for many functions to give more flexibility to the caller and, as shown above, the _fmt variant just relies on the _ap variant to do the real work.
Now, the crashes that appeared from the code above seemed to come from the call that consumes the ap argument, which in this case is vsnprintf. Interestingly, though, all the tests in other platforms but Linux x86_64 worked just fine, and this included OpenSolaris, other Linux distributions, some BSDs and even different hardware platforms.
As it turned out, you cannot blindly pass ap arguments around because they are not "normal" parameters (even though, unfortunately, they look like so!). In most platforms, the ap element will be just an "absolute" pointer to the stack, so passing the variable to an inner function calls is fine because the caller's stack has not been destroyed yet and, therefore, the pointer is still valid. But... the ap argument can have other representations. It'd be an offset to the stack instead of a pointer, or it'd be a data structure that holds all the variable parameters. If, for example, the ap argument held an offset, passing it to an inner function call would make such offset point to "garbage" because the stack would have been grown due to the new call frame. (I haven't investigated what specific representation is x86_64 using.)
The solution is to use the va_copy function to generate a new ap object that is valid for the current stack frame. This is easy, so as an example, we have to rewrite the foo_ap function above as follows:

void
foo_ap(const char *fmt, va_list ap)
{
    char buf[128];
    va_list ap2;

    va_copy(ap2, ap);
    vsnprintf(buf, sizeof(buf), fmt, ap2);
    va_end(ap2);

    ... now, do something with buf ...
}

This duplication of the ap argument pointing to the variable list of arguments ensures that ap2 can be safely used from the new stack frame.

September 12, 2011 · Tags: c, portability
Continue reading (about 3 minutes)

Forget about test(1)'s == operator

Some implementations of test(1), in an attempt to be smart, provide non-standard operators such as ==. Please forget about those: they make your scripts non-portable and a pain to use in other systems. Why? Because, due to the way the shell works, failures in calls to test(1) will often just result in an error message (which may not be seen due to other output) and the script will happily continue running even if it missed to perform some important operation.

So... just use the standard equality operators:

= for string equality comparison.
-eq for numeric equality comparison.

Note that whenever I refer to test(1), I'm also talking about the [ ... ] construction in conditionals.

Also, please note that this also affects configure scripts, and the problem in these appears much more commonly than in other scripts!

April 24, 2010 · Tags: portability
Continue reading (about 1 minute)

Always define an else clause for portability #ifdefs

If you use #ifdef conditionals in your code to check for portability features, be sure to always define a catch-all else clause that actually does something, even if this something is to error out.

Consider the following code snippet, quoted from gamin's tests/testing.c file:

if (arg != NULL) {
#ifdef HAVE_SETENV
  setenv("GAM_CLIENT_ID", arg, 1);
#elif HAVE_PUTENV
  char *client_id = malloc (strlen (arg) + sizeof "GAM_CLIENT_ID=");
  if (client_id)
  {
      strcpy (client_id, "GAM_CLIENT_ID=");
      strcat (client_id, arg);
      putenv (client_id);
  }
#endif /* HAVE_SETENV */
}
ret = FAMOpen(&(testState.fc));

The FAMOpen method queries the GAM_CLIENT_ID environment variable to set up the connections parameters to the FAM server. If the variable is not defined, the connection will still work, even though it will use some default internal value. In the test code above, the variable is explicitly set to let the tests use a separate server instance.

Now, did you notice that we have to conditional branches? One for setenv and one for putenv? It seems reasonable to assume that one or the other must be present on any Unix system. Unfortunately, this is flawed:

What happens if the code forgets to include config.h?
What happens if the configure script fails to detect both setenv and putenv? This is not that uncommon, given how some configure scripts are written.
What happens if neither setenv nor putenv are available?

The answer to the three questions is: in the above code snippet, the code builds just fine but will misbehave at run time: neither HAVE_SETENV nor HAVE_PUTENV are defined, so the code will not be able to define the required environment variable. However, FAMOpen will later be called and it will not behave as expected because the variable has not been set.

Note that this code snippet is just an example. I have seen many more instances of this exact same problem with worse consequences than the above. Read: they were not part of the test code, but just part of the regular code path.

So how do you implement the above in a saner way? You have two alternatives:

Add an #else clause that contains a fallback implementation. In the case above, we could, for example, prefer to use setenv if present because it has a nicer interface, and fall back to putenv if not found.
This has a disadvantage though: if you forget to include config.h or the configure script cannot correctly detect one of the possible implementations (even when present), you will always use the fallback implementation.
Keep each possible implementation correctly protected by a conditional, but add a #else clause that raises an error at build time. This will make sure that you never forget to define at least one of the portability macros for any reason. This is the preferred approach.

Following the second suggestion above, the code would get the following structure:

#if defined(HAVE_SETENV)
setenv(...);
#elif defined(HAVE_PUTENV)
putenv(...);
#else
#   error "Don't know how to set environment variables."
#endif

With this code, we can be sure that the code will not build if none of the possible implementations are selected. We can later proceed to investigate why that happened.

April 22, 2010 · Tags: portability
Continue reading (about 3 minutes)

unlink(2) can actually remove directories

I have always thought that unlink(2) was meant to remove files only but, yesterday, SunOS (SXDE 200709) proved my wrong. I was sanity-checking the source tree for the imminent ATF 0.4 release under this platform, which is always scary, and the tests for the atf::fs::remove function were failing — only when run as root.

The failure happened in the cleanup phase of the test case, in which ATF attempts to recursively remove the temporary work directory. When it attempted to remove one of the directories inside it, it failed with a ENOENT message, which in SunOS may mean that the directory is not empty. Strangely, when inspecting the left-over work tree, that directory was indeed empty and it could not be removed with rm -rf nor with rmdir.

The manual page for unlink(2) finally gave me the clue of what was happening:

If the path argument is a directory and the filesystem supports unlink() and unlinkat() on directories, the directory is unlinked from its parent with no cleanup being performed. In UFS, the disconnected directory will be found the next time the filesystem is checked with fsck(1M). The unlink() and unlinkat() functions will not fail simply because a directory is not empty. The user with appropriate privileges can orphan a non-empty directory without generating an error message.

The solution was easy: as my custom remove function is supposed to remove files only, I added a check before the call to unlink(2) to ensure that the path name does not point to a directory. Not the prettiest possibility (because it is subject to race-conditions even though it is not critical), but it works.

February 3, 2008 · Tags: atf, portability, sunos
Continue reading (about 2 minutes)

How to kill a tree of processes

Yesterday I mentioned the need for a way to kill a tree of processes in order to effectively implement timeouts for test cases. Let's see how the current algorithm in ATF works:

The root process is stopped by sending a SIGSTOP to it so that it cannot spawn any new children while being processed.
Get the whole list of active processes and filter them to only get those that are direct children of the root process.
Iterate over all the direct children and repeat from 1, recursively.
Send the real desired signal (typically SIGTERM) to the root process.

There are two major caveats in the above algorithm. First, point 2. There is no standard way to get the list of processes of a Unix system, so I have had to code three different implementations so far for this trivial requirement: one for NetBSD's KVM, one for Mac OS X's sysctl kern.proc node and one for Linux's procfs.

Then, and the worst one, comes in point 4. Some systems (Linux and Mac OS X so far) do not seem to allow one to send a signal to a stopped process. Well, strictly speaking they allow it, but the second signal seems to be simply ignored whereas under NetBSD the process' execution is resumed and the signal is delivered. I do not know which behavior is right.

If we cannot send the signal to the stopped process, we can run into a race condition: we have to wake it up by sending a SIGCONT and then deliver the signal, but in between these events the process may have spawned new children that we are not aware of.

Still, being able to send a signal to a stopped process does not completely resolve the race condition. If we are sending a signal that the user can reprogram (such as SIGTERM), that process may fork another one before exiting, and thus we'd not kill this one. But... well... this is impossible to resolve with the existing kernel APIs as far as I can tell.

One solution to this problem is killing a timed-out test by using SIGKILL instead of SIGTERM. SIGKILL could work on any case because means die immediately, without giving a chance to the process to mess with it. Therefore SIGCONT would not be needed in any case &mash;because you can simply kill a stopped process and it will die immediately as expected— and the process would not have a chance to spawn any more children after it had been stopped.

Blah, after writing this I wonder why I went with all the complexity of dealing with signals that are not SIGKILL... say over-engineering if you want...

January 16, 2008 · Tags: atf, portability, process
Continue reading (about 3 minutes)

SoC: Prototypes for basename and dirname

Today, I've attempted to build atf on a NetBSD 4.0_BETA2 system I've been setting up in a spare box I had around, as opposed to the Mac OS X system I'm using for daily development. The build failed due to some well-understood problems, but there was an annoying one with respect to some calls to the standard XPG basename(3) and dirname(3) functions.

According to the Mac OS X manual pages for those functions, they are supposed to take a const char * argument. However, the NetBSD versions of these functions take a plain char * parameter instead — i.e., not a constant pointer.

After Googling for some references and with advice from joerg@, I got the answer: it turns out that the XPG versions¹ of basename and dirname can modify the input string by trimming trailing directory separators (even though the current implementation in NetBSD does not do that). This makes no sense to me, but it's what the XPG4.2 and POSIX.1 standards specify.

I've resolved this issue by simply re-implementing basename and dirname (which I wanted to do anyway), making my own versions take and return constant strings. And to make things safer, I've added a check to the configure script that verifies if the basename and dirname implementations take a constant pointer and, in that (incorrect) case, use the standard functions to sanity-check the results of my own by means of an assertion.

¹ How not, the GNU libc library provides its own variations of basename and dirname. However, including libgen.h forces the usage of the XPG versions.

June 28, 2007 · Tags: atf, portability, soc
Continue reading (about 2 minutes)

Posts: Portability