Let’s continue our dive into the very interesting topic of how Unix (or Linux or what have you) and Windows differ regarding argument processing. And by that I mean: how a program (the caller) communicates the set of arguments to pass to another program (the callee) at execution time, how the callee receives such arguments, and what are the consequences of each design.

NOTE: Pay attention to this post because this is interview-level material for a systems internals session!

The thing is that I’ve known for a long time that argument processing works very differently between these two systems. After all, back in 2006, I wrote the beginnings of the Boost.Process library with the main goal of learning Win32 programming, and I faced these complexities then. That said, it wasn’t until these days of actively developing from a Windows command line that I realized the massive impact these differences have on usability—and I want to focus on these usability issues here.

We’ll be using the following toy program, print-args, throughout our study:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    printf("argc = %d\n", argc);

    for (int i = 0; i < argc; i++) {
        printf("argv[%d] = %s\n", i, argv[i]);
    }

    return EXIT_SUCCESS;
}

This utility simply prints the value of argc as received by the main() entry point of a C program, and then prints each argv element on its own line so that we can clearly see how the callee receives the arguments.

Argument processing on Unix

In Unix, a caller program specifies the arguments to pass to the callee as an array of strings and the callee receives such array verbatim. In other words, the caller provides an argv array and the callee sees that array exactly as it was specified. This is painfully clear if you look at the synopsis of the execv(3) family of functions:

int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[], char *const envp[]);

With this in mind, consider the following program that executes print-args:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    char* argv[] = {
        "fake-name",  // The (fake!) program name.
        "one",        // An argument with a single word.
        "two three",  // An argument with multiple words.
        "p*",         // An argument with a glob that matches print-args.
        NULL,
    };
    execv("./print-args", argv);
    perror("execv failed");
    return EXIT_FAILURE;
}

If we run this feed-args program, we see:

$ ./feed-args
argc = 4
argv[0] = fake-name
argv[1] = one
argv[2] = two three
argv[3] = p*

The output above comes from print-args, not feed-args, and there are a few details worthy of note:

  1. The program name in argv[0], which we set as fake-name, does not match the name of the binary that ran the program. This “feature” is often used to change the behavior of a program when the same program has multiple names on disk (as is the case for /bin/test and /bin/[). This also means that you cannot trust the program name that you see in the output of ps ax, for example, as it might be forged.

  2. Whitespace and other usually problematic characters (like quotation marks) are trivially preserved. If an argument in the argv array passed to execv(3) includes such characters, the callee receives that argument with those same characters intact.

  3. Globs are not expanded. The p* above could have matched the print-args binary I had in the same directory but didn’t because there is no glob expansion happening.

All of these details are great because they mean that every single program behaves in the exact same way, and there is zero magic in between a program invocation and what the program later sees. In particular, if you use exec(3) primitives, you never have to worry about escaping and thus your invocations will be safe by default. (Corollary: never ever use system(3) if you like secure systems.)

However, if we try to replicate the invocation above in the shell without worrying about quoting, we’ll face very different results:

$ ./print-args one two three p*
argc = 6
argv[0] = ./print-args
argv[1] = one
argv[2] = two
argv[3] = three
argv[4] = print-args
argv[5] = print-args.c

Look at that: two and three now became separate arguments (obviously). More interestingly, though, p* was now expanded to match all files in the current directory that start with the letter p. How did that happen?

The shell happened. The shell is the one responsible for glob and variable expansion. The shell is the component that parses the command line looking for things to interpret and will interpret them before invoking any exec(3) primitive. It is the shell that needs escaping or quoting when there are special characters present, because the shell is trying to figure out how to split a raw string into a collection of separate argv elements.

Sure, given different shells, the quoting and expansion rules might change—but that’s only a benefit, or problem, for you, the user. As long as you understand how your shell of choice works, you know how to pass any argument to any program.
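In fact, with proper quoting, we can make the shell hand over the exact argv array that feed-args built by hand. The sketch below uses printf '%s\n' as a stand-in for print-args, because it too prints each argument it receives on its own line:

```shell
# Double quotes keep "two three" together as one argument, and single
# quotes suppress glob expansion for 'p*'; printf '%s\n' then prints
# each argument it received on its own line.
printf '%s\n' one "two three" 'p*'
# one
# two three
# p*
```

Swap printf for ./print-args and the callee sees the same three arguments we fed through execv(3) earlier, minus the forged argv[0].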

Argument processing on Windows

Things are not as rosy on Windows. They are… fascinating or, better said, horrifying: I haven’t yet figured out a case where this model would be better, so all I can think of is that this is an artifact of the MS-DOS roots.

The key difference between Unix and Windows is that, on Windows, the function to execute a program looks like this:

BOOL CreateProcessA(
    LPCSTR                lpApplicationName,
    LPSTR                 lpCommandLine,
    LPSECURITY_ATTRIBUTES lpProcessAttributes,
    LPSECURITY_ATTRIBUTES lpThreadAttributes,
    BOOL                  bInheritHandles,
    DWORD                 dwCreationFlags,
    LPVOID                lpEnvironment,
    LPCSTR                lpCurrentDirectory,
    LPSTARTUPINFOA        lpStartupInfo,
    LPPROCESS_INFORMATION lpProcessInformation
);

Some things are better with this CreateProcessA signature (I do buy into the arguments given in the “A fork() in the road” paper, which claims that the fork+exec combination, even if elegant, is a hack that needs to die). But others, like the single command line string in lpCommandLine, are definitely not better.

IMPORTANT: Let me reiterate that because it’s key: on Windows, the caller program specifies a single string with all arguments to pass to the callee, and the callee receives such single string verbatim. The consequence of this is that the callee must do argument parsing on its own.

“But, Julio, I have written plenty of C programs on Windows and they all started with a common-looking main() function that receives the argc and argv pair! This cannot be true!” Ha, ha, you have been fooled by the msvcrt C runtime then.

As you can imagine, having each program implement argument parsing on its own would be terrible for usability. To mitigate this problem, the system provides common code in the standard libraries to perform argument parsing in a relatively consistent manner. If you are writing a Windows-native app, this takes the form of an explicit call to CommandLineToArgvW, but if you are writing a traditional C program with its main() function (not WinMain), then the C runtime performs argument processing on its own before handing control to the main() function.

However, given that this happens within the C runtime startup code… it is entirely possible for a program to bypass this logic and do its own crazy argument parsing. And even the msvcrt code provides various options to customize how argument handling works.
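As a sketch of what the explicit, Windows-native path looks like (this illustrates the documented API, and is not a copy of what msvcrt’s startup code actually does), a program can take the raw string from GetCommandLineW and split it into an argv-style array itself:

```c
#include <windows.h>
#include <shellapi.h>  /* CommandLineToArgvW; link against Shell32.lib. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* GetCommandLineW returns the single, raw command-line string that
     * the caller passed down through CreateProcess. */
    int argc;
    LPWSTR* argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL) {
        fprintf(stderr, "CommandLineToArgvW failed\n");
        return EXIT_FAILURE;
    }

    for (int i = 0; i < argc; i++) {
        wprintf(L"argv[%d] = %ls\n", i, argv[i]);
    }

    /* The returned array lives in a single allocation that the caller
     * must free. */
    LocalFree(argv);
    return EXIT_SUCCESS;
}
```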

The consequences of this are dramatic as we shall see in the following two examples. Note that the point of this post is to focus on these consequences and not to look at the finest details on how Windows implements argument parsing. For that, I’d refer you to the “A Better Way To Understand Quoting and Escaping of Windows Command Line Arguments” post on the Windows Inspired blog and the official documentation on “main function and command-line arguments”.

Glob expansion in Windows

Let’s look at the first roadblock that we might encounter: glob expansion (or wildcard expansion in Windows parlance). For this, let’s first see what led me down this path.

The Windows terminal is… rudimentary. I had never considered using a “better” ls tool until now but, being aware that the exa tool exists, I tried to install it on my Windows machine. The upstream code doesn’t compile on Windows, but there is an unofficial port (search through the issue tracker) that mostly does.

A quick try seemed to work great:

C:\Users\jmmv\test>dir
 Volume in drive C is Local Disk
 Volume Serial Number is 6C43-48F3

 Directory of C:\Users\jmmv\test

2020-11-01  19:24    <DIR>          .
2020-11-01  19:24    <DIR>          ..
2020-11-01  19:24                 0 aaa
2020-11-01  19:24                 0 abcdef
2020-11-01  19:24                 0 zyx
               3 File(s)              0 bytes
               2 Dir(s)  907,654,946,816 bytes free

C:\Users\jmmv\test>exa
aaa  abcdef  zyx

But… soon, things appeared broken enough to make this unofficial port of exa to Windows useless:

C:\Users\jmmv\test>exa a*
"a*": The filename, directory name, or volume label syntax is incorrect. (os error 123)

Simply put, glob expansion isn’t working, but it definitely does on Linux if you run this tool there. Now, if you have read this post up to here, it should be obvious why that’s the case: the Windows shell took everything after the exa command name and passed it as a string to the program. The program then tokenized the string, trying to simulate an argv array… but did not implement glob expansion (because there is no need for it in a Unix world). Hence the program saw the unexpanded glob and failed.

Yikes. We can observe this same behavior if we run our print-args tool from above with a p* pattern that we expect to expand to the program’s own files:

C:\Users\jmmv\test>.\print-args p*
argc = 2
argv[0] = .\print-args
argv[1] = p*

Here is where the common system code I mentioned above comes to the rescue: it prevents you from having to implement glob expansion on your own. This common code comes from msvcrt. If you compile the program as follows, as documented in the official “Expanding Wildcard Arguments” page:

C:\Users\jmmv\test> cl print-args.c /link setargv.obj

Then glob expansion works as you imagined:

C:\Users\jmmv\test>.\print-args p*
argc = 4
argv[0] = .\print-args
argv[1] = print-args.c
argv[2] = print-args.exe
argv[3] = print-args.obj

… more or less… because you cannot quote the globs as you would on Unix:

C:\Users\jmmv\test>.\print-args "p*"
argc = 4
argv[0] = .\print-args
argv[1] = print-args.c
argv[2] = print-args.exe
argv[3] = print-args.obj

C:\Users\jmmv\test>.\print-args "p\*"
argc = 2
argv[0] = .\print-args
argv[1] = p\*

C:\Users\jmmv\test>.\print-args p\*
argc = 2
argv[0] = .\print-args
argv[1] = p\*

C:\Users\jmmv\test>.\print-args p\\*
argc = 2
argv[0] = .\print-args
argv[1] = p\\*

C:\Users\jmmv\test>.\print-args 'p*'
argc = 2
argv[0] = .\print-args
argv[1] = 'p*'

and I have not yet figured out how to escape these special characters.

Anyhow. You see: a program must know that argument expansion might be valuable in its invocation, and thus must explicitly request such a feature from the system at build time or implement its own processing. And in the exa case I hit, the program failed to pull in the necessary code.

Raw command lines in Windows

Let’s look at an even more interesting example, which is something I randomly found when executing the FIND command (the grep equivalent). If we run it as its help message says, double-quoting the string to search for, things work great:

C:\Users\jmmv\test>type print-args.c | find "include"
#include <stdio.h>
#include <stdlib.h>

But let’s say that you “forget” to quote the string, because why would you if it’s a single word:

C:\Users\jmmv\test>type print-args.c | find include
FIND: Parameter format not correct

Uh, excuse me? I’m convinced many other commands take strings without double quotes when they aren’t strictly needed (CD being the obvious example when the path doesn’t contain spaces). So why is FIND so finicky about this?

Well, the answer is probably “because MS-DOS did it that way”, and the mechanism by which this can be implemented is obvious: the FIND process receives the raw command line as provided by the shell and so can explicitly insist on the double quotes. But… how is it doing that? As we saw above, the msvcrt runtime does argument splitting before handing control off to main(), and said common code strips out surrounding double quotes. So how is FIND accessing the raw command line?

Disappointingly, I don’t know yet. I know how it could be implemented, but I do not know how exactly FIND does it. I spent a couple of hours tracing through FIND’s execution within WinDbgX (with which I had zero experience before this exercise). I looked at ReactOS’s FIND (which doesn’t seem to reproduce this “bug”) and msvcrt’s implementation without success because I could not confirm any of the leads I encountered. I even looked at the MS-DOS 2.0 assembly version of this code, which confirmed that it did its own argument splitting. But the only thing I can conclude from this investigation is that FIND’s main() receives the full command line as a single argument… and something is happening pre-main to disable splitting, but I can’t quite tell what.

What I did find were more options to configure the msvcrt, and among these is one to disable all argument processing. If we use it in our program:

C:\Users\jmmv\test> cl print-args.c /link noarg.obj

Then…

C:\Users\jmmv\test>.\print-args
argc = 0

which makes main() blind to any input arguments and I’m not sure how that would be useful. However, there is the magic _acmdln global variable that does include the raw command line as received by the process, so you could use that to perform any kind of manual splitting if you had to.
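A minimal sketch of that escape hatch follows. Note that _acmdln is an internal CRT global rather than a documented public API, so the extern declaration here is an assumption that happens to hold for the classic msvcrt (the wide-character twin is _wcmdln):

```c
#include <stdio.h>

/* Internal msvcrt global holding the raw command line as received by
 * the process.  This is not a documented public API. */
extern char* _acmdln;

int main(void) {
    /* Even with argument processing disabled (noarg.obj), the raw
     * command line is still reachable here and could be split by
     * hand. */
    printf("raw command line: %s\n", _acmdln);
    return 0;
}
```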

Verdict

The Unix simplicity wins hands down in this case. With its simple approach, the user gets a consistent experience across all programs. On the other hand, Windows' argument processing is crazy talk riddled with what I think are backwards-compatibility hacks.

But let’s not forget that Unix is similarly crazy when we get to options parsing: options are plain text arguments and each program must implement its own variant on how they are interpreted. And that’s why you end up with programs that take long options prefixed with a single dash and other programs that need two dashes, for example. PowerShell has gotten these details right, but you must stay within the realms of cmdlets for that to be true.
