This is the tale of yet another Bazel bug, this time involving environment variables, global state, and gRPC. Through it, I’ll argue that you should never use
setenv within a program unless you are doing so to execute something else (and even then with caveats). Before we start, I hope we can agree upfront that using global variables to pass state within a program is bad; can’t we?
Bazel is a build tool and, as such, is a program that runs on your computer—quite likely as an interactive tool on the command line. That said, Bazel uses a client/server architecture: the client handles command-line invocations which the server is responsible for executing. The server exists as an optimization to maintain state across those invocations and to avoid the cost of starting a new JVM for each command. Therefore, you can think of this client/server split as an implementation detail.
The Bazel client and server always run on the same machine and they communicate with each other using gRPC over a localhost-bound
AF_INET socket1. Because of this, and because gRPC is sent over HTTP, the connection is subject to the proxy rules defined by the conventional
no_proxy environment variables2.
But we know that the client and server always run on the same machine, so we must avoid the use of proxies when establishing the connection between the two. The way Bazel forced the communication to skip proxies was by setting the
no_proxy environment variable during its own startup to later influence the internals of the gRPC library. This, by itself, is bad enough: the code was essentially using a global variable to configure how another piece of code had to behave. But this was also buggy: the
no_proxy setting overrode user configuration (issue #6400), preventing those users from legitimately configuring proxy settings for outgoing connections.
A user decided to fix this issue and implemented code to avoid overriding a possibly-existing value of
no_proxy (PR #6399). The change made Bazel extend the value a preexisting
no_proxy variable to respect the user settings, but to configure localhost-bound connections to bypass proxies. This seemed reasonable, but was it right?
Luckily, I could overhear conversations about this bug regarding the naming of all these variables, and the thought came to mind: “why are we abusing environment variables to control what’s essentially a property of an individual connection?” Bazel should leave all these environment variables untouched and configure gRPC as needed for that one connection, explicitly.
This hint was sufficient for someone else in the team to indicate that this was, in fact, possible: we just had to set the
grpc.enable_http_proxy property on the channel to false to bypass all proxy-related settings for the Bazel client/server communication. This coworker followed up on this to implement this simpler and saner fix (PR #7368)3, which means Bazel won’t mess with your proxy settings any longer and will respect them where they matter.
Moral of the story? Don’t mutate environment variables from within your process (unless that’s to configure a new child process, and even then be careful with threads): they are essentially global variables. Or, simply: avoid
- You may wonder why the Bazel client and server don’t talk to each other via an
AF_UNIXsocket instead. I’m told that these were not supported in gRPC when the communication code was ported from whatever thing we used in the past—but they are now. Moving to a Unix-domain socket would be nice (e.g. to avoid having to deal with a shared secret) but I don’t know how this would translate to the Windows port. [return]
- All environment variables that control proxy settings are typically specified in lower case. This is likely a historical mistake which has been made worse by some recent programs trying to recognize the upper case variants. The mess is huge now, so the only safe solution, as a user, is to define both variants and ensure their values match. [return]
- Applying this fix made us discover that we had to update our dependency on gRPC (issue #2804) to a newer version because the per-channel
grpc.enable_http_proxyfeature was not available in the stale versions we depended on. Hopefully this explains why the original code abused environment variables in the first place. [return]