My latest developer productivity rant thesis is that integration tests should be written in the exact same language as the thing they test. Specifically, not shell.

This theory applies mostly to tests that verify infrastructure software like servers or command line tools. It is too easy to fall into the trap of using the shell because it feels like the natural choice to interact with tools. But I argue that this is a big mistake that hurts the long-term health of the project, and once trapped, it’s hard to escape.

Mind you, I’ve made this mistake in the past countless times and I’ve observed pretty much every other infrastructure team make the same mistake. I’ve also observed teams make a nicer choice by using Python instead of the shell, but the problems that eventually surface are the same.

The core of my argument is that you should stick to the language your team is familiar with. Choosing a different language for the integration tests on the premise that you “need” a scripting language is a flawed argument bound to cause trouble. And even if you have a separate testing team, you should strive for homogenization because, otherwise, the tests will live in the testing team’s realm and your developers will not want to have anything to do with them.

To elaborate on this thesis, I wrote a rather long document analyzing the specific case of the project I work on, the Bazel build system. In this project, our integration tests have traditionally been written in shell. I propose that we rewrite them in Java, which is the primary language used in the core of the project, and I prove that this doesn’t necessarily have to make our tests “too verbose”.

I have split the document into two pieces: one motivational and one practical. Below, you can find the motivational part of the document. I’ll share the other part in a follow-up post.

Disclaimer: These are my arguments and this view is not necessarily endorsed by everyone in the team. But, so far, I’ve received positive feedback from various individuals 😉.

A case for writing integration tests in Java: motivation
Fixing Bazel’s reliability by treating tests as production-grade code


The bulk of our integration tests are written in shell. As of February 2018, Bazel’s codebase is 3.6% shell or 40,000 lines and Blaze’s codebase is X%¹ shell or 130,000 lines². These are small percentages… but a lot of code.

To make matters worse, the health of these tests is questionable. Leaving aside that they are fragile and slow to run, the foundations on which they are built are rudimentary: to someone knowledgeable in shell, there are plenty of subtle bugs in the code—and even ShellCheck, with all of its limitations, finds 7752 problems. As a simple example, enabling the shell’s “strict mode” features causes 48% of the tests (137) to fail. Some failures are caused by leftover pieces of code that don’t work any more, but others are much more concerning: an “assert is not defined” message coming out of a test is a sign that the test has never actually verified anything! Similarly, test setup is so convoluted that it’s subject to the tragedy of the commons and deteriorates over time.

The problem is that there are no incentives to improve the status quo: the shell is only used for tests and tests are sometimes seen as an afterthought or overhead. Combined with the fact that the shell is an alien language to most, integration tests don’t receive the care they deserve. As a result, the codebase degenerates over time. And we can’t blame anyone for this: the shell is an awful and arcane programming language and it’d be a waste of time to train all of us to become experts.

In this document, I propose that we write integration tests in the same language that Bazel is written in, the language that we are all familiar with, and the language that we all want to master because of our foundations: Java. I will dive into the problems in detail and try to convince you that this (and not Python) is a good idea.

What’s wrong with the shell?

We are not proficient in shell and it’s not worth training us

Bazel is primarily written in Java.

Java is a very expressive and robust language, which allows us to reliably ship Bazel to thousands of engineers over and over again with only minor hiccups. Writing Java with a good IDE is a pleasure. Java is a well-supported language within Google. Java generally ranks among the most popular languages in the world. And because of all this, our team is proficient in Java: half of our engineers have Java readability, and it’s probably fair to say that everyone actively wants to improve their mastery of this language.

Other than for small glue pieces (e.g. the C++ client) and tests, the team does not routinely write code in any other language… yet everyone has to write integration tests at some point or another.

This requirement to write code in shell causes some change list (CL) authors to acknowledge, at review time, that they don’t know the language. Sometimes, CL authors disregard reviewer comments that suggest the adoption of common shell idioms in favor of “simplifying their code” so that they can understand it later. More frequently though, CL authors choose not to follow the suggested best practices because the current code is so broken that being consistent with it is better than being different (and that’s fair).

These are not signs of good engineering: if we use a language in our project in large quantity, we have to be able to commit to mastering it. But this is not a reasonable proposition: mastering the shell takes a lot of time and frustration, and that time could be better spent elsewhere. Furthermore, the shell is an ancient language that’s not really worth learning in depth as a career move: we can all become better programmers if we learn more modern languages and techniques instead.

Global state obscures what’s happening

The primary touted benefit of writing tests in shell is that calls to the bazel binary look the same as the user would type them. This is a good thing because the tests clearly mimic user behavior. Or do they?

This benefit is an illusion: even though the calls look like what the user invokes, they are not: the common code for the tests modifies a ton of global state (creates bazelrc files with test contents, creates mock tools, and even aliases what bazel does), so when the tool is finally invoked, it’s actually not what the user would get. Given that these are changes to global state, it’s extremely hard to track down what has changed and how, which makes understanding test behavior difficult.

Of course, this is a fixable problem: we could come up with better abstractions in the testing framework. But doing so in shell is futile because of the small number of primitives available to construct high-level abstractions and the ease with which global state leaks.
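To make the contrast concrete, here is a minimal Java sketch of what such an abstraction could look like. The `TestInvocation` class and all of its methods are hypothetical names of my own and not part of Bazel's code; the only real APIs used are `ProcessBuilder` and the standard collections. The idea is that the working directory, the environment, and any startup flags are explicit values owned by the test, so nothing is modified behind its back.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: every input to an invocation is an explicit, immutable value. */
public final class TestInvocation {
    private final Path workDir;
    private final Map<String, String> env;
    private final List<String> startupFlags;

    private TestInvocation(Path workDir, Map<String, String> env, List<String> startupFlags) {
        this.workDir = workDir;
        this.env = env;
        this.startupFlags = startupFlags;
    }

    /** Starts from a clean slate: no environment variables and no startup flags. */
    public static TestInvocation inDirectory(Path workDir) {
        return new TestInvocation(workDir, Map.of(), List.of());
    }

    /** Returns a copy with one more environment variable; the receiver is unchanged. */
    public TestInvocation withEnv(String key, String value) {
        Map<String, String> copy = new LinkedHashMap<>(env);
        copy.put(key, value);
        return new TestInvocation(workDir, copy, startupFlags);
    }

    /** Returns a copy with one more startup flag; the receiver is unchanged. */
    public TestInvocation withStartupFlag(String flag) {
        List<String> copy = new ArrayList<>(startupFlags);
        copy.add(flag);
        return new TestInvocation(workDir, env, copy);
    }

    /** Describes the process to run; nothing is inherited from the test runner's environment. */
    public ProcessBuilder toProcessBuilder(String binary, String... args) {
        List<String> command = new ArrayList<>();
        command.add(binary);
        command.addAll(startupFlags);
        command.addAll(List.of(args));
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.directory(workDir.toFile());
        builder.environment().clear();
        builder.environment().putAll(env);
        return builder;
    }
}
```

A test would then read something like `TestInvocation.inDirectory(workspace).withEnv("HOME", home).toProcessBuilder("bazel", "build", "//...")`, and everything that differs from what a user would type is visible at the call site instead of hidden in sourced files.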

What is the shell anyway? A non-portable beast, that is

The “shell” is a very broad term and, when treated as such, is full of pitfalls. I’ve been avoiding mentioning the fact that Bazel requires Bash, not a standard shell… so it’s time to cover this.

What variant of the shell are we talking about: POSIX, Korn, Bash… Zsh? Each implementation supports different features, some of which are obvious ([ vs. [[) and some of which are subtle ([’s support of == or lack thereof). It’s hard to know what’s portable and what is not.

Do we care about the shell version? Different versions implement different features and they come with different bugs. The shining pain point here is macOS’s Bash: it’s an ancient version that contains serious bugs, at least one of which prevents writing shell with “strict mode” turned on because well-formed code fails to parse under certain circumstances.

And what about the supporting tools? The shell is a cryptic language with a limited amount of built-in functionality. The vast majority of actions require invoking external commands (e.g. cp, grep, find), which is slow and prone to portability problems: where do the tools live? What flags do they support? Do the flags behave the same across systems? … are they even external? E.g. echo is both a built-in and an external tool, and they are not guaranteed to be compatible within the same system!

All of these make writing readable and robust shell code exceedingly difficult. And they also make writing portable shell almost impossible: one must have excellent knowledge of all the shell variants, and it’s increasingly hard to become knowledgeable in this area: the world in general assumes that “Bash” and “shell” are synonyms, so there is little documentation about the differences (and the documentation that exists is not obvious).
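For comparison, the kinds of actions that force a shell script to call out to find or grep can be done in Java with the standard library alone, which removes the questions about where the tools live and which flags they accept. The sketch below is only an illustration: the `PortableFileOps` class and its method names are made up for this example, while `Files.walk` and `Files.lines` are standard `java.nio.file` calls.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helpers that replace common external tools with library calls. */
public final class PortableFileOps {
    private PortableFileOps() {}

    /** Rough equivalent of `find root -type f -name '*suffix'`, with sorted results. */
    public static List<Path> findBySuffix(Path root, String suffix) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(suffix))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Rough equivalent of `grep -F needle file`: the lines containing the literal text. */
    public static List<String> grepFixed(Path file, String needle) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(needle)).collect(Collectors.toList());
        }
    }
}
```

No subprocess is spawned, and the behavior does not depend on which implementation of a tool happens to be first in the PATH.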

Not for Windows

The previous section covered portability across shell variants… but what about other platforms? The obvious portability problem of the shell is that it’s alien to Windows—and we want Windows to be a first-class platform in Bazel, don’t we?

It’s certainly possible to run shell scripts on Windows, but none of the available approaches is “native”—yet we want to ship a Bazel binary that integrates well with the native Windows ecosystem. If, at some point, we want to enable Bazel developers to fully develop natively on Windows, the presence of shell will be harmful. Not to mention that the shell is slower on Windows than on Linux.

By the way: almost all recent commits to src/test/py/bazel/ (note the py in there) come from people who have worked on the Bazel Windows port. This should be telling.

To a lesser degree, the same problem arises on other platforms: Bash is not available by default under non-Linux Unix-like operating systems (such as FreeBSD, which we also intend to support) due to licensing and ideological reasons: in those systems, /bin/sh continues to be a standards-compliant shell interpreter, not Bash, so it cannot run the code that we currently have.

What’s wrong with our integration tests?

No use of “strict mode”

Despite the problems it has, enabling the shell’s strict mode at the beginning of any shell script is in general a good idea because it prevents a lot of problems.

Can you guess what the issue is though? Our integration tests do not use this feature and enabling it causes all kinds of problems to surface: a trivial attempt to enable it in our unittest.bash file triggered 137 test failures out of 282 tests. The implication is that some of our tests are not doing what they seem to be doing and may be reporting success even if already broken.

Spaghetti foundations

Try to figure out what happens before an integration test (the thing within your test_foo function) starts running. Untangle at least 6 shell file includes, most of which modify the global environment and call functions that touch the file system. I’ll wait.

Now try to change this “framework” to cover a new need of yours. Without breaking anything.

Understanding what sets up mock tools, what creates bazelrc files, what creates a project-like client, why the environment is changed in certain ways, how the execution log is handled, how assertions are implemented, etc. is a very difficult thing to do. Navigating this maze is complicated and, whenever it’s necessary to do so, it slows down any kind of change to our code.

Code quality

Coding style guidelines exist for a reason and linters exist for a reason. Smart people have put a lot of time into coming up with procedures for writing readable and maintainable code and have written tools to ensure that the code lives up to those standards.

Yet our shell tests pass neither of the two. Granted: the foundations of our shell tests date back to when these tools did not exist, but we have put no effort into resolving these issues after the fact. This is problematic because, when writing new code, it is accepted practice to pile things on top of the existing mess instead of paying the technical debt, as paying it is too expensive. Not paying the debt, however, causes every single CL review to be polluted by the existing 7752 ShellCheck warnings, some of which become part of the CL discussion (without positive resolution) and drag everything along.

As a result, the problems don’t get fixed, and as our product becomes more and more complex, so do our testing needs. And with a bad foundation, the only things we can expect to adapt to those extra needs in a short time frame are shiny new hacks.

Lack of reliability

As hinted above, the complexity and poor foundations of our tests cause them to not be as reliable as they should be. It should not be possible for statements in our tests to return errors that do not trigger test failures, because in most cases those are signs of test bugs. But that’s the general behavior today.
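As a rough illustration of the opposite default, here is a minimal Java sketch in which a failing command always surfaces as an exception unless the test deliberately catches it. The `CheckedCommand` class, its methods, and the `CommandFailedException` type are hypothetical names invented for this example, not an existing framework; only `ProcessBuilder` and `Process` are real APIs.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch: a failed command can never be silently ignored. */
public final class CheckedCommand {
    private CheckedCommand() {}

    /** Unchecked exception carrying the failed command and its exit code. */
    public static final class CommandFailedException extends RuntimeException {
        private final int exitCode;

        public CommandFailedException(List<String> command, int exitCode) {
            super("Command " + command + " exited with code " + exitCode);
            this.exitCode = exitCode;
        }

        public int exitCode() {
            return exitCode;
        }
    }

    /** Throws unless the exit code is zero; kept separate so it is easy to test. */
    public static void requireSuccess(List<String> command, int exitCode) {
        if (exitCode != 0) {
            throw new CommandFailedException(command, exitCode);
        }
    }

    /** Runs the command with inherited stdio and throws if it fails. */
    public static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        requireSuccess(command, process.waitFor());
    }
}
```

With this shape, a test that expects a failure has to say so explicitly by catching `CommandFailedException` and inspecting `exitCode()`; forgetting to check is no longer the silent path.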

I’d argue that the integration tests are our most precious asset. They are what prove that Bazel does what we advertise it does, and they are what allow us to ship releases as “often” as we do.

The tests ought to be of the same quality as the code they are testing or, dare I say, of higher quality, because the integration tests tend to outlive the specific implementation they validate by definition.

Massive amounts of data dependencies

Our integration tests pull in a large amount of data dependencies whether they need them or not. An internal automated analysis tool claims that 54% of the data bytes (on the order of GBs) we pull into the tests go unused at runtime. Unused data dependencies are harmful because they increase the build time of our targets and they increase the execution time of those tests: data dependencies must be transferred to the remote executors and staged on disk.

While this is orthogonal to the issue discussed in this document, I felt compelled to mention it because a rewrite of the testing framework in another language is the perfect chance to tackle this long-running problem.

Why Java and not <favorite-language>?

All of the reasons above show that the shell is a poor language to implement our integration tests and argue that what we have is unsustainable in the long term. If that’s the case, what do we replace the shell with?

The temptation is strong to move to another scripting language such as Python but I’d be extremely wary of going this route. It’s true that Python is a nicer language than shell. However, it’s a different language than the one the team knows at large: we currently only have 30,000 lines in Python vs. a total of 130,000 in shell.

For the same reasons laid out above, the team is currently not proficient in Python. And for those same reasons, it’s neither desirable nor feasible to make the team master the Python language only to write “nicer” integration tests. Trying this approach will drive us to the exact same set of problems we are facing today regarding code quality and maintainability, only in a different language.

What are we left with? Go? C++? Something else? No. The answers are simple:

  1. Let’s use the exact same language we use to develop our product.

  2. Let’s get rid of the mentality that integration tests are “boilerplate”.

  3. Let’s get rid of the mentality that test code is less important than production code.

  4. Let’s use Java, and let’s write a well-engineered framework that we can trust.

With a proper backing framework, the integration tests don’t have to be more complex to write than the existing shell tests and the benefit is that we’ll have dependable code. Furthermore, the development process of the tests will match that of the production code: same IDE, same refactoring abilities, same code inspections, same sanity checks…

And to prove this, I’ll guide you through a case study, through a proof of concept, and through an implementation plan… in the next episode!

  1. Blaze’s total size is confidential, hence the redacted percentage. ↩︎

  2. The major reason behind Blaze having much more shell code than Bazel is that most integration tests have not yet been open-sourced. What I propose in this document should make doing this more feasible. ↩︎