One of the things I’m most proud of about the Rust web services I have written is that I can run their tests with zero setup and within milliseconds, all while staying confident that main can always be shipped to production. I’ve previously touched upon how this all works in other articles, but it’s time for a deep dive.

To make things specific, I’ll be describing the testing infrastructure of EndTRACKER, the EndBASIC Service, and the sample key/value store app of III-IV. These services are all structured in three separate layers, and I’ll be covering the testing strategy for each of them.

But before getting into how each layer is exercised on its way to production, let’s talk about external dependencies… because dependencies are the root of all evil when it comes to the usual poor testing strategies you may encounter.


Interacting with dependencies

Pretty much any web service relies on other services, which I’ll call dependencies. These include databases, queuing systems, distributed storage, remote logging… you name it. The list of dependencies may be long, and their direct use in tests is typically where the friction in testing comes from: most service implementations are unable to stub their dependencies out, so the developers end up having to run the real dependencies to execute any test.

If you have worked on the development of any modern web service, particularly in a corporate environment, you’ve witnessed the issues that running real dependencies causes:

  • You have had to carefully set up your development environment with the right versions of tools and services, wasting hours (or days!) of productive time.

  • You have had to troubleshoot test failures caused by problems in your development environment. Any small deviation from the blessed configuration can lead to mysterious problems and you are on your own to figure them out. “Works on my machine!” is a common excuse to not get involved in solving a coworker’s issue.

  • You have had to rely on overly powerful machines to run the tests because all the dependencies are huge and consume large amounts of RAM and CPU. After all, each dependency assumes it will be running on its own server(s) and is likely written in a language different from all other dependencies, thus requiring its own heavy runtime. Talk about waste, huh.

  • You have had to patiently wait for many minutes every time you run tests while Docker downloads multi-GB images, starts the dependencies, and waits for them to be ready to serve requests after (re)starting from scratch.

  • You have had to suffer from flaky tests because the connections to the dependent services sometimes fail or the state of the dependent services has somehow been polluted by other tests.

  • You have had to witness entire teams being spun up and funded to deal with slow CI runs and flaky tests, all while throwing countless machines at these resource-hungry tests.

These issues are all avoidable with proper upfront care. I’m convinced that most developers want to do the right thing, but: one, many times they don’t know what the right thing even is; and, two, the pressure to “launch and iterate” often comes with an empty promise that there will be time “later” to address past cut corners. And you know all too well that automated tests are often ignored until they become truly necessary—that is, when the product is crashing left and right and customers are threatening to leave—at which point a solid testing infrastructure is non-existent and cannot be easily retrofitted.

A key foundation to avoid these problems is to architect the system in a way that puts all external dependencies behind interfaces from the ground up. These interfaces then let you plug in different implementations of the dependencies such that most tests can skip using the real dependencies. In other words, the key foundation is Dependency Injection (DI). And no, I’m not talking about fancy DI frameworks: all I’m talking about is the very basics of defining interfaces or traits and passing instances of those to constructors and functions.
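
To make this concrete, here is a minimal sketch of what such injection looks like. The names are made up for illustration; they are not the actual III-IV types:

/// Interface to an outgoing email service.  Production code implements this
/// against a real SMTP server; tests implement it with an in-memory recorder.
trait Mailer {
    fn send(&self, to: &str, body: &str) -> Result<(), String>;
}

/// The service receives the implementation via its constructor and only ever
/// talks to the `Mailer` interface, never to a concrete SMTP client.
struct Service<M: Mailer> {
    mailer: M,
}

impl<M: Mailer> Service<M> {
    fn new(mailer: M) -> Self {
        Self { mailer }
    }

    fn notify(&self, user: &str) -> Result<(), String> {
        self.mailer.send(user, "Something happened!")
    }
}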

Now, of course, there is a balance between A) fast and deterministic tests that rely on fake services and B) slow and accurate tests that rely on real services: the more you stub out real dependencies, the less accurate tests become. The idea, though, is to have the choice to pick one or the other on a test-by-test basis depending on the scenario to validate. And to make that choice, the system architecture must be in place to support it from the very beginning. With that in place, you can come up with the best testing strategy for each scenario, and you can choose how much of the test collateral has to run every time you run tests and how much of it can be postponed to PR merge time or nightly runs.

In my services, my goal is to make the vast majority of tests run with a simple cargo test after a git clone. No configuration necessary. A small subset of tests do talk to the real dependencies and require configuration but, while these can run locally, I rarely need to do so because they are automated to run in CI at PR merge time.

Let’s dive into the different layers of the architecture to see how these ideas play out. The layers are one for database access, one for business logic, and one for REST handling. You may want to read “Introducing III-IV” and “MVC but for non-UI apps” beforehand, both of which describe the general architecture of these services.

Database layer testing

My services use PostgreSQL in production. While setting up a local instance of this database is not difficult and it doesn’t consume any meaningful resources when idle, it’s still far from the zero setup experience I strive to achieve. So the first thing I had to do was hide the database queries behind an interface. The basic building blocks look like this and can be found in the iii_iv_core::db module:

/// Abstraction over the database connection.
#[async_trait]
trait Db {
    /// Type of the transaction wrapper type to generate.
    type Tx: BareTx + Send + Sync + 'static;

    /// Begins a transaction.
    async fn begin(&self) -> DbResult<Self::Tx>;
}

/// Common operations for all transactions.
#[async_trait]
trait BareTx {
    /// Commits the transaction.
    async fn commit(mut self) -> DbResult<()>;
}

The Db trait exposes a generic mechanism to open a transaction against a database via its begin method. The returned transaction type is parameterized on Db::Tx, which has to be a subtrait of BareTx. In turn, BareTx represents the common operations one can do with a generic transaction but does not have any domain-specific knowledge.

Each web service is responsible for supplying its own transaction trait that extends BareTx with the operations that make sense in its domain. For example, here is how the sample key/value store service that ships with III-IV exposes the database operations needed to implement the key retrieval and storage operations. Note that the upstream code uses the name Tx for this trait, but I’ve renamed it to KVStoreTx in this text for clarity:

#[async_trait]
trait KVStoreTx: BareTx {
    /// Gets the current value of the given `key`.
    async fn get_key(&mut self, key: &Key) -> DbResult<Entry>;

    /// Sets `key` to `entry`, which includes its value and version.
    async fn set_key(&mut self, key: &Key, entry: &Entry) -> DbResult<()>;

    // ... and several more ...
}

There is nothing in these interfaces that points to database-specific behavior, which is intentional. The only thing that client code is allowed to do is create a transaction and call the business-specific methods on it, without knowing what the transaction is talking to. Going back to the example above, this snippet would fetch the value of a key from the key/value store:

let mut tx = db.begin().await?;
let value = tx.get_key(key).await?;
tx.commit().await?;

With the service-specific transaction type in place (KVStoreTx), the service is also responsible for supplying separate implementations of it for all databases the service wishes to support. As mentioned earlier, this means providing a variant for PostgreSQL for production usage. But what about tests? Tests could use their own database-less implementation—for this trivial example, a HashMap would suffice—but going this route becomes tricky once you want to reproduce more realistic OLTP database behavior, especially when concurrent operations take place. The other obvious alternative is to use SQLite: a real database that requires zero configuration, which fits the bill perfectly for unit tests.
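
To give an idea of how little setup that is: opening a throwaway, in-memory SQLite database with sqlx takes a handful of lines. This is not necessarily what iii_iv_sqlite::testutils::setup does internally; it’s just to show the general idea:

let pool = sqlx::sqlite::SqlitePoolOptions::new()
    // Every `:memory:` connection is a separate database, so cap the pool at one.
    .max_connections(1)
    .connect("sqlite::memory:")
    .await
    .unwrap();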

As a result, I end up with the following types in the system:

  • A generic PostgresDb (provided by iii_iv_postgres) and a service-specific PostgresKVStoreTx for production.
  • A generic SqliteDb (provided by iii_iv_sqlite) and a service-specific SqliteKVStoreTx for tests.
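
To give a flavor of the duplication these two implementations imply, here is a rough sketch of what the two get_key variants might boil down to with sqlx. The table layout and error handling are made up; the real code maps errors into DbResult and deals with entry versioning:

use sqlx::Row;

/// PostgreSQL flavor: `$N` placeholders, talks to a server-side database.
async fn get_key_postgres(
    conn: &mut sqlx::PgConnection,
    key: &str,
) -> Result<String, sqlx::Error> {
    let row = sqlx::query("SELECT value FROM store WHERE key = $1")
        .bind(key)
        .fetch_one(conn)
        .await?;
    row.try_get("value")
}

/// SQLite flavor: `?N` placeholders, talks to an in-process database.
async fn get_key_sqlite(
    conn: &mut sqlx::SqliteConnection,
    key: &str,
) -> Result<String, sqlx::Error> {
    let row = sqlx::query("SELECT value FROM store WHERE key = ?1")
        .bind(key)
        .fetch_one(conn)
        .await?;
    row.try_get("value")
}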

Implementing the same database queries against two different database systems is annoying indeed, but forcing myself to do this keeps me honest in maintaining true abstractions. However, it is critical that these implementations behave as similarly as possible and, to guarantee this, I write extensive unit tests in a separate db/tests.rs file. These tests look like this:

async fn test_simplified_get_after_set<D>(db: D)
where
    D: Db,
    D::Tx: KVStoreTx,
{
    let mut tx = db.begin().await.unwrap();

    let key = Key::new("the-key".to_owned());
    let entry =
        Entry::new("insert".to_owned(), Version::from_u32(1).unwrap());
    tx.set_key(&key, &entry).await.unwrap();
    assert_eq!(entry, tx.get_key(&key).await.unwrap());

    tx.commit().await.unwrap();
}

As you can see, each test is parameterized on a D type. The D type is an implementation of the Db trait presented earlier, whose only purpose is to yield new transactions of its inner D::Tx type based on a pre-established connection. The D::Tx type is mapped to the domain-specific KVStoreTx type so that tests have access to the primitives to be tested. Notably, though, the tests have no way of knowing which database they are talking to.

With these generic tests in place, the question is: how are they executed against the individual database implementations? The db/postgres.rs and db/sqlite.rs modules of each service define #[test] entry points for each test. These entry points are thin wrappers for the common test code in db/tests.rs and their sole purpose is to establish a connection to the database and then delegate to the test implementation. Basically, each wrapper looks like this:

/// This is the specialization of a test for SQLite.
#[tokio::test]
async fn test_simplified_get_after_set() {
    // Create a connection to the SQLite in-memory database.
    let db = iii_iv_sqlite::testutils::setup::<SqliteKVStoreTx>().await;

    // Delegate to the common test code.
    crate::db::tests::test_simplified_get_after_set(db).await
}

/// This is the specialization of a test for PostgreSQL.
///
/// Note how the test is marked `ignore`.  We'll see why that is later on.
#[tokio::test]
#[ignore = "Requires environment configuration and is expensive"]
async fn test_simplified_get_after_set() {
    // Create a connection to PostgreSQL using the configuration specified via
    // environment variables.  The connection is set up to use a temporary
    // schema so that tests are isolated from each other and don't leave garbage
    // behind.
    let db = iii_iv_postgres::testutils::setup::<PostgresKVStoreTx>().await;

    // Delegate to the common test code.
    crate::db::tests::test_simplified_get_after_set(db).await
}

For a long while, this is actually what the test wrappers looked like and… they were written by hand. At some point, I grew tired of copy/pasting these snippets over and over again and invested a wee bit of time learning how to leverage macros to cut down the repetition. It wasn’t as difficult as I imagined. You can see how this works in practice in the sample key/value store tests and their instantiation for SQLite.
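
The macros are nothing fancy either. A declarative macro along these lines is enough to stamp out the per-database wrappers; this is a simplified sketch, not the actual III-IV macro:

/// Generates the SQLite-backed wrappers for a list of shared database tests.
/// A sibling macro (not shown) would do the same for PostgreSQL, also adding
/// the `#[ignore]` attribute to each generated test.
macro_rules! sqlite_db_tests {
    ( $( $name:ident ),+ $(,)? ) => {
        $(
            #[tokio::test]
            async fn $name() {
                let db = iii_iv_sqlite::testutils::setup::<SqliteKVStoreTx>().await;
                crate::db::tests::$name(db).await
            }
        )+
    };
}

// One line per shared test instead of one handwritten wrapper per test.
sqlite_db_tests!(
    test_simplified_get_after_set,
    // ... more test names ...
);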

Driver layer testing

Let’s jump one level up and look at the testing approach for the driver layer.

The driver layer of each service typically exposes a single Driver type. The Driver maintains the state of the application and provides entry points for all REST operations, usually with a 1:1 mapping between REST API and driver method.

To instantiate a Driver, all service dependencies are injected at creation time. Here is what the Driver constructor looks like for the EndTRACKER data plane service, which is more interesting to analyze than the driver for the sample key/value store:

pub(crate) fn new(
    db: D,
    clock: C,
    geolocator: G,
    abuse_policy: A,
    queue_client: Client<BatchTask, C, QD>,
) -> Self {
    Self { db, clock, geolocator, abuse_policy, queue_client }
}

See? A trivial constructor that does no work, as it shall be done. Neat… but what are all these type parameters? These type parameters are what allow injecting the different implementations of each dependency into the service for testing purposes.

Now, why are they type parameters? Simply because I wanted to try using static dispatch, and… things have gotten unwieldy. All references to the Driver type in impl blocks look like this awful chunk:

impl<A, C, D, G, QD> Driver<A, C, D, G, QD>
where
    A: AbusePolicy<D::Tx> + Clone + Send + Sync + 'static,
    C: Clock + Clone + Send + Sync + 'static,
    D: Db + Clone + Send + Sync + 'static,
    D::Tx: DataTx + From<D::SqlxTx> + Send + Sync + 'static,
    G: GeoLocator + Clone + Send + Sync + 'static,
    QD: Db + Clone + Send + Sync + 'static,
    QD::Tx: ClientTx<T = BatchTask> + From<QD::SqlxTx> + Send + Sync + 'static,
{
    // ...
}

If I had to write this monstrosity just once, it could be tolerable. But because I split the implementation of the Driver across different files to keep them short… this chunk is repeated across many files and keeping them in sync is a humongous hassle. I’m… not happy. Fear not though: the alternative is to use dynamic dispatch, that is, Arc<Mutex<T>> everywhere with T being a type alias over the trait object, which keeps the noise down significantly. Mind you, I used to do this and I’m not sure the switch to static dispatch was worth it. But I digress…
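
For reference, the dynamic dispatch variant looks roughly like this (a simplified sketch with placeholder traits, and leaving out the Mutex wrapping): the trait objects hide the type parameters, so the impl blocks need no where clauses at all.

use std::sync::Arc;
use std::time::SystemTime;

/// Placeholder stand-ins for the real dependency traits.
trait Clock {
    fn now(&self) -> SystemTime;
}
trait GeoLocator {
    fn locate(&self, ip: &str) -> Option<String>;
}

// Type aliases over the trait objects: the generics vanish from `Driver`.
type DynClock = Arc<dyn Clock + Send + Sync>;
type DynGeoLocator = Arc<dyn GeoLocator + Send + Sync>;

struct Driver {
    clock: DynClock,
    geolocator: DynGeoLocator,
    // ... db, abuse_policy, queue_client ...
}

impl Driver {
    // No `where` clause to repeat across every file that extends `Driver`.
    fn new(clock: DynClock, geolocator: DynGeoLocator) -> Self {
        Self { clock, geolocator }
    }
}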

Either way, because every dependency is reached through one of these abstractions, I can instantiate a Driver and back it with different implementations of each dependency. For example:

  • The db can be backed by PostgreSQL in production and SQLite in tests as I have already covered in the database layer section.
  • The clock can be backed by a SystemClock that returns the system time, and also by a MonotonicClock that exposes fake (and deterministic!) time.
  • The geolocator can be backed by an AzureGeoLocator that talks to Azure Maps, and also by a MockGeoLocator that returns pre-configured results and errors.

… and similarly for any other resource needed by the Driver.
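
The clock is the simplest of these pairs to picture. A simplified sketch of the idea (the real Clock trait is async and richer than this) could be:

use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified clock interface.
trait Clock {
    /// Returns the current time as seconds since the Unix epoch.
    fn now_secs(&self) -> u64;
}

/// Production clock: reads the system time.
struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .expect("System time is before the Unix epoch")
            .as_secs()
    }
}

/// Test clock: starts at a known instant and advances by one second on every
/// query, so time-dependent logic becomes deterministic.
struct MonotonicClock {
    now: AtomicU64,
}

impl MonotonicClock {
    fn new(start: u64) -> Self {
        Self { now: AtomicU64::new(start) }
    }
}

impl Clock for MonotonicClock {
    fn now_secs(&self) -> u64 {
        self.now.fetch_add(1, Ordering::SeqCst)
    }
}

With the fake, tests that exercise time-dependent logic can assert on exact timestamps instead of sleeping and hoping.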

This, once again, allows: writing super-fast non-flaky unit tests because they do not reach out to real resources; running the tests with zero configuration; and avoiding the need to spawn resource-hungry dependencies on the local machine.

So what do tests in the driver layer actually test? These tests are mostly responsible for validating the business logic. They cover all happy paths but, critically, they also cover all error paths I can think of—something that’s made trivial by the use of fake dependencies. These tests do not cover any HTTP interactions though; for those, we have to move up one layer.
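
To show the shape of such a test, here is an entirely hypothetical, stripped-down sketch of exercising an error path by configuring a fake; the real Driver, its traits, and its error types are async and far richer than this:

/// Simplified geolocator interface.
trait GeoLocator {
    fn locate(&self, ip: &str) -> Result<String, String>;
}

/// Test double that returns whatever result it was configured with.
struct MockGeoLocator {
    result: Result<String, String>,
}

impl GeoLocator for MockGeoLocator {
    fn locate(&self, _ip: &str) -> Result<String, String> {
        self.result.clone()
    }
}

/// Minimal driver that tags a visit with its country of origin.
struct Driver<G: GeoLocator> {
    geolocator: G,
}

impl<G: GeoLocator> Driver<G> {
    fn record_visit(&self, ip: &str) -> Result<String, String> {
        self.geolocator.locate(ip)
    }
}

#[test]
fn test_geolocation_failures_are_propagated() {
    let driver = Driver {
        geolocator: MockGeoLocator { result: Err("quota exceeded".to_owned()) },
    };
    assert_eq!(Err("quota exceeded".to_owned()), driver.record_visit("1.2.3.4"));
}

Making the geolocator fail is a one-line configuration of the fake; getting Azure Maps to fail on demand from a test is somewhere between painful and impossible.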

REST layer testing

The REST layer is the one interfacing with the user of the web services via the network. This is the layer where requests are deserialized, validated, routed to the driver, and where responses or errors are serialized back to the user with the correct HTTP status codes.

This layer is currently written using the axum web framework, whose fundamental building block is the Router. Each web service creates a new Router and registers all API endpoints plus an instance of the Driver that gets passed to the API handlers as a state parameter. Take a look at the sample key/value store router creation.

Because the Driver is injected into the REST Router, it can be parameterized with all the non-production dependencies as described earlier—and it is. Now, the question is: what do the tests of the REST layer look like and what do they do?

For these, I used to spawn a local instance of the HTTP server, listening on a random unused port, and then made the tests call the HTTP endpoints over the loopback interface with the reqwest crate. Once I moved from warp to axum, things improved: I could start relying on the one-shot testing feature exposed by this framework, which allows calling the router endpoints without going through the network. Not a revolutionary change, but a nice improvement in simplicity indeed.
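
Under the covers, this is just tower’s ServiceExt::oneshot applied to the axum Router. Stripped of my helper types, the bare mechanism looks like this toy example:

use axum::{body::Body, http::{Request, StatusCode}, routing::get, Router};
use tower::ServiceExt; // Provides `oneshot`.

#[tokio::test]
async fn test_oneshot_mechanism() {
    // A toy router; the real tests build the full service Router instead.
    let app = Router::new().route("/ping", get(|| async { "pong" }));

    // Drive a single request through the router without opening any sockets.
    let response = app
        .oneshot(Request::builder().uri("/ping").body(Body::empty()).unwrap())
        .await
        .unwrap();

    assert_eq!(StatusCode::OK, response.status());
}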

To test this layer, I apply the builder pattern to define test scenarios. With this idiom, I can capture the parameters to an API call and the expectations of what it should return in a declarative manner. Here is one example of a test for the “put key” operation of the sample key/value store:

fn route(key: &str) -> (http::Method, String) {
    (http::Method::PUT, format!("/api/v1/keys/{}", key))
}

#[tokio::test]
async fn test_create() {
    let context = TestContext::setup().await;

    let response = OneShotBuilder::new(context.app(), route("first"))
        .send_text("new value")
        .await
        .expect_status(http::StatusCode::CREATED)
        .expect_json::<Entry>()
        .await;
    let exp_response = Entry::new("new value".to_owned(), Version::initial());
    assert_eq!(exp_response, response);

    assert_eq!(exp_response, context.get_key("first").await);
}

In this test, TestContext is a container that helps set up the Driver with fake dependencies and instantiates the Router around it. The most interesting part is the use of my own OneShotBuilder, which implements the builder pattern for one-shot calls. With this at hand, the test says that it has to send a specific text document to a specific PUT endpoint and then expects that the HTTP API call returns a CREATED status code with a valid JSON response of type Entry. Finally, the context.get_key call is a helper method that pokes directly into the database to see if the key is set, which validates the side-effects of the API call on persistent storage.

The tests in this layer are responsible for validating anything that’s specific to the interactions with the user over HTTP, but these tests assume that both the driver and database layers work correctly. This is why these tests do not validate in excruciating detail all the possible corner cases that we can face in the driver or its interactions with external dependencies: the driver tests have that responsibility.

Fidelity problems

Alright, so that’s the majority of the current testing approach. We have seen how the foundational database layer is architected to support dual implementations via PostgreSQL and SQLite, how other supporting services are modeled with the same duality, and how the driver and REST layers leverage the in-memory / fake implementations to provide logical test coverage at all layers. As is, these provide very good coverage of the functionality of the web services and give me almost full confidence that main is release-quality at any given time.

But there are still some risks.

The major risk in using SQLite for tests vs. PostgreSQL for production is that they are very different databases. Sure, they are both OLTP SQL databases, but their SQL languages are distinct dialects and SQLite is in-process whereas PostgreSQL runs on a server. Dealing with slightly-different SQL queries is easy because the differences are obvious, but there are subtle behavioral differences that affect how calls respond, especially under error conditions. For example: you will never experience a “maximum connections reached” error with SQLite, but you surely will with PostgreSQL. Similarly, SQLite might give you trouble with concurrent writes while PostgreSQL won’t. There are also risks when replacing the clock with a fake one, or when replacing other services such as the Azure Maps or SMTP clients with stub implementations.

Now, you’d say: “Well, you are facing these fidelity issues because you don’t test against the real thing, duh. If all your tests used the real dependencies, then you’d be fine!” Except… that’s not how testing works. When writing a test, you can do two things:

  • You can write tests for the “known knowns”: the happy and failure paths that you know can happen.

  • You can write tests for the “known unknowns”: the scenarios you think might happen but for which you have no good answers and you need to discover what their behavior is.

These are easy to model in tests, and if you are writing these tests, then you can make sure your real and fake dependencies behave in the same way.

But there is another class of failures that you cannot test for with unit tests: the “unknown unknowns”. These are the situations you do not anticipate, and because you do not anticipate them, you cannot write tests for them. It doesn’t matter whether you are using real dependencies or fake ones: if you cannot imagine these scenarios, no test will cover them. And this is where I have encountered interesting bugs in production before.

Real system testing

Thus, even though most of the testing I do in the web services is fast and requires no setup, there is still a need to validate “the real thing”: that is, the service talking to the real dependencies under real world conditions and usage.

To accomplish this, I do two things.

The first is to write tests that actually talk to the real services (oops). These tests all require manual configuration and are marked with #[ignore] as we saw earlier so that a cargo test won’t pick them up by default. The CI jobs are configured to supply the right settings for these tests, and the PR merge checks forcibly run these ignored tests. It is also possible to run these tests locally by manually configuring the environment in a config.env file and using a trivial test.sh script that hooks things up with cargo test but, as said earlier, I rarely have to do so.

The second is to deploy to a staging environment and do manual testing on it. Every commit merged into main gets automatically deployed to a staging instance of the service (which is made easy by Azure Functions’ slot feature), and I do some manual validation that things work. I could automate this testing, of course, but it is something that can still wait.

Is this enough?

I know this is an overly simplistic view of the world, and that this testing approach can let some subtle bugs slip through. It has actually happened before. But thanks to this testing approach—and Rust’s type system, whose help cannot be overstated—every new feature I have launched has worked on the first try and the web services have kept happily chugging along over the years. This is critical to me because these web services are just side projects of mine, so I must ensure they cause me the least trouble possible in production.

Finally, let me clarify one thing: I’ve been talking about “unit tests” throughout this post but, if we want to be pedantic, almost nothing of what I described are pure unit tests. Every test at every layer relies on the layers below it to behave correctly: the service’s own code is never stubbed out so, for example, a test for the REST layer will run code in the driver and database layers. The only things that are stubbed out are the connections to external services. I believe this style of testing provides much more realistic scenarios at the expense of making the tests more susceptible to breakage when the code changes.

And that’s it for today. If you liked this post, you may also enjoy “Unit-testing a console app (a text editor)” from over 2 years ago. That’s when I came up with the idea of using the builder pattern to define tests, and the idea still proves very useful to this day.