My January links recap included the “Phantom Types” article by David Soria Parra. In it, the author briefly touches upon the “new type” idiom, its typical implementation in Rust, and then proceeds to propose a better alternative. But the question arises: why should you care?

To answer why this idiom is useful, I want to present you with a real production problem we faced in the Storage Infrastructure team at Google circa 2010. That issue made me a convert and I’ve kept it in mind when designing APIs since then.

What is “new type” anyway?

The “new type” idiom is a programming pattern whereby you define domain-specific types to wrap native types. Then, you use those types in your code and only “lower” them to their underlying types when passing the values to any external APIs.

More specifically, your public interfaces—but also private methods and functions—should never receive primitive types like integers or strings. A function like allow_write(username: String, size: usize) should not exist in your code base.

Instead, what you’d do is define two new types, Username and Size, that wrap the string and integer values. These types are trivial wrappers over the native types (e.g. struct Username(String) and struct Size(usize)) with the goal of making them zero-cost abstractions.

However, as David’s article describes, this simplistic way to implement the pattern lends itself to code duplication for every type. And, as you know, code duplication leads to slight inconsistencies over time, which may increase maintenance costs.

But this is not relevant now. “New type” is “new type” no matter how you implement it and no matter which language you choose. What I want to showcase here is why adopting this pattern helps at all. And, for that, it’s time to dive into the story.

Quota management outage

Back in the early days of the Storage Infrastructure at Google, we the SRE team managed tens of shared file system deployments (aka cells). Those file systems were multi-user and had support for quotas—like pretty much any file system worthy of consideration.

Storage quotas are typically expressed as two quantities: a bytes quota, which says how much disk space a user can use; and a files quota, which says how many files a user can store. Tracking byte usage limits disk usage and tracking file usage limits metadata overhead.

Now, even with the replicated and zonal system that Google had, it was convenient to manage user quotas in a centralized way. So that’s what we also had: the equivalent of a MySQL database tracking how much quota each user had allocated in which cell, world-wide.

So far so good. But how do you keep the decentralized storage cells decoupled from the quota database? Easy! Via a cron job that reads the quotas and pushes them to cluster-local “configuration” files so that each cell doesn’t need to know about the central database.

Of course, this cron job is code. And code has bugs. So here is what happened: there was some function somewhere that looked like this: set_quota(user: String, bytes: usize, files: usize). And, for whatever reason, a caller invoked it as set_quota(user, files, bytes).

Notice the problem? The caller swapped the arguments but the language did not flag this issue and… the tests didn’t catch it either! It’s unfortunately too common for people to write the same sentinel values (set_quota("foo", 100, 100)) for different test arguments.

Soon after this change was checked in, the code rolled out to production and… of course, the moment it touched the first canary deployment, things went south. Users started receiving out of quota alerts even when they should have had, in theory, plenty of available quota.

A blog on operating systems, programming languages, testing, build systems, my own software projects and even personal productivity. Specifics include FreeBSD, Linux, Rust, Bazel and EndBASIC.

0 subscribers

Follow @jmmv on Mastodon Follow @jmmv on Twitter RSS feed

The aftermath

The fix to the above was trivial: swap the arguments in the problematic caller and call it a day. But that’s a naïve fix. It just corrects the immediate problem but does not minimize the chances of it happening again. That is: this change is a mitigation, not a fix.

In good SRE fashion, the team went a bit further to prevent this issue from ever recurring. They adopted the “new type” pattern, created two separate types—BytesQuota and FilesQuota—and plumbed them through the whole program. And note: Rust wasn’t really a thing in 2010.

Furthermore, the team contributed an entry to the Testing on the Toilet blog, describing the issue and how this same problem had just bitten a major financial company and had made them lose millions of dollars.

And that’s the simple story of the day. Please adopt the new type pattern. Future-you will be happy that you did when the compiler or language interpreter yells at you about something that’s obviously wrong.