Having a fast and responsive app is orthogonal to “knowing your big Os”. Unfortunately, most tech companies over-emphasize algorithms in interviews and downplay systems knowledge, and I believe that’s one reason behind sluggish apps and bloated systems. I’ve seen this play out repeatedly. Interviewers ask a LeetCode-style coding question, which is then followed by the ritual of discussing time and memory complexity. Candidates ace the answers. But then… their “real” code suffers from subtle yet impactful performance problems. Focusing on big O complexity rarely matters in most apps. Sure, it’s important to think about your algorithmic choices, but there are so many more details to worry about that have a direct impact on app performance and responsiveness. Let’s look at a bunch of them!
“Fast machines, slow machines”… ah, the post that spawned this series. As I frantically typed that article while replying to angry tweets, the thought came to mind: software engineering as a whole is hyper-focused on lowering the costs to write new code, yet there is a disregard for the costs that these improvements bring to other disciplines in a company or even to end users. So, in this series finale, I want to compare how some choices that apparently lower development costs actually increase costs elsewhere. I also want to highlight how, if we made different decisions during development, we could possibly expose those extra costs early on. This is beneficial because exposing costs upfront allows us to make tough choices when there is still a chance of changing course. To make things specific, I will look at how the use of modern frameworks that facilitate development can end up hurting performance, reliability, and usability. So let’s start with a three-part rant first (sorry) and then let’s look at what we might do.
In the previous post, I proposed that certain engineering practices expose systemic costs and help with planning while other practices hide those same costs and disturb ongoing plans. The idea I’m trying to convey is hard to communicate in the abstract so, in that post, I used the differences between a monorepo and a multirepo setup as an example. Today, I’ll explore a different scenario to support the same idea. I’m going to talk about how certain ticket assignment practices during on-call operations can expose service support costs vs. how other practices hide them. Keep in mind that, just like in the previous post, I do not want to compare the general merits of one approach vs. the other. The only thing I want to compare is whether one approach centralizes toil and allows management to quantify its cost vs. how another approach hides toil by smearing it over the whole team in hard-to-quantify ways. Whether management actually does something to correct the situation once the costs are exposed is a different story.
Being part of an on-call rotation is a requirement for many software job positions and fulfilling this requirement should not come with stress. One process that can cause friction is how a team schedules its on-call shifts. If the on-call scheduling process is haphazard, team members will end up with shifts over their personal plans, and they’ll be unhappy. But there are ways to prevent this, which is what this post is about.
A recent tweet that caught my attention read: “principal engineers should be on-call”. Of course they should be! I’m “surprised” they aren’t everywhere, but I can imagine some reasons to justify their situation. Let’s change that in this thread. 🧵 👇
In a previous thread, I covered some techniques to approach on-call shifts and maintain your own well-being. In this thread, I will touch upon the things you can do, as a team, to make your service more sustainable. 🧵 👇
Last week, I was first-time on-call for a part of Azure Storage. My previous background as an SRE at Google helped me remain calm despite my inexperience. And as we have more first-time on-callers joining soon, I couldn’t resist writing some advice for them. Let’s start! 🧵
Say you want to copy a large collection of files to a file server on your same network. What’s the fastest way to do this initial copy? Physically attaching the drive to the server? Maybe, but will the file systems be compatible? What about using the network? If so, which protocol? Read on for more details and how tar plus Netcat delivered the best results.
You probably know that software rewrites, while very tempting, are expensive and can be the mistake that kills a project or a company. Yet they are routinely proposed as the solution to all problems. Is there anything you can do to minimize the risk? In this post, I propose that you actively improve the old system to ensure the new system cannot make progress in a haphazard way. This forces the new system to be designed in such a way that delivers breakthrough improvements and not just incremental improvements.
There are two ways to handle abnormal conditions in a program: errors and assertions. Errors are a controlled mechanism by which the program propagates details about a faulty condition up the call chain—be it with explicit error return statements or with exceptions. Errors must be used to validate all conditions that might be possible but aren’t valid given the context. Examples include: sanitizing any kind of input (as provided by the user or incoming from the network), and handling error codes from system calls or libraries.
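As a rough illustration of the distinction, here is a minimal Python sketch. The names `InvalidPortError`, `parse_port`, and `describe` are hypothetical and do not come from the post. The sketch also assumes, which the excerpt above does not spell out, that assertions are reserved for conditions that can only fail because of a bug in the program itself.

```python
class InvalidPortError(ValueError):
    """Hypothetical error type for user-supplied values that are not valid TCP ports."""


def parse_port(raw: str) -> int:
    """Sanitize untrusted input; problems are reported as errors to the caller."""
    try:
        port = int(raw)
    except ValueError as e:
        # Possible but invalid input: propagate details up the call chain.
        raise InvalidPortError(f"not a number: {raw!r}") from e
    if not 1 <= port <= 65535:
        raise InvalidPortError(f"out of range: {port}")
    return port


def describe(port: int) -> str:
    """Classify a port that the caller has already validated."""
    # Assumption: an invalid value here means the caller skipped parse_port,
    # which is a programming bug rather than bad input, hence an assertion.
    assert 1 <= port <= 65535, "port must be validated by parse_port first"
    return "privileged" if port < 1024 else "unprivileged"
```

A caller would wrap `parse_port` in a `try`/`except InvalidPortError` block and show the message to the user, whereas a failing assertion in `describe` should be treated as a defect to fix in the code.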