Flying Squirrels: A Survival Guide for New Engineering Managers (Multi-Part Series, Part 7)
Matt Simons
Technology Leader, Keynote Speaker, Overanalyst, Self-Appointed Chancellor of Awful Metaphors, and French Toast Enthusiast
This is part 7 of a multi-part series. If you missed part 6, you can find it at the link above!
Constraints
A good friend and former coworker loves to say that "engineering without constraints is just play."
Managing engineers can be fun and rewarding, but it is much closer to engineering than it is to play. There are constraints. I was once told that every problem we encounter in life is an exercise where we are allowed precisely one optimization. Everything else is a constraint.
When building teams, sustained (and sustainable) performance is the optimization. Humans are the platform. In this context, it's important that we understand the constraints of the platform. We've touched on them a bit in the discussion around Conway's Abstraction, but let's dive into them in greater detail now.
The Nodes are Unreliable
We generally speak about node reliability in terms of availability. The idea that nodes can spontaneously fail or drop from the cluster has pretty readily understood analogues in the human system. We tend to staff our teams with some amount of redundancy to deal with unexpected dips in availability. Sick days, paid time off in all its various forms, and family and medical leaves of absence are commonly encountered and commonly planned for in our organizations. While we tend to prepare for it less consistently, the idea of irreplaceable node loss is also not foreign to us. We speak about lone repositories of tribal knowledge in terms of our "bus number". If you're not familiar with that phrase, the idea roughly translates to "the number of people within a given sphere of competency who could get hit by a bus before the organization would be screwed." As an example, if a critical part of your application has a bus number of one, only one person has the knowledge or competency to maintain or run that part of the product, and if that one person gets hit by a bus, you're in trouble.
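If you want a concrete, if slightly tongue-in-cheek, way to reason about that, here's a minimal sketch in Python. The components and names are entirely invented, and the knowledge map is something you'd have to build by actually talking to your team; the point is just that a bus number is a count of who holds the knowledge, and anything sitting at one is a single point of failure.

```python
# A minimal sketch of the "bus number" idea. The components and people below
# are hypothetical placeholders, not a real org chart.
from typing import Dict, Set

# Hypothetical: which people hold working knowledge of each area.
knowledge_map: Dict[str, Set[str]] = {
    "billing-service": {"priya"},
    "search-index": {"priya", "dave"},
    "deploy-pipeline": {"dave", "sam", "lee"},
}

def bus_number(component: str, knowledge: Dict[str, Set[str]]) -> int:
    """How many people could get hit by a bus before nobody is left
    who understands this component."""
    return len(knowledge.get(component, set()))

for component in knowledge_map:
    n = bus_number(component, knowledge_map)
    flag = "  <-- single point of failure" if n == 1 else ""
    print(f"{component}: bus number {n}{flag}")
```

In practice, the hard part is keeping the knowledge map honest, not computing the count.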
Availability is important, but it's not the only facet of reliability that defines this constraint we have to account for. Perhaps more important and less discussed is the constraint around node consistency. Humans are notoriously inconsistent. We could talk about that lack of consistency in colloquial terms by framing it as a lack of follow-through on the things we say we'll do. We could also see human inconsistency in all the weird, quirky things we do, but I want to stay focused on parallels to computational and engineering problems. Accordingly, the definition of node consistency that we will employ in this discussion looks like this:
Group consistency is achieved when the same instruction set can be processed by multiple nodes and result in identical outcomes. Individual consistency is expressed as the ability for a single node to process the same instruction set multiple times and achieve identical results.
"Buy two loaves of bread, and if they have avocados, buy three", is an oft-joked about instruction set that could result in someone coming home with two loaves of bread and three avocados, or three loaves of bread and zero avocados, depending on the shopper. Understandably, you're probably thinking "well, that's a stupid example because the problem there is the quality of the communication", but increasing the quality of the communication is a coping mechanism for dealing with node consistency issues. It's a great example because it shows that node consistency issues are so prevalent and our knowledge of them is so ingrained that we don't even think about node consistency as the issue here, and instead immediately place blame on the individual who failed to employ the appropriate mitigation strategy for node consistency issues that we know are present in the system.
Group consistency is easier to deal with in some ways, because groups are generally consistent in the aggregate. Individual consistency is a more difficult problem, because humans are neither closed nor static systems. Even a simple communication with almost no room for ambiguity, something like "I love you" spoken to someone with whom you share a close relationship, has the capacity to elicit varied responses depending on contextual and environmental variables.
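To stretch the toy example one step further (and to be clear, this is a joke sketch with made-up state, not a model of anyone's relationship), individual inconsistency looks like a node whose output depends on internal state the caller never gets to see, and that state drifts between invocations.

```python
# The same "node" processing the same input can return different results,
# because its response depends on context the instruction set never mentions.
# All of the state below is invented for illustration.
import random

class HumanNode:
    def process(self, message: str) -> str:
        # Contextual/environmental variables drift between invocations.
        hours_of_sleep = random.choice([4, 6, 8])
        had_coffee = random.random() > 0.3
        if message == "I love you":
            if hours_of_sleep < 6 and not had_coffee:
                return "mmhm."
            return "I love you too."
        return "huh?"

node = HumanNode()
# Same input five times, no guarantee of the same output.
print([node.process("I love you") for _ in range(5)])
```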
Humans and groups of humans lack consistency. This poses significant challenges to achieving consistent results if your platform is made of humans.
The Execution Environment is Unpredictable
The day is February 28th, 2017. It's 9 in the morning and you're at the office. You're still dragging a bit because you've only been in the office for like 30 minutes and you spent the first 20 of that getting coffee. It took longer because Dave was about 15 steps ahead of you walking towards your favorite espresso machine and you really didn't want to get dragged into another one of his trademark 30-minute diatribes about how "cryptocurrency is totally the future, man." Fucking Dave. So you did the only sensible thing and turned around to get your caffeine from the other end of the building, but you're a little miffed you had to do that. As you get logged in and situated, you see that monitoring for the billing system has flagged a minor issue that's affecting performance.
No bigs.
There's a playbook for the issue, because of course we've seen this problem before and of course we haven't actually fixed the root cause and of course the sensible thing is to just keep manually kicking servers whenever the issue manifests, which is like way too goddamn often. You sigh, sip your caffeinated beverage of choice and close your eyes, trying to force some small moment of zen.
Alright, break's over. Let's bounce some servers.
You start running through the series of operations and get to where you can now enter the command to terminate the affected hosts, which will cause the autoscaling policies to kick in and bring up some fresh replacements.
Just a little more tippity tappity and there... we... go.
Should just take a few minutes for things to come back and we should see performance metrics start to normalize once the new boxes chew through the request backlog.
Within seconds, alarm bells for almost every major service start going off.
Shit.
You feel a pit forming in your stomach so deep it's like a black hole in your chest. Fuck. Oh god oh god oh god. Did I do that? Oh please tell me that wasn't me. It couldn't have been. I was careful, right? I just followed the playbook, right? Shit shit shit shit shit. I can prove it wasn't me. I'll just look back through what I entered... yeah... that's right... that's right.... I did that one just like I was supp--FUCK!
Oh god it was me. I did it. Goddammit, Dave. This is all your fault for not letting me get my usual coffee. Fucking crypto.
...
This is how I imagine the infamous Amazon S3 outage of 2017 went down. It was the day the internet stood still. As we now know, a monstrously huge chunk of the internet was hosted entirely in one AWS region, and temporarily losing the persistent storage layer was kind of a big deal. Not everything was borked, but a truly, surprisingly large percentage of popular services discovered a critical reliance on one specific region of S3 that day. Amusingly, even the service health dashboard that AWS used to communicate the up/down/degraded status of its services to customers was broken. It was like going to check on the canary in the coal mine only to discover you'd gone entirely blind.
What's most important for this conversation, though, is that for many companies, small and large alike, February 28th, 2017 was the day they learned an important lesson about the unpredictability of execution environments. It probably wasn't the first time they'd been exposed to the concept, but for many, the day of the S3 outage was the day their understanding went from academic to experiential. It was a visceral reminder that the systems we build are subject to major disruption at almost any time. In this case, the source was a fat-fingered command entered by a developer, but it could have just as easily been a backhoe enthusiast convention, a solar flare, a tsunami, or the first salvo of an international armed conflict. Any of those things could happen, and all of them could present disruptions, not just to our products, but to our organizations as well.
It's easy to forget when we talk about these kinds of events in the abstract, but on the other side of all the disruption caused to services were individuals. And sure, some of those disruptions were minor. Someone couldn't listen to music. Someone couldn't watch a favorite show. Someone couldn't buy an overpriced handbag. But there were others who experienced the outage very differently. Someone couldn't pay a bill online and it ended up going to collections, hurting their credit and setting back their plans for home ownership. Someone else couldn't look for a job during the one break they had that day between other responsibilities. Someone else wasn't able to have a video call with their sick father, and missed a chance to see his face as he said goodbye and "I love you" one last time.
The idea that cyberspace and the physical world are somehow separate really starts to break down when we examine events like the S3 outage of 2017. The truth is that all of cyberspace lives in meatspace, and that an increasingly large portion of the physical world is represented in and tied to cyberspace. Once upon a time we might have drawn the causal relationship as directional, with disruptions only flowing from the physical world to the world of computing and distributed services, but we're too intertwined now for that to be true. It's not necessarily a good thing or a bad thing -- it's just a thing that we have to be aware of and plan for when we build products.
And when I say “products”, if you've been paying attention, you've perhaps already made the substitution I'm about to suggest: the problem of the unpredictable execution environment is one we have to deal with not only when we build products, but also when we build teams and organizations.
As I write this section of the book, the year is 2021 and I'm sitting at home, in the same chair I usually work from on weekdays. We're at a point in the COVID pandemic where many Americans have received a vaccine against the disease, and social restrictions have started to relax. The software industry is among those that have fared best in dealing with the pandemic so far. Many, many businesses have struggled or shuttered their doors entirely, never to reopen. And while the pandemic might be the most recent and easily recognizable example of a disruptive event at scale, it's certainly not the first disruption of its kind and won't be the last. Organizational casualties are often a result of natural disasters, political upheavals, and the forward march of technological transformation.
The challenge of working within an unpredictable execution environment is a constraint of the human platform. As a constraint, we can't change it. We can only work to mitigate its impact and try to function within its boundaries.
...
(continued in Part 8, below)