Pre-Mortems and Chaos.
Amit Sengupta
Portfolio Lead (Associate Director) - Cloud and Open Source Capability Unit at CAPGEMINI AMERICA, INC
Software failures and operational outages are nightmares for engineering teams. To get ahead of unplanned failures, we need a "pre-mortem." As opposed to a post-mortem, where we analyze what caused a failure (or death), a pre-mortem entails envisioning the ways an idea or a feature could fail (or die) before we even begin to build it.
The exercise ensures that we take these potential failure scenarios into account while designing and building a feature. A pre-mortem can be applied to any decision we are about to make: if we can envision the ways the decision could lead to unfavorable consequences, we can outline what to account for while making the decision and put in contingencies where necessary.
Chaos Engineering is one way to put the pre-mortem into practice. To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the facilitation of experiments that uncover systemic weaknesses. These experiments follow four steps (a sketch of the experiment loop follows the list):
1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady-state between the control group and the experimental group.
The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
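To make the four steps concrete, here is a minimal sketch of an experiment loop in Python. The get_metric, inject_fault, and rollback callables are placeholders you would supply yourself (for example, successful requests per second from your monitoring stack and a script that kills a server), and the three-standard-deviation tolerance is an illustrative choice, not a rule.

```python
import statistics
import time

def measure_steady_state(get_metric, samples=30, interval=1.0):
    """Step 1: sample a metric (e.g., successful requests/sec) to characterize 'normal'."""
    readings = []
    for _ in range(samples):
        readings.append(get_metric())
        time.sleep(interval)
    return statistics.mean(readings), statistics.stdev(readings)

def run_experiment(get_metric, inject_fault, rollback, tolerance=3.0):
    """Baseline, hypothesize, inject a real-world fault, then compare."""
    # Step 1: measure the control group's steady state.
    baseline_mean, baseline_stdev = measure_steady_state(get_metric)

    # Step 2: hypothesize the metric stays within `tolerance` standard deviations.
    lower = baseline_mean - tolerance * baseline_stdev
    upper = baseline_mean + tolerance * baseline_stdev

    # Step 3: introduce a variable that reflects a real-world event.
    inject_fault()
    try:
        experiment_mean, _ = measure_steady_state(get_metric)
    finally:
        rollback()  # always restore the system, even if sampling fails

    # Step 4: try to disprove the hypothesis by comparing control vs. experiment.
    survived = lower <= experiment_mean <= upper
    return survived, baseline_mean, experiment_mean
```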
Below, we walk through a few chaos engineering experiments and pair each with a real-world scenario.
Chaos Engineering Experiment 1 — Load Balancers
In the tech world, load balancers distribute incoming network traffic across a group of backend servers. They route requests to ensure they’re handled with maximum speed and efficiency. If a server goes down, the load balancer adjusts — routing and distributing traffic to the other servers.
With chaos engineering, you can test your load balancer's settings to see if they're optimal for reducing outages. You can run an experiment where you deregister a target from your load balancer's target group and observe what happens. Will traffic still be routed and distributed efficiently, or will it crash the system at the worst possible moment, like when a last-minute shopper is trying to check out with a gift for mom?
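If your targets sit behind an AWS Application Load Balancer, a minimal sketch of this experiment with boto3 might look like the following. The target group ARN and instance ID are hypothetical placeholders, and you would watch your steady-state metric (latency, error rate) while the target is out of the pool.

```python
import time
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical identifiers -- substitute your own target group and instance.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/abc123"
VICTIM_INSTANCE_ID = "i-0123456789abcdef0"

def deregister_and_observe():
    """Pull one target out of the pool and watch how the load balancer reacts."""
    elbv2.deregister_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": VICTIM_INSTANCE_ID}],
    )
    # Poll target health while traffic is (hopefully) redistributed to the
    # remaining targets; pair this with the steady-state metric from above.
    for _ in range(10):
        health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
        states = [t["TargetHealth"]["State"] for t in health["TargetHealthDescriptions"]]
        print(states)
        time.sleep(30)
    # Roll back: put the target back once the observation window ends.
    elbv2.register_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": VICTIM_INSTANCE_ID}],
    )
```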
In the real world, think of this experiment as supermarket checkout lines. You've got your standard lines with a cashier, your 10-items-or-fewer lines, and then your self-checkout station. But what happens when people don't respect the 10-item limit or bring a week's worth of groceries through the self-checkout?
Your load balancer would be like a shift manager making sure that enough registers are up and running, reminding people to respect the 10-item limit rule, and directing longer lines to other registers to even out the wait times. If it does its job correctly, no small children will be lying down on the floor, waiting out the interminable lines.
Chaos Engineering Experiment 2 — Security Groups
Security groups are essentially virtual firewalls. Their rules control the inbound traffic that’s allowed to reach your instances and the outbound traffic that’s allowed to leave them. They protect your resources by ensuring they’re only exposed to trusted resources and IP addresses. “Never trust/always verify” is a core principle of a well-managed security approach.
A great chaos engineering experiment is to swap out the security groups for a specified load balancer. What happens if a random security group sets the rules? Will non-trusted traffic still pass through? The whole point of these experiments is to find issues before they become production problems.
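Assuming the same AWS setup, here is a minimal sketch of swapping a load balancer's security groups with boto3. The ARNs and group IDs are hypothetical, and you would probe your endpoints from both trusted and untrusted sources while the "wrong" rules are in effect.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical identifiers -- substitute your own load balancer and groups.
LOAD_BALANCER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-app/abc123"
CHAOS_SECURITY_GROUPS = ["sg-0badbadbadbadbad0"]  # a deliberately "random" group

def swap_security_groups():
    """Replace the load balancer's security groups, remembering the originals."""
    lbs = elbv2.describe_load_balancers(LoadBalancerArns=[LOAD_BALANCER_ARN])
    original_groups = lbs["LoadBalancers"][0]["SecurityGroups"]

    # Inject the fault: the load balancer is now governed by the wrong rules.
    elbv2.set_security_groups(
        LoadBalancerArn=LOAD_BALANCER_ARN,
        SecurityGroups=CHAOS_SECURITY_GROUPS,
    )
    return original_groups  # hand these to the rollback step

def restore_security_groups(original_groups):
    """Roll back to the trusted rules once the experiment window closes."""
    elbv2.set_security_groups(
        LoadBalancerArn=LOAD_BALANCER_ARN,
        SecurityGroups=original_groups,
    )
```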
In the real world, think of this experiment as TSA at the airport. The agents are almost always professional and courteous, and they do the best they can. But what would happen if you switched out these trained professionals with random folks off the street? You could end up seated next to a passenger who snuck their carnivorous house pet through security.
A trained TSA agent isn’t going to let an animal inconsistent with its guidelines onto the flight. When your chaos engineering experiments expose similar security group gaps, you can work to mitigate them.
Chaos Engineering Experiment 3 — CPU Spikes
Sometimes your local machine is going to run slowly, like when you miss your morning cup of coffee. There can be any number of reasons for the lag, but prolonged speed issues generally indicate a CPU spike (i.e., a CPU hog). You've got a process stuck somewhere, and it's keeping other programs from running properly. Maybe you opened that phishing link against your better judgment, or maybe your bored kids borrowed your laptop and downloaded every game, show, and movie they could find. Whatever happened, your machine is taking forever to load!
You can run a chaos engineering experiment to force a CPU spike to see how well different apps on your local machine function under the stress. You can even customize the spike percentages to reflect varying degrees of spikiness. It’s a great way to test your system’s resiliency and find your thresholds for handling volume. You can find out the breaking point between acceptable performance and seriously considering taking your machine to a witch doctor to exorcise the demons inside.
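Here is a minimal local sketch of a CPU spike experiment in Python. Dedicated tools such as stress-ng or a chaos platform give you finer control, but the idea is the same: each worker busy-loops for a configurable fraction of every 100 ms slice, so you can dial the "spikiness" up or down.

```python
import multiprocessing
import time

def burn_cpu(load_percent, duration_seconds):
    """Busy-loop for `load_percent` of every 100 ms slice, sleep the rest."""
    slice_seconds = 0.1
    busy = slice_seconds * (load_percent / 100.0)
    end = time.monotonic() + duration_seconds
    while time.monotonic() < end:
        start = time.monotonic()
        while time.monotonic() - start < busy:
            pass                       # spin: this is the "hog"
        time.sleep(slice_seconds - busy)

def spike(load_percent=80, duration_seconds=60, cores=None):
    """Spin up one hog per core so the spike shows up machine-wide."""
    cores = cores or multiprocessing.cpu_count()
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(load_percent, duration_seconds))
        for _ in range(cores)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    # Watch how your other apps behave while this runs.
    spike(load_percent=80, duration_seconds=60)
```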
Think of CPU spike experiments as the beginning of a new month, when hundreds of new shows and movies hit your favorite streaming services all at once. You want to watch them all, and then you freak out because there are so many options to choose from. Can you handle the binge-watching overload, or will your watchlist just pile up past the point of no return?
Chaos Engineering Experiment 4 — Drain Nodes
Okay, so draining nodes sounds kind of gross, but all this really means is that you're evicting pods from a node in Kubernetes. What you're doing here is ensuring no new pods get scheduled on the node and any currently running pods are evicted. You always want to allow them to terminate gracefully when they're no longer needed, giving you a chance to clean them up.
You can run chaos engineering experiments to identify either specific or random nodes to drain. You can also set parameters, like how many seconds you want to wait for the nodes to drain or how many random nodes should be affected. So what will happen if your nodes drain without you cleaning them up first? The container orchestration struggle is real.
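A minimal sketch of such an experiment, assuming kubectl access to the cluster: pick a random node, drain it with a grace period so pods can terminate gracefully, observe where the workloads land, then uncordon the node when you're done. The helper names and parameter values here are illustrative, not prescriptive.

```python
import random
import subprocess

def drain_random_node(grace_period_seconds=30, timeout_seconds=120):
    """Pick a random node and drain it the way kubectl would."""
    nodes = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    victim = random.choice(nodes)  # e.g. "node/worker-2"

    # Cordon the node and evict its pods, giving them time to terminate gracefully.
    subprocess.run(
        [
            "kubectl", "drain", victim.removeprefix("node/"),
            "--ignore-daemonsets",
            "--delete-emptydir-data",
            f"--grace-period={grace_period_seconds}",
            f"--timeout={timeout_seconds}s",
        ],
        check=True,
    )
    return victim

def restore_node(victim):
    """Uncordon the node once you've observed how the workloads rescheduled."""
    subprocess.run(
        ["kubectl", "uncordon", victim.removeprefix("node/")],
        check=True,
    )
```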
I think of this experiment as a terrible roommate. You know, the one who’s always late on rent (but never late to go out for drinks), eats your food, clogs your toilet, and never does dishes. While you’re definitely ready to move on from this roommate, you want to do it in a way that won’t cause any new problems. A graceful eviction here means providing notice and following all the legal guidelines and best practices for kicking someone out of your flat. You don’t need any more drama in your life. That’s true for containers, too — drain your nodes, and keep your containers running in harmony.
While we've all experienced quite enough chaos in our daily lives, injecting some chaos into your application development in a careful, controlled environment is just good practice. We need to figure out where the stressors and breaking points lie, and then work to mitigate them before they turn into production incidents.