Pre-Mortems and Chaos.
Amit Sengupta
Portfolio Lead (Associate Director) - Cloud and Open Source Capability Unit at CAPGEMINI AMERICA, INC
Software failures and operational outages are nightmares for engineering teams. To get ahead of unplanned failures, we need a "pre-mortem." As opposed to a post-mortem, where we analyze what caused a failure (or death), a pre-mortem entails envisioning the ways an idea or a feature could fail (or die) before we even begin to build it.
The exercise ensures that we take these potential failure scenarios into account while designing and building a feature. A pre-mortem can be applied to any decision we are about to make: if we can envision the ways the decision could lead to unfavorable consequences, we can outline what to account for while making the decision and put in contingencies where necessary.
Chaos Engineering is one way to put the pre-mortem into practice. To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the facilitation of experiments that uncover systemic weaknesses. These experiments follow four steps (a sketch of the experiment loop follows the list):
1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady-state between the control group and the experimental group.
The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
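To make the four steps concrete, here is a minimal sketch of an experiment loop in Python. The get_metric, inject_fault, and rollback callables are placeholders you would supply yourself (for example, successful requests per second from your monitoring stack and a script that kills a server), and the three-standard-deviation tolerance is an illustrative choice, not a rule.

```python
import statistics
import time

def measure_steady_state(get_metric, samples=30, interval=1.0):
    """Step 1: sample a metric (e.g., successful requests/sec) to characterize 'normal'."""
    readings = []
    for _ in range(samples):
        readings.append(get_metric())
        time.sleep(interval)
    return statistics.mean(readings), statistics.stdev(readings)

def run_experiment(get_metric, inject_fault, rollback, tolerance=3.0):
    """Baseline, hypothesize, inject a real-world fault, then compare."""
    # Step 1: measure the control group's steady state.
    baseline_mean, baseline_stdev = measure_steady_state(get_metric)

    # Step 2: hypothesize the metric stays within `tolerance` standard deviations.
    lower = baseline_mean - tolerance * baseline_stdev
    upper = baseline_mean + tolerance * baseline_stdev

    # Step 3: introduce a variable that reflects a real-world event.
    inject_fault()
    try:
        experiment_mean, _ = measure_steady_state(get_metric)
    finally:
        rollback()  # always restore the system, even if sampling fails

    # Step 4: try to disprove the hypothesis by comparing control vs. experiment.
    survived = lower <= experiment_mean <= upper
    return survived, baseline_mean, experiment_mean
```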
Below, we walk through a few chaos engineering experiments and pair each with a real-world scenario.
Chaos Engineering Experiment 1 — Load Balancers
In the tech world, load balancers distribute incoming network traffic across a group of backend servers. They route requests to ensure they’re handled with maximum speed and efficiency. If a server goes down, the load balancer adjusts — routing and distributing traffic to the other servers.
With chaos engineering, you can test your load balancer's settings to see if they're optimal for reducing outages. You can run an experiment where you deregister a target from your load balancer's target group and observe what happens. Will traffic still be routed and distributed efficiently, or will it crash the system at the worst possible moment, like when a last-minute shopper is trying to check out with a gift for mom?
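If your targets sit behind an AWS Application Load Balancer, a minimal sketch of this experiment with boto3 might look like the following. The target group ARN and instance ID are hypothetical placeholders, and you would watch your steady-state metric (latency, error rate) while the target is out of the pool.

```python
import time
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical identifiers -- substitute your own target group and instance.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/abc123"
VICTIM_INSTANCE_ID = "i-0123456789abcdef0"

def deregister_and_observe():
    """Pull one target out of the pool and watch how the load balancer reacts."""
    elbv2.deregister_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": VICTIM_INSTANCE_ID}],
    )
    # Poll target health while traffic is (hopefully) redistributed to the
    # remaining targets; pair this with the steady-state metric from above.
    for _ in range(10):
        health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
        states = [t["TargetHealth"]["State"] for t in health["TargetHealthDescriptions"]]
        print(states)
        time.sleep(30)
    # Roll back: put the target back once the observation window ends.
    elbv2.register_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": VICTIM_INSTANCE_ID}],
    )
```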
In the real world, think of this experiment as supermarket checkout lines. You've got your standard lines with a cashier, your 10-items-or-fewer lines, and then your self-checkout station. But what happens when people don't respect the 10-item limit or bring a week's worth of groceries through the self-checkout?
Your load balancer would be like a shift manager making sure that enough registers are up and running, reminding people to respect the 10-item limit rule, and directing longer lines to other registers to even out the wait times. If it does its job correctly, no small children will be lying down on the floor, waiting out the interminable lines.
Chaos Engineering Experiment 2 — Security Groups
Security groups are essentially virtual firewalls. Their rules control the inbound traffic that’s allowed to reach your instances and the outbound traffic that’s allowed to leave them. They protect your resources by ensuring they’re only exposed to trusted resources and IP addresses. “Never trust/always verify” is a core principle of a well-managed security approach.
A great chaos engineering experiment is to swap out the security groups for a specified load balancer. What happens if a random security group sets the rules? Will non-trusted traffic still pass through? The whole point of these experiments is to find issues before they become production problems.
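Assuming the same AWS setup, here is a minimal sketch of swapping a load balancer's security groups with boto3. The ARNs and group IDs are hypothetical, and you would probe your endpoints from both trusted and untrusted sources while the "wrong" rules are in effect.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical identifiers -- substitute your own load balancer and groups.
LOAD_BALANCER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-app/abc123"
CHAOS_SECURITY_GROUPS = ["sg-0badbadbadbadbad0"]  # a deliberately "random" group

def swap_security_groups():
    """Replace the load balancer's security groups, remembering the originals."""
    lbs = elbv2.describe_load_balancers(LoadBalancerArns=[LOAD_BALANCER_ARN])
    original_groups = lbs["LoadBalancers"][0]["SecurityGroups"]

    # Inject the fault: the load balancer is now governed by the wrong rules.
    elbv2.set_security_groups(
        LoadBalancerArn=LOAD_BALANCER_ARN,
        SecurityGroups=CHAOS_SECURITY_GROUPS,
    )
    return original_groups  # hand these to the rollback step

def restore_security_groups(original_groups):
    """Roll back to the trusted rules once the experiment window closes."""
    elbv2.set_security_groups(
        LoadBalancerArn=LOAD_BALANCER_ARN,
        SecurityGroups=original_groups,
    )
```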
In the real world, think of this experiment as TSA at the airport. The agents are almost always professional and courteous, and they do the best they can. But what would happen if you switched out these trained professionals with random folks off the street? You could end up seated next to a passenger who snuck their carnivorous house pet through security.
A trained TSA agent isn’t going to let an animal inconsistent with its guidelines onto the flight. When your chaos engineering experiments expose similar security group gaps, you can work to mitigate them.
Chaos Engineering Experiment 3 — CPU Spikes
Sometimes your local machine is going to run slowly, like when you miss your morning cup of coffee. There can be any number of reasons for the lag, but prolonged speed issues generally indicate a CPU spike (i.e., a CPU hog). You've got a process stuck somewhere, and it's keeping other programs from running properly. Maybe you opened that phishing link against your better judgment, or maybe your bored kids borrowed your laptop and downloaded every game, show, and movie they could find. Whatever happened, your machine is taking forever to load!
You can run a chaos engineering experiment to force a CPU spike to see how well different apps on your local machine function under the stress. You can even customize the spike percentages to reflect varying degrees of spikiness. It’s a great way to test your system’s resiliency and find your thresholds for handling volume. You can find out the breaking point between acceptable performance and seriously considering taking your machine to a witch doctor to exorcise the demons inside.
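Here is a minimal local sketch of a CPU spike experiment in Python. Dedicated tools such as stress-ng or a chaos platform give you finer control, but the idea is the same: each worker busy-loops for a configurable fraction of every 100 ms slice, so you can dial the "spikiness" up or down.

```python
import multiprocessing
import time

def burn_cpu(load_percent, duration_seconds):
    """Busy-loop for `load_percent` of every 100 ms slice, sleep the rest."""
    slice_seconds = 0.1
    busy = slice_seconds * (load_percent / 100.0)
    end = time.monotonic() + duration_seconds
    while time.monotonic() < end:
        start = time.monotonic()
        while time.monotonic() - start < busy:
            pass                       # spin: this is the "hog"
        time.sleep(slice_seconds - busy)

def spike(load_percent=80, duration_seconds=60, cores=None):
    """Spin up one hog per core so the spike shows up machine-wide."""
    cores = cores or multiprocessing.cpu_count()
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(load_percent, duration_seconds))
        for _ in range(cores)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    # Watch how your other apps behave while this runs.
    spike(load_percent=80, duration_seconds=60)
```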
Think of CPU spike experiments as the beginning of a new month, when hundreds of new shows and movies hit your favorite streaming services all at once. You want to watch them all, and then you freak out because there are so many options to choose from. Can you handle the binge-watching overload, or will your watchlist just pile up past the point of no return?
Chaos Engineering Experiment 4 — Drain Nodes
Okay, so draining nodes sounds kind of gross, but all this really means is that you're evicting pods from a node in Kubernetes. What you're doing here is ensuring no new pods get scheduled on the node and any currently running pods are evicted. You always want to allow them to terminate gracefully when they're no longer needed, giving you a chance to clean them up.
You can run chaos engineering experiments to identify either specific or random nodes to drain. You can also set parameters, like how many seconds you want to wait for the nodes to drain or how many random nodes should be affected. So what will happen if your nodes drain without you cleaning them up first? The container orchestration struggle is real.
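A minimal sketch of such an experiment, assuming kubectl access to the cluster: pick a random node, drain it with a grace period so pods can terminate gracefully, observe where the workloads land, then uncordon the node when you're done. The helper names and parameter values here are illustrative, not prescriptive.

```python
import random
import subprocess

def drain_random_node(grace_period_seconds=30, timeout_seconds=120):
    """Pick a random node and drain it the way kubectl would."""
    nodes = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    victim = random.choice(nodes)  # e.g. "node/worker-2"

    # Cordon the node and evict its pods, giving them time to terminate gracefully.
    subprocess.run(
        [
            "kubectl", "drain", victim.removeprefix("node/"),
            "--ignore-daemonsets",
            "--delete-emptydir-data",
            f"--grace-period={grace_period_seconds}",
            f"--timeout={timeout_seconds}s",
        ],
        check=True,
    )
    return victim

def restore_node(victim):
    """Uncordon the node once you've observed how the workloads rescheduled."""
    subprocess.run(
        ["kubectl", "uncordon", victim.removeprefix("node/")],
        check=True,
    )
```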
I think of this experiment as a terrible roommate. You know, the one who’s always late on rent (but never late to go out for drinks), eats your food, clogs your toilet, and never does dishes. While you’re definitely ready to move on from this roommate, you want to do it in a way that won’t cause any new problems. A graceful eviction here means providing notice and following all the legal guidelines and best practices for kicking someone out of your flat. You don’t need any more drama in your life. That’s true for containers, too — drain your nodes, and keep your containers running in harmony.
While we've all experienced quite enough chaos in our daily lives, injecting some chaos into your application development in a careful, controlled environment is just good practice. We need to figure out where the stressors and breaking points lie, and then work to mitigate them before they turn into production incidents.