What is Chaos Engineering
Priyal Walpita
CTO at ZorroSign Inc. | Expert in AI/ML, Blockchain, Quantum Computing, Cybersecurity & Secure Coding | Digital Security Innovator | Mentor & Trainer in Advanced Tech
Have you ever been in a situation where a perfectly tested system crashed because of an unforeseeable scenario? Web systems have grown complex with the increasing number of microservices and distributed cloud architectures. It’s alarming because we all depend on these systems more than ever, yet failures have become much harder to predict.
In this article, you will get an introduction to chaos engineering and the tools we can use to implement its principles.
Everything today is fast paced: fast food, fast cars, you name it. Software development is no different. For example, we do rapid development and deployment with automated CI/CD pipelines, content-driven deployments, and so on. Unlike the early days of software development, the time it takes to get from a developer’s machine to the production environment is now short. We used to have monthly releases; today some software systems ship production releases every hour, or even every few minutes.
In a traditional deployment architecture, we used to have monoliths. A monolithic architecture is the traditional unified model for the design of a software program, in which the whole application is built and deployed as a single unit.
Modern Architecture Deployment
The diagram below shows Netflix’s deployment architecture.
Netflix uses a large number of microservices. While a monolith is a single, large unit, a microservice architecture uses small, modular units of code that can be deployed independently of the rest of a product’s components.
However, in this kind of complex architecture, everything is fine until you get errors in the production environment. If one area fails, it is going to affect many other areas. All the teams, including the support team, the DevOps team, and then the whole technical team, have to get together to try to resolve the problem. That is impractical for larger software products.
The only solution to address these outages is to reproduce production scenarios in the pre-production or staging environments. Nevertheless, it’s challenging to create all such scenarios beforehand.
To fabricate real-life scenarios in a staging environment, we have to tackle four possible categories of events:
- Known Knowns: known events with expected consequences (things you are aware of and understand)
- Known Unknowns: known events with unexpected consequences (things you are aware of but don’t fully understand)
- Unknown Knowns: unknown events with expected consequences (things you understand but are not aware of)
- Unknown Unknowns: unknown events with unexpected consequences (things you are neither aware of nor fully understand)
In chaos engineering, we mostly focus on the last two categories. You may ask why we have to deal with unknowns at all. The reason is that most modern software systems are complex systems, meaning we have to deal with large amounts of data and complex requirements.
Let’s examine the difference between simple and complex software systems. In a simple system, the effect of a change is linear: the overall effect on the system is proportional to the change you make, so you can predict the outcome. In a complex system, however, the effect of a change can be non-linear, even exponential. For example, think of the bullwhip effect. The amplitude of a whip increases down its length; the further from the originating signal, the greater the distortion of the wave pattern. Similarly, in a supply chain, forecast accuracy decreases as one moves upstream. Furthermore, simple systems are comprehensible: simply put, a human can foresee the effect of a change. In a complex system, it is impossible to come up with such a simple mental model.
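To make the difference concrete, here is a minimal sketch in Python. The function names and the amplification factor of 2.0 are illustrative assumptions, not measurements from a real system. In the simple case, every stage adds the same fixed amount, so the effect grows linearly. In the complex case, every stage amplifies whatever it receives, so the effect compounds, as in the bullwhip effect.

```python
from typing import List


def simple_system(change: float, stages: int) -> List[float]:
    """Linear growth: every stage adds the same fixed amount."""
    effects = []
    current = 0.0
    for _ in range(stages):
        current += change
        effects.append(current)
    return effects


def complex_system(change: float, stages: int, amplification: float = 2.0) -> List[float]:
    """Compounding growth: every stage amplifies what it receives.

    The amplification factor is a hypothetical value chosen for illustration.
    """
    effects = []
    current = change
    for _ in range(stages):
        current *= amplification  # distortion compounds stage by stage
        effects.append(current)
    return effects
```

With a change of 10 units over four stages, simple_system(10.0, 4) gives 10, 20, 30, 40, while complex_system(10.0, 4) gives 20, 40, 80, 160. The further you are from the original change, the harder the second series is to predict by intuition alone.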
There’s another framework that can be used to define complex systems. That’s called the Dynamic Safety Model.
In every complex system, you will have to negotiate three components.
- Economics
- The Workload
- Safety
You can picture all three components in a complex system as tied together with a rubber band. For example, while you are working within a budget, you cannot ask your manager for 100,000 AWS virtual machines; likewise, you cannot compromise the safety of the system to stay within an agreed budget. You have to balance all three aspects from a central point.
At the same time, if you are developing a complex system, you will have to deal with dark debt.
What’s Dark Debt?
Simply put, dark debt is the set of hidden vulnerabilities in a seemingly stable IT environment that will eventually wreak havoc. It can refer to anything that happens that wasn’t planned for or defended against.
Small anomalies in a complex system can lead to total system failures, and these failures are not recognizable during the design, development, or QA processes. To make matters worse, there are no specific countermeasures for these anomalies.
That’s why we need mechanisms such as chaos engineering when developing complex systems.
This brings up the need to evaluate complex systems.
The best way to evaluate a complex system is to create turbulent conditions in a controlled environment and inspect the outcome.
A timely example of this is the vaccine we are getting for COVID-19. A vaccine exposes our immune system to a harmless form or component of the virus, which triggers an immune response. It is a good example of creating a troublesome condition in a controlled environment: later, when the real virus attacks us, our body already knows how to respond.
What is Chaos Engineering
“Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions” [Ref:https://en.wikipedia.org/wiki/Chaos_Engineering].
Most complex systems are distributed systems. So let’s see how to achieve this by analyzing the process behind it.
The Process (The Principles)
- Define the steady state: in your complex system, you have to define a steady state, that is, some measurable output that indicates normal behavior, such as how the system is behaving, the number of users, etc.
- Hypothesize that the steady state will continue in both the control group and the experimental group: you need two groups, a control group and an experimental group, so that the effect of the introduced events can be compared against normal behavior.
- Introduce variables that reflect real-world events: what can happen in a real-world production system? For example, there might be a DDoS (Distributed Denial of Service) attack; VMs (virtual machines) in some of your AWS regions can go down, which may force you to switch to DR (disaster recovery); or lots of users may try to buy from your e-commerce site on Black Friday, causing web traffic to spike. Here, you are artificially fabricating a real-world scenario in a test environment.
- Try to disprove the hypothesis by looking at the results: you are trying to disprove the hypothesis you made earlier. (We will talk about a concept called a game day, which we use to disprove the hypothesis, later in this article.)
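The four steps above can be sketched in a few lines of Python. This is only a simulation under stated assumptions: simulated_response_time is a hypothetical stand-in for timing a real request, the steady-state metric is assumed to be the median response time, and the injected fault is assumed to be a slow dependency. A real experiment would measure a live system instead.

```python
import random
import statistics
from typing import Dict


def simulated_response_time(fault_injected: bool, rng: random.Random) -> float:
    """Hypothetical stand-in for timing one real request, in seconds."""
    base = rng.uniform(0.05, 0.20)
    # Assumed fault: a slow dependency adds 0.1 to 0.3 seconds of delay
    delay = rng.uniform(0.10, 0.30) if fault_injected else 0.0
    return base + delay


def median_response_time(fault_injected: bool, samples: int, rng: random.Random) -> float:
    """Measure the steady-state metric (median response time) for one group."""
    timings = [simulated_response_time(fault_injected, rng) for _ in range(samples)]
    return statistics.median(timings)


def run_experiment(threshold_s: float, samples: int = 200, seed: int = 42) -> Dict[str, object]:
    """Compare a control group with an experimental group that has the fault injected."""
    rng = random.Random(seed)
    control = median_response_time(False, samples, rng)
    experimental = median_response_time(True, samples, rng)
    return {
        "control_median_s": control,
        "experimental_median_s": experimental,
        # Hypothesis: the steady state holds in BOTH groups
        "hypothesis_holds": control <= threshold_s and experimental <= threshold_s,
    }
```

Calling run_experiment(threshold_s=0.5) leaves the hypothesis standing, while a stricter run_experiment(threshold_s=0.25) disproves it, because the injected delay pushes the experimental group’s median above the threshold. A disproved hypothesis is the useful result: it points at a weakness to fix before a real outage finds it.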
Where to start chaos engineering
The objectives of chaos engineering are:
- Confirming known knowns
- Testing the unknowns
- Finding unknown unknowns
Let’s see how we can achieve these objectives by using a practice called Game Day.
The goal of a Game Day is to increase reliability by purposefully creating major failures regularly. Game Days also help demonstrate the value of chaos engineering.
Objectives of the Game Day:
- How well the alerting system works
- How team members react to incidents
- Indications of system health (do the monitoring systems reflect the actual state of the systems?)
- How the support and DevOps teams react to the turbulence
- How the dev teams react to the turbulence
Plan the Game Day
1. Pick a hypothesis to explore from the testing backlog
2. Pick the style
- Informed style: you tell your team in advance that you are going to run a chaos engineering Game Day, and you properly define how you plan to do it, the time, and the areas of the system that will be affected. All the teams, such as DevOps, developers, and support, are included.
- Dungeons & Dragons style: the team is not told in advance and thinks the failure is real, and you then investigate how the system and the team react in such scenarios.
The Dungeons & Dragons style is most appropriate if your processes are streamlined. It is essential that your monitoring, redundancy, and backup systems are in place.
If your system is immature, it is smarter to go ahead with the Informed style.
3. Decide who is participating
The next step is to decide on the participants. It is better to have everybody involved: not only people from development, DevOps, and support, but also representation from the management team.
4. Where it’s happening: if you adopt the Informed style, it’s ideal to have the whole team in one room. You also have to decide which area of the system you plan to test.
5. Duration: you have to decide whether the Game Day will run for a full day, a few hours, etc.
6. Design the chaos experiment plan
The following is an example:
- Steady-State Hypothesis: the root URL ("/") should respond with a 200 status code within 1 second.
- Method: disconnect the DBI cluster from the network.
- Rollbacks: reconnect the DBI cluster. (It’s important to have a rollback plan in case something goes wrong, so the team knows exactly what to do.)
7. Get the approval: it’s important to inform the relevant parties, including management, to avoid creating panic among customers and other users of the system.
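The example plan from step 6 (steady-state hypothesis, method, rollback) could be sketched in Python as follows. Everything here is hypothetical and for illustration only: FakeSystem is an in-memory stand-in for the real system, and the names ChaosExperiment, run, and build_experiment are not taken from any particular tool. The point of the sketch is the shape of the plan, and that the rollback runs no matter what happens during the experiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class FakeSystem:
    """In-memory stand-in for the real system (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.db_connected = True

    def get_root(self) -> Tuple[int, float]:
        """Return (status code, response time in seconds) for the root URL."""
        if self.db_connected:
            return 200, 0.12
        return 503, 0.05  # assumed behavior: fail fast when the database is unreachable


@dataclass
class ChaosExperiment:
    title: str
    steady_state: Callable[[], bool]  # probe: True while the hypothesis holds
    method: Callable[[], None]        # injects the failure
    rollback: Callable[[], None]      # undoes the failure


def run(experiment: ChaosExperiment) -> Dict[str, bool]:
    """Run one experiment, always rolling back, and report what happened."""
    if not experiment.steady_state():
        # Never inject failures into a system that is already unhealthy
        return {"ran": False, "deviated": False, "recovered": False}
    try:
        experiment.method()
        deviated = not experiment.steady_state()
    finally:
        experiment.rollback()  # executed even if the method or the probe raises
    return {"ran": True, "deviated": deviated, "recovered": experiment.steady_state()}


def build_experiment(system: FakeSystem) -> ChaosExperiment:
    """Wire the plan from step 6 to the fake system."""

    def root_is_healthy() -> bool:
        status, elapsed = system.get_root()
        return status == 200 and elapsed < 1.0

    def disconnect_db() -> None:
        system.db_connected = False

    def reconnect_db() -> None:
        system.db_connected = True

    return ChaosExperiment(
        title="Root URL survives losing the database cluster",
        steady_state=root_is_healthy,
        method=disconnect_db,
        rollback=reconnect_db,
    )
```

Running run(build_experiment(FakeSystem())) reports that the experiment ran, that the steady state deviated while the database was disconnected, and that the system recovered after the rollback. In this fake system the hypothesis is disproved, which is exactly the kind of finding a Game Day is meant to surface.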
Do we apply chaos engineering to technical teams only?
The answer is no.
We can apply it to business analysis teams as well. For example, in a scenario where all your business analysts are on leave, who would do the analysis, and how would it work? We can use the principles of chaos engineering to simulate such a scenario.
Can we do chaos engineering testing manually?
The answer is no, because it is just too complex for a manual process.
There are many tools for chaos engineering, for example:
- Netflix Simian Army
- Litmus
- Chaos Mesh
- Chaos Toolkit
I am going to give you a very brief introduction to the tools in Netflix’s Simian Army.
Netflix’s Simian Army
Netflix’s Simian Army includes the following tools:
- Chaos Monkey is a resiliency tool invented by Netflix to test the resiliency of its IT infrastructure. It intentionally disables computers in the production network to test how the remaining systems (VMs) respond to the outage. Netflix first implemented this on AWS, randomly taking down EC2 instances in its AWS infrastructure and then analyzing how the system behaves when those instances are gone.
- Chaos Gorilla and Chaos Kong are like bigger versions of Chaos Monkey. Chaos Gorilla takes down an entire availability zone in AWS, and Chaos Kong takes down an entire region. Netflix then analyzes how that scenario affects the entire system. By dropping a full availability zone or cloud region, these tools exercise the system’s response and recovery to zone-level and region-level failure.
- Latency Monkey: introduces communication delays to simulate degradation or outages in a network. It artificially creates latency or bottlenecks in communication channels to show how the system behaves in a delayed, high-latency environment.
- Conformity Monkey: a tool that determines whether an instance is nonconforming by testing it against a set of rules. If any of the rules determines that the instance is not conforming, the monkey sends an email notification to the owner of the instance. For example, one rule might say that a specific service should have five instances up at any given time. Conformity Monkey checks whether those five instances are up and, if not, sends an email.
- Security Monkey: created as an extension of Conformity Monkey, it searches for and disables instances that have known vulnerabilities or improper configurations, and it locates potential security vulnerabilities and violations.
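To show the idea behind two of these tools, here is a minimal in-memory sketch in Python. It is not Netflix’s implementation and does not call any AWS API. The Instance class, the fleet list, and both function names are assumptions made for illustration. The first function stops one random running instance, as Chaos Monkey does. The second checks a "minimum running instances" rule like the Conformity Monkey example above and returns the message that would be emailed.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instance:
    instance_id: str
    service: str
    running: bool = True


def chaos_monkey(fleet: List[Instance], rng: random.Random) -> Optional[Instance]:
    """Stop one randomly chosen running instance and return it (None if nothing is running)."""
    candidates = [inst for inst in fleet if inst.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def conformity_monkey(fleet: List[Instance], min_running: Dict[str, int]) -> List[str]:
    """Return one notification message per service that has too few running instances."""
    messages = []
    for service, required in min_running.items():
        running = sum(1 for inst in fleet if inst.service == service and inst.running)
        if running < required:
            messages.append(f"{service}: {running} of {required} required instances are running")
    return messages
```

For a fleet of five "checkout" instances and the rule {"checkout": 5}, conformity_monkey returns no messages at first. After one call to chaos_monkey, it reports that only 4 of 5 required instances are running. In a real setup, the interesting question is what happens next: does the platform replace the lost instance on its own, and do users notice in the meantime?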
What can we do in chaos engineering?
- Execute real-world events
- Automate experiments
- Minimize blast radius
Is it advisable to do chaos testing in a QA environment or in a pre-production environment?
Usually, chaos engineering tests are supposed to be carried out in the production environment. However, it’s not practical to take down an entire production environment. We need to minimize the blast radius so that the impact on customers is minimal. You have to define the blast radius for each chaos engineering test and keep all of the teams constantly communicating and working together to make sure real users are not affected.
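One way to think about limiting the blast radius is sketched below in Python. The function names, the 5 percent cap, and the error-rate callback are all assumptions for illustration, not part of any specific tool. The sketch does two things: it caps how many instances an experiment may touch, and it stops injecting failures as soon as a user-facing metric crosses an agreed limit.

```python
import random
from typing import Callable, List


def pick_targets(instances: List[str], max_percent: int, rng: random.Random) -> List[str]:
    """Choose at most max_percent of the instances as experiment targets."""
    limit = len(instances) * max_percent // 100  # integer math, rounds down
    return rng.sample(instances, limit)


def inject_with_abort(
    targets: List[str],
    inject: Callable[[str], None],
    error_rate: Callable[[], float],
    abort_above: float,
) -> List[str]:
    """Inject the failure one target at a time, stopping once users are visibly affected."""
    affected = []
    for target in targets:
        if error_rate() > abort_above:
            break  # the blast radius is growing too large: stop the experiment
        inject(target)
        affected.append(target)
    return affected
```

With 100 instances and max_percent=5, pick_targets selects five of them. If the error rate rises above the abort threshold after two injections, inject_with_abort stops there and returns only those two targets, so the remaining instances are never touched.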
I hope you now have a basic understanding of what chaos engineering is and of the tools that can be used for it.
Happy Chaos Engineering :)