Have you got a Monkey?
João Bezerra Leite
“Small moves, smartly made, can set big things in motion.” (John Hagel)
I remember very well our first visits to Silicon Valley to learn more about Chaos Engineering (back in 2015, with Lineu Andrade). At that time, when asked about our practices, we would smile and say: “Half of the task is done, we’ve got the Chaos. Now we need to work on the Engineering!”
Since then, we have been studying not just Chaos Engineering but also Resilience, and practicing a little of both here at Itau. From what we have learned, this is our best advice:
“Test in Production and Chaos Engineering are not for beginners. They are for pros.”
Chaos? What do you mean?
There are some interesting definitions for Chaos Engineering:
- Chaos Engineering is a strategy to learn about how your system behaves by conducting experiments to test for a reaction.
- Chaos Engineering is Preventive Medicine.
- Chaos Engineering is a “disciplined” approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
- Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production – from Principles of Chaos.
But the definition I like most comes from Kolton Andrus, Gremlin’s CEO:
“Breaking things on purpose in order to build more resilient systems!”
Where did it come from?
With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all depend on these systems more than ever, yet failures have become much harder to predict.
These failures cause costly outages for companies. The outages hurt customers trying to shop, transact business, and get work done. Even brief outages can impact a company's bottom line, so the cost of downtime is becoming a KPI for many engineering teams. Waiting for the next costly outage is not an option. To meet the challenge head on, more and more companies are turning to Chaos Engineering.
Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose”.
A little bit of history
This practice first became relevant at internet companies that were pioneering large scale distributed systems. These systems were so complex that they required a new approach to test for failure.
In 2010, Netflix created Chaos Monkey in response to its move from physical infrastructure to Amazon cloud infrastructure, with the objective of making sure that a loss of an Amazon instance wouldn’t affect the Netflix streaming experience.
In 2011, the Simian Army added additional failure injection modes on top of Chaos Monkey that would allow testing of a more complete suite of failure states, and thus build resilience to those as well.
In 2012, Netflix shared the source code for Chaos Monkey on Github, saying that they “have found that the best defense against major unexpected failures is to fail often". And in 2014, Netflix created the role of Chaos Engineer.
But is it just Netflix?
Actually, many large tech companies currently practice Chaos Engineering to better understand their distributed systems and microservice architectures.
Besides Netflix, the list includes Twilio, LinkedIn, Salesforce, Facebook, Google, Microsoft, GitHub, Amazon, Pivotal, Thoughtworks, New Relic and many others.
The list is always growing and more traditional industries, like finance and banking (Capital One, Visa, Fidelity, National Australia Bank, Itau) have caught on to Chaos Engineering, too.
The Principles of Chaos Engineering
1. Build a Hypothesis around Steady State Behaviour - Focus on the measurable output of a system, rather than internal attributes of the system. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
2. Vary Real-world Events - Chaos variables reflect real-world events. Prioritize events either by potential impact or by estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
3. Run Experiments in Production - Systems behave differently depending on environment and traffic patterns. Chaos strongly prefers to experiment directly on production traffic.
4. Automate Experiments to Run Continuously - Chaos experiments should run continuously rather than as one-time or periodic checks.
5. Minimise Blast Radius - Experimenting in production has the potential to cause unnecessary customer pain, so the Chaos Engineer needs to make sure the impact of each experiment is contained and manageable. (A minimal sketch of an experiment built on these principles appears right after this list.)
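To make these principles more concrete, here is a minimal, tool-agnostic sketch in Python. It is only an illustration under assumptions: the health-check URL, the steady-state threshold and the fault-injection helper are hypothetical placeholders, not the API of any specific chaos tool.

```python
import time
import requests  # any HTTP client would do

HEALTH_URL = "https://example.internal/checkout/health"  # hypothetical endpoint
STEADY_STATE_SUCCESS_RATE = 0.99                          # the hypothesis: >= 99% success
BLAST_RADIUS = 0.05                                       # inject the fault into at most 5% of calls


def success_rate(samples: int = 200) -> float:
    """Measure the steady-state metric: fraction of successful health checks."""
    ok = 0
    for _ in range(samples):
        try:
            if requests.get(HEALTH_URL, timeout=2).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass
        time.sleep(0.05)
    return ok / samples


def inject_latency_fault(fraction: float) -> None:
    """Placeholder for a real-world event (e.g. extra latency on a dependency).

    In practice this would call your chaos tooling of choice; it is purely
    hypothetical here.
    """


if __name__ == "__main__":
    baseline = success_rate()
    print(f"baseline steady state: {baseline:.2%}")

    inject_latency_fault(BLAST_RADIUS)  # vary a real-world event, within a small blast radius
    under_fault = success_rate()
    print(f"steady state under fault: {under_fault:.2%}")

    # The experiment passes if the hypothesis holds: steady state is preserved.
    assert under_fault >= STEADY_STATE_SUCCESS_RATE, \
        "hypothesis violated: investigate before this becomes an outage"
```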
The principles of Chaos Engineering can also be applied to other aspects of development and operations. Consider the concept of “canary analysis”: when new code is deployed, you can measure performance on a limited number of systems before deploying it more widely. In effect, canary analysis is a sanity check applied to staged software rollouts, with the benefit of performance logging. If your predetermined steady state does not fluctuate unexpectedly, the release can be deployed widely. If the canary deployment exceeds your predetermined “error budget,” it is withdrawn for further refinement to protect the integrity of the service.
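As a small illustration of that check, the sketch below compares a canary’s observed error rate against a predetermined error budget before widening a rollout. The function name, threshold and numbers are hypothetical; a real canary analysis would pull these metrics from your monitoring system.

```python
def canary_within_budget(canary_errors: int, canary_requests: int,
                         error_budget: float = 0.001) -> bool:
    """Return True if the canary's observed error rate stays inside the budget."""
    if canary_requests == 0:
        return False                 # no traffic yet: do not promote
    observed = canary_errors / canary_requests
    return observed <= error_budget


# Example: 3 errors over 10,000 canary requests against a 0.1% error budget.
if canary_within_budget(canary_errors=3, canary_requests=10_000):
    print("steady state held: continue the staged rollout")
else:
    print("error budget exceeded: withdraw the canary for refinement")
```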
What about Testing and Resilience?
By now you must be wondering: how is Chaos Engineering different from testing and from resilience? First, you have to consider that these practices complement each other.
Testing Vs Chaos Engineering
Most of the time, a good test plan covers load testing, security testing and functional testing under load. Unfortunately, we usually run these tests only in non-production environments and hope the system behaves the same way in production. This is where Chaos Engineering prepares us, by running experiments as close to production as possible, and sometimes in production itself. And one of the main differences between testing and Chaos Engineering is the outcome: Chaos Engineering brings new knowledge about the system which even developers or testers might not be aware of.
Shamim Ahmed, CTO of Continuous Delivery at CA Technologies, points out that Chaos Engineering and negative testing, for instance, are closely correlated. The same principles of Chaos could be used for testing bad data, unexpected scenarios and destructive tests.
Resilience Vs Chaos Engineering
Following a different path from Netflix, instead of jumping straight into Chaos experiments, LinkedIn opted for resilience engineering efforts with Project Waterbear, which aimed to:
- ensure they run on a resilient cluster of resources,
- create or maintain robust infrastructure,
- handle failures intelligently,
- gracefully degrade when required, and
- increase SRE happiness by designing self-healing systems.
And this is done via three software platforms they have built internally:
- FireDrill - provides an automated, systematic way to trigger or simulate infrastructure failures in production, with the goal of helping build applications resistant to these failures.
- LinkedOut - a framework and tooling to test how the user experience degrades in different failure scenarios associated with downstream calls (a generic sketch of this style of test follows this list).
- D2 Tuner - analyses client-server latency and error rates (and recommends degradation and timeout thresholds).
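FireDrill, LinkedOut and D2 Tuner are LinkedIn-internal platforms, so the sketch below is not their actual API. It is only a generic illustration of the LinkedOut-style idea: fail a single downstream call and check that the user-facing response degrades gracefully instead of breaking. All names in it are hypothetical.

```python
from unittest import mock


def fetch_recommendations(user_id: str) -> list[str]:
    """Downstream call; in production this would hit a recommendation service."""
    raise NotImplementedError  # hypothetical dependency


def render_home_page(user_id: str) -> dict:
    """Render the page, degrading gracefully if recommendations are unavailable."""
    try:
        recs = fetch_recommendations(user_id)
    except Exception:
        recs = []  # graceful degradation: hide the recommendations module
    return {"status": "ok", "recommendations": recs}


def test_home_page_survives_recommendation_outage() -> None:
    # Inject the failure scenario for the downstream call only.
    with mock.patch(__name__ + ".fetch_recommendations", side_effect=TimeoutError):
        page = render_home_page("user-123")
    assert page["status"] == "ok"
    assert page["recommendations"] == []


if __name__ == "__main__":
    test_home_page_survives_recommendation_outage()
    print("degraded gracefully: page rendered without recommendations")
```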
So LinkedIn, although it also adopts Chaos Engineering, takes an approach of improving application resiliency with a mindset of building failure-tolerant infrastructure, through the Waterbear tools. As Brian Wilcox, Staff Site Reliability Engineer at LinkedIn, says, “our goal is to help people to be successful, so our experiments should never impact our users”.
In fact, in 2014, Netflix also announced Failure Injection Testing (FIT), a new tool built on the concepts of the Simian Army, but one that gave developers more granular control over the “blast radius” of their failure injection. The Simian Army tools had been so effective that in some instances they created painful outages, causing many Netflix developers to grow wary of them. FIT gave developers control over the scope of their failures so they could realize the insights of Chaos Engineering while mitigating the potential downside.
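FIT’s configuration is internal to Netflix, so the following is only an illustrative sketch of the general idea, with assumed names: a failure scenario carries an explicit scope, and the fault is injected only for a small, deterministic slice of traffic (plus synthetic test accounts).

```python
import hashlib

# Illustrative only (FIT itself is Netflix-internal): a failure scenario carries
# an explicit scope, so the fault is injected only for a small slice of traffic.
FAULT_SCENARIO = {
    "name": "fail-payment-gateway-call",                 # hypothetical scenario name
    "percentage": 1.0,                                   # at most 1% of real traffic
    "allowlist_customers": {"synthetic-test-account"},   # always-included test users
}


def should_inject(customer_id: str, scenario: dict) -> bool:
    """Decide whether this request falls inside the experiment's blast radius."""
    if customer_id in scenario["allowlist_customers"]:
        return True
    # Deterministic hashing keeps the same small cohort inside the experiment.
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 10_000
    return bucket < scenario["percentage"] * 100  # 1% -> buckets 0..99


print(should_inject("synthetic-test-account", FAULT_SCENARIO))  # True
print(should_inject("customer-42", FAULT_SCENARIO))             # True for roughly 1% of ids
```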
What’s next?
Some managers may be hesitant to implement Chaos Engineering in their organizations, as the risks of failure are higher for them than for Netflix. If something goes wrong with Netflix’s network, the customer is inconvenienced by a video that does not play.
But imagine a bank! You really need tight control over what you are doing. Usually, banks that have adopted Chaos Engineering prefer to experiment in pre-production or in very well controlled environments.
We must remember that Chaos Engineering means “shift-right” (taking tests and experiments to production). It is a practice to be adopted at scale only by companies that have already succeeded in the “shift-left” move (quality in the design) and that are very robust in Observability. Without strong monitoring and control (such as FIT at Netflix), teams will not adopt Chaos Engineering. As Vivek Rau, Site Reliability Engineer Manager at Google Cloud, says, they will just be paged more frequently to solve problems that they have generated themselves.
As the adoption of Chaos Engineering expands fast, new start-ups such as Gremlin will surf that wave, supporting large companies in leveraging this practice and speeding up knowledge sharing and the use of automated tools. If you take the right journey, work with discipline and build control and monitoring around it, Chaos Engineering and Resilience will be good partners of Continuous Testing.
“Train in the calm before the storm, so you will be calm in the storm”
(by Sathiya Shunmugasunda and Gnani Dathathreya)
Well, I never thought chaos could be so helpful and bring such discipline. What about you? Have you already got a Monkey?
If you want to know more about Chaos Engineering, join us at our Chaos Engineering Sao Paulo Meetup Group, organized by Thiago Segantini, Andrea Cabeça and Augusto Stracieri.
Here are some interesting references:
- The evolution of Chaos (Kolton Andrus)
- Test in Production: a panel discussion on Chaos Engineering