Have you got a Monkey?
João Bezerra Leite
“Small moves, smartly made, can set big things in motion.” (John Hagel)
I remember very well our first visits to Silicon Valley to learn more about Chaos Engineering (back in 2015, with Lineu Andrade). At that time, when asked about our practices, we would smile and say: “Half of the task is done, we’ve got the Chaos. Now we need to work on the Engineering!”
Since then, we have been studying not just Chaos Engineering but also Resilience, and practicing a little of both here at Itau. From what we have learned, this is our best advice:
“Test in Production and Chaos Engineering are not for beginners. They are for pros.”
Chaos? What do you mean?
There are some interesting definitions for Chaos Engineering:
- Chaos Engineering is a strategy to learn about how your system behaves by conducting experiments to test for a reaction.
- Chaos Engineering is Preventive Medicine.
- Chaos Engineering is a “disciplined” approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
- Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production – from Principles of Chaos.
But the definition I like most comes from Kolton Andrus, Gremlin’s CEO:
“Breaking things on purpose in order to build more resilient systems!”
Where did it come from?
With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all depend on these systems more than ever, yet failures have become much harder to predict.
These failures cause costly outages for companies. The outages hurt customers trying to shop, transact business, and get work done. Even brief outages can impact a company's bottom line, so the cost of downtime is becoming a KPI for many engineering teams. Waiting for the next costly outage is not an option. To meet the challenge head on, more and more companies are turning to Chaos Engineering.
Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose”.
A little bit of history
This practice first became relevant at internet companies that were pioneering large scale distributed systems. These systems were so complex that they required a new approach to test for failure.
In 2010, Netflix created Chaos Monkey in response to its move from physical infrastructure to Amazon cloud infrastructure, with the objective of making sure that a loss of an Amazon instance wouldn’t affect the Netflix streaming experience.
In 2011, the Simian Army added additional failure injection modes on top of Chaos Monkey that would allow testing of a more complete suite of failure states, and thus build resilience to those as well.
In 2012, Netflix shared the source code for Chaos Monkey on Github, saying that they “have found that the best defense against major unexpected failures is to fail often". And in 2014, Netflix created the role of Chaos Engineer.
But is it just Netflix?
Actually, many large tech companies currently practice Chaos Engineering to better understand their distributed systems and microservice architectures.
Besides Netflix, the list includes Twilio, LinkedIn, Salesforce, Facebook, Google, Microsoft, GitHub, Amazon, Pivotal, Thoughtworks, New Relic and many others.
The list is always growing and more traditional industries, like finance and banking (Capital One, Visa, Fidelity, National Australia Bank, Itau) have caught on to Chaos Engineering, too.
The Principles of Chaos Engineering
1. Build a Hypothesis around Steady State Behaviour - Focus on the measurable output of a system, rather than internal attributes of the system. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
2. Vary Real-world Events - Chaos variables reflect real-world events. Prioritize events either by potential impact or by estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
3. Run Experiments in Production - Systems behave differently depending on environment and traffic patterns. Chaos strongly prefers to experiment directly on production traffic.
4. Automate Experiments to Run Continuously - Chaos experiments should run continuously rather than as one-time or periodic checks.
5. Minimise Blast Radius - Experimenting in production has the potential to cause unnecessary customer pain, so the Chaos Engineer needs to make sure the impact of each experiment is contained and manageable. (A minimal sketch of an experiment built on these principles appears right after this list.)
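To make these principles more concrete, here is a minimal, tool-agnostic sketch in Python. It is only an illustration under assumptions: the health-check URL, the steady-state threshold and the fault-injection helper are hypothetical placeholders, not the API of any specific chaos tool.

```python
import time
import requests  # any HTTP client would do

HEALTH_URL = "https://example.internal/checkout/health"  # hypothetical endpoint
STEADY_STATE_SUCCESS_RATE = 0.99                          # the hypothesis: >= 99% success
BLAST_RADIUS = 0.05                                       # inject the fault into at most 5% of calls


def success_rate(samples: int = 200) -> float:
    """Measure the steady-state metric: fraction of successful health checks."""
    ok = 0
    for _ in range(samples):
        try:
            if requests.get(HEALTH_URL, timeout=2).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass
        time.sleep(0.05)
    return ok / samples


def inject_latency_fault(fraction: float) -> None:
    """Placeholder for a real-world event (e.g. extra latency on a dependency).

    In practice this would call your chaos tooling of choice; it is purely
    hypothetical here.
    """


if __name__ == "__main__":
    baseline = success_rate()
    print(f"baseline steady state: {baseline:.2%}")

    inject_latency_fault(BLAST_RADIUS)  # vary a real-world event, within a small blast radius
    under_fault = success_rate()
    print(f"steady state under fault: {under_fault:.2%}")

    # The experiment passes if the hypothesis holds: steady state is preserved.
    assert under_fault >= STEADY_STATE_SUCCESS_RATE, \
        "hypothesis violated: investigate before this becomes an outage"
```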
The principles of Chaos Engineering can also be applied to other aspects of development and operations. Consider the concept of “canary analysis”: when new code is deployed, you can measure performance on a limited number of systems before deploying it more widely. In effect, canary analysis is a sanity check applied to staged software rollouts, with the benefit of performance logging. If your predetermined steady state does not fluctuate unexpectedly, the release can be deployed widely. If the canary deployment exceeds your predetermined “error budget,” it is withdrawn for further refinement to protect the integrity of the service.
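As a small illustration of that check, the sketch below compares a canary’s observed error rate against a predetermined error budget before widening a rollout. The function name, threshold and numbers are hypothetical; a real canary analysis would pull these metrics from your monitoring system.

```python
def canary_within_budget(canary_errors: int, canary_requests: int,
                         error_budget: float = 0.001) -> bool:
    """Return True if the canary's observed error rate stays inside the budget."""
    if canary_requests == 0:
        return False                 # no traffic yet: do not promote
    observed = canary_errors / canary_requests
    return observed <= error_budget


# Example: 3 errors over 10,000 canary requests against a 0.1% error budget.
if canary_within_budget(canary_errors=3, canary_requests=10_000):
    print("steady state held: continue the staged rollout")
else:
    print("error budget exceeded: withdraw the canary for refinement")
```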
What about Testing and Resilience?
By now you must be wondering: how is Chaos Engineering different from testing and from resilience? First, you have to consider that these practices complement each other.
Testing Vs Chaos Engineering
Most of the time, a good test plan covers load testing, security testing and functional testing under load. Unfortunately, we usually run these tests only in non-production environments and hope the system behaves the same way in production. This is where Chaos Engineering prepares us, by running experiments as close to production as possible, and sometimes in production itself. And one of the main differences between testing and Chaos Engineering is the outcome: Chaos Engineering brings new knowledge about the system which even developers or testers might not be aware of.
Shamim Ahmed, CTO of Continuous Delivery at CA Technologies, points out that Chaos Engineering and negative testing, for instance, are closely correlated. The same principles of Chaos could be used for testing bad data, unexpected scenarios and destructive tests.
Resilience Vs Chaos Engineering
Following a different path from Netflix, instead of jumping straight into Chaos experiments, LinkedIn opted for resilience engineering efforts with Project Waterbear, which aimed to:
- ensure they run on a resilient cluster of resources,
- create or maintain robust infrastructure,
- handle failures intelligently,
- gracefully degrade when required, and
- increase SRE happiness by designing self-healing systems.
And this is done via three software platforms they have built internally:
- FireDrill - provides an automated, systematic way to trigger or simulate infrastructure failures in production, with the goal of helping build applications resistant to these failures.
- LinkedOut - a framework and tooling to test how the user experience degrades in different failure scenarios associated with downstream calls (a generic sketch of this style of test follows this list).
- D2 Tuner - analyses client-server latency and error rates (and recommends degradation and timeout thresholds).
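FireDrill, LinkedOut and D2 Tuner are LinkedIn-internal platforms, so the sketch below is not their actual API. It is only a generic illustration of the LinkedOut-style idea: fail a single downstream call and check that the user-facing response degrades gracefully instead of breaking. All names in it are hypothetical.

```python
from unittest import mock


def fetch_recommendations(user_id: str) -> list[str]:
    """Downstream call; in production this would hit a recommendation service."""
    raise NotImplementedError  # hypothetical dependency


def render_home_page(user_id: str) -> dict:
    """Render the page, degrading gracefully if recommendations are unavailable."""
    try:
        recs = fetch_recommendations(user_id)
    except Exception:
        recs = []  # graceful degradation: hide the recommendations module
    return {"status": "ok", "recommendations": recs}


def test_home_page_survives_recommendation_outage() -> None:
    # Inject the failure scenario for the downstream call only.
    with mock.patch(__name__ + ".fetch_recommendations", side_effect=TimeoutError):
        page = render_home_page("user-123")
    assert page["status"] == "ok"
    assert page["recommendations"] == []


if __name__ == "__main__":
    test_home_page_survives_recommendation_outage()
    print("degraded gracefully: page rendered without recommendations")
```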
So LinkedIn, although it also adopts Chaos Engineering, takes an approach of improving application resiliency with a mindset of building failure-tolerant infrastructure, through the Waterbear tools. As Brian Wilcox, Staff Site Reliability Engineer at LinkedIn, says, “our goal is to help people to be successful, so our experiments should never impact our users”.
In fact, in 2014, Netflix also announced Failure Injection Testing (FIT), a new tool built on the concepts of the Simian Army, but one that gave developers more granular control over the “blast radius” of their failure injection. The Simian Army tools had been so effective that in some instances they created painful outages, causing many Netflix developers to grow wary of them. FIT gave developers control over the scope of their failures so they could realize the insights of Chaos Engineering while mitigating the potential downside.
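FIT’s configuration is internal to Netflix, so the following is only an illustrative sketch of the general idea, with assumed names: a failure scenario carries an explicit scope, and the fault is injected only for a small, deterministic slice of traffic (plus synthetic test accounts).

```python
import hashlib

# Illustrative only (FIT itself is Netflix-internal): a failure scenario carries
# an explicit scope, so the fault is injected only for a small slice of traffic.
FAULT_SCENARIO = {
    "name": "fail-payment-gateway-call",                 # hypothetical scenario name
    "percentage": 1.0,                                   # at most 1% of real traffic
    "allowlist_customers": {"synthetic-test-account"},   # always-included test users
}


def should_inject(customer_id: str, scenario: dict) -> bool:
    """Decide whether this request falls inside the experiment's blast radius."""
    if customer_id in scenario["allowlist_customers"]:
        return True
    # Deterministic hashing keeps the same small cohort inside the experiment.
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 10_000
    return bucket < scenario["percentage"] * 100  # 1% -> buckets 0..99


print(should_inject("synthetic-test-account", FAULT_SCENARIO))  # True
print(should_inject("customer-42", FAULT_SCENARIO))             # True for roughly 1% of ids
```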
What’s next?
Some managers may be hesitant to implement Chaos Engineering in their organizations, as the risks of failure are higher for them than for Netflix. If something goes wrong with Netflix’s network, the customer is inconvenienced by a video that does not play.
But imagine a bank! You really need tight control over what you are doing. Usually, banks that have adopted Chaos Engineering prefer to experiment in pre-production or in very well controlled environments.
We must remember that Chaos Engineering means “shift-right” (taking tests and experiments to production). It is a practice to be adopted at scale only by companies that have already succeeded in the “shift-left” move (quality in the design) and that are very robust in Observability. Without strong monitoring and control (such as FIT at Netflix), teams will not adopt Chaos Engineering. As Vivek Rau, Site Reliability Engineer Manager at Google Cloud, says, they will just be paged more frequently to solve problems that they have generated themselves.
As the adoption of Chaos Engineering expands fast, new start-ups such as Gremlin will surf that wave, supporting large companies in leveraging this practice and speeding up knowledge sharing and the use of automated tools. If you take the right journey, work with discipline and build control and monitoring around it, Chaos Engineering and Resilience will be good partners of Continuous Testing.
“Train in the calm before the storm, so you will be calm in the storm”
(by Sathiya Shunmugasunda and Gnani Dathathreya)
Well, I never thought chaos could be so helpful and bring such discipline. What about you? Have you already got a Monkey?
If you want to know more about Chaos Engineering, join us at our Chaos Engineering Sao Paulo Meetup Group, organized by Thiago Segantini, Andrea Cabeça and Augusto Stracieri.
Here are some interesting references:
- The evolution of Chaos (Kolton Andrus)
- Test in Production: a panel discussion on Chaos Engineering