Chaos Engineering - Breaking Things On Purpose.

With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all depend on these systems more than ever, yet failures have become much harder to predict.

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

Similarly, do you know the answers to these questions: What will happen to your application if Amazon S3, one of the most widely used Amazon Web Services, suddenly goes down? Are you confident that your website will continue serving requests if it fails to load assets from the cloud? What about your deployment system? Will you still be able to add capacity to your fleet? And don't forget the dozen or so other backend services you're running for file sharing, data analytics, secrets management, and the like, all of which depend on S3 to operate correctly. The question really becomes: is the distributed system you've built resilient enough to survive such an outage?
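One way to probe these questions in code is to make the fallback path explicit and exercise it with a simulated outage. The sketch below is illustrative only: `DownStore` stands in for an unreachable S3 bucket, and `load_asset`, `store`, and `cache` are hypothetical names, not part of any real SDK.

```python
class DownStore(dict):
    """A stand-in for an object store whose every read fails,
    simulating an S3 outage without touching real infrastructure."""
    def __getitem__(self, key):
        raise ConnectionError("simulated S3 outage")


def load_asset(key, store, cache):
    """Fetch an asset from the primary store, degrading gracefully
    to a local cache (or a placeholder) when the store is down."""
    try:
        return store[key]
    except Exception:
        return cache.get(key, "placeholder")


# Normal operation: the asset comes from the primary store.
print(load_asset("logo", {"logo": "fresh-logo"}, {}))
# During the simulated outage: the cached copy is served instead.
print(load_asset("logo", DownStore(), {"logo": "cached-logo"}))
```

Running the failure path on purpose, rather than waiting for a real outage, is exactly the kind of experiment the rest of this article describes.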

The truth is: you can never be sure. You don't know what's going to happen. There will always be something that can — and will — go wrong, from self-inflicted outages caused by bad configuration pushes or buggy images to events outside your control, like denial-of-service attacks or network failures. No matter how hard you try, you can't build perfect software (or hardware, for that matter). Nor can the companies you depend on.

"The best way to avoid failure is to fail constantly."

This is where Chaos Engineering comes in. Rather than waiting for things to break in production at the worst possible time, its core idea is to proactively inject failures so you are prepared when disaster strikes.

Netflix went the extra mile and built several autonomous agents, so-called “monkeys”, for injecting failures and creating different kinds of outages. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while the system continues serving customers without interruption. Together they form the Simian Army.


Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
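The heart of Chaos Monkey is nothing more exotic than picking a random running instance and terminating it. A minimal sketch of that selection step, with a hypothetical `pick_victim` function and fake instance IDs (a real version would call your cloud provider's termination API):

```python
import random


def pick_victim(instances, seed=None):
    """Randomly choose one instance from the fleet to terminate.

    `seed` makes the choice reproducible for testing; production
    chaos tooling would leave it unset so failures stay random.
    """
    rng = random.Random(seed)
    if not instances:
        return None
    return rng.choice(instances)


fleet = ["i-0a1", "i-0b2", "i-0c3"]
victim = pick_victim(fleet, seed=42)
print(f"terminating {victim}")  # one of the fleet IDs
```

The interesting engineering is not in this function but in everything around it: scheduling it during business hours so people are on hand to respond, scoping it to opted-in services, and watching whether traffic reroutes cleanly.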

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
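Latency injection can be sketched as a thin wrapper around a client call that sleeps for a random interval before proceeding. This is an illustrative stand-alone example, not Netflix's implementation; `with_latency` and its parameters are made-up names. Cranking `max_delay` up far past the caller's timeout approximates the "simulated downtime" described above.

```python
import random
import time


def with_latency(call, min_delay=0.0, max_delay=0.05, rng=None):
    """Wrap a client call so it pauses a random interval first,
    simulating a degraded (or, with huge delays, a dead) dependency."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        time.sleep(rng.uniform(min_delay, max_delay))
        return call(*args, **kwargs)

    return wrapped


# A pretend downstream call, slowed by up to 10 ms of injected latency.
fetch_profile = with_latency(lambda: {"user": "alice"}, max_delay=0.01)
print(fetch_profile())
```

In practice this kind of injection lives in the RPC or service-mesh layer rather than in application code, so delays can be toggled per service without redeploying callers.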

Security Monkey is an extension of Conformity Monkey. It finds security violations and vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and not about to expire.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.

Conclusions

After running a chaos experiment, there are typically two outcomes: either you've verified that your system is resilient to the injected failure, or you've found a problem you need to fix. Both are good outcomes, provided the experiment was first run in a staging environment. In the first case, you've increased your confidence in the system and its behaviour; in the second, you've found a problem before it caused an outage.

Chaos Engineering is a tool to make your job easier. By proactively testing and validating your system’s failure modes you reduce your operational burden, increase your resiliency, and will sleep better at night.

