Chaos Engineering - Breaking Things On Purpose.

With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all depend on these systems more than ever, yet failures have become much harder to predict.

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

Similarly, do you know the answers to these questions: What will happen to your application if Amazon S3, one of the most widely used Amazon Web Services, suddenly goes down? Are you confident that your website will continue serving requests if it fails to load assets from the cloud? What about your deployment system? Will you still be able to add capacity to your fleet? And don't forget the dozen or so other backend services you're running for file sharing, data analytics, secrets management, and the like, all of which depend on S3 to operate correctly. The question really becomes: is the distributed system you've built resilient enough to survive such an outage?
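One way to probe these questions in code is to make the fallback path explicit and exercise it with a simulated outage. The sketch below is illustrative only: `DownStore` stands in for an unreachable S3 bucket, and `load_asset`, `store`, and `cache` are hypothetical names, not part of any real SDK.

```python
class DownStore(dict):
    """A stand-in for an object store whose every read fails,
    simulating an S3 outage without touching real infrastructure."""
    def __getitem__(self, key):
        raise ConnectionError("simulated S3 outage")


def load_asset(key, store, cache):
    """Fetch an asset from the primary store, degrading gracefully
    to a local cache (or a placeholder) when the store is down."""
    try:
        return store[key]
    except Exception:
        return cache.get(key, "placeholder")


# Normal operation: the asset comes from the primary store.
print(load_asset("logo", {"logo": "fresh-logo"}, {}))
# During the simulated outage: the cached copy is served instead.
print(load_asset("logo", DownStore(), {"logo": "cached-logo"}))
```

Running the failure path on purpose, rather than waiting for a real outage, is exactly the kind of experiment the rest of this article describes.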

The truth is: you can never be sure. You don't know what's going to happen. There will always be something that can — and will — go wrong, from self-inflicted outages caused by bad configuration pushes or buggy images to events outside your control, like denial-of-service attacks or network failures. No matter how hard you try, you can't build perfect software (or hardware, for that matter). Nor can the companies you depend on.

"The best way to avoid failure is to fail constantly."

This is where Chaos Engineering comes in. Rather than waiting for things to break in production at the worst possible time, its core idea is to proactively inject failures so you are prepared when disaster strikes.

Netflix went the extra mile and built several autonomous agents, so-called “monkeys”, for injecting failures and creating different kinds of outages. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while the system continues serving customers without interruption. Together they form the Simian Army.


Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
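The heart of Chaos Monkey is nothing more exotic than picking a random running instance and terminating it. A minimal sketch of that selection step, with a hypothetical `pick_victim` function and fake instance IDs (a real version would call your cloud provider's termination API):

```python
import random


def pick_victim(instances, seed=None):
    """Randomly choose one instance from the fleet to terminate.

    `seed` makes the choice reproducible for testing; production
    chaos tooling would leave it unset so failures stay random.
    """
    rng = random.Random(seed)
    if not instances:
        return None
    return rng.choice(instances)


fleet = ["i-0a1", "i-0b2", "i-0c3"]
victim = pick_victim(fleet, seed=42)
print(f"terminating {victim}")  # one of the fleet IDs
```

The interesting engineering is not in this function but in everything around it: scheduling it during business hours so people are on hand to respond, scoping it to opted-in services, and watching whether traffic reroutes cleanly.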

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
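Latency injection can be sketched as a thin wrapper around a client call that sleeps for a random interval before proceeding. This is an illustrative stand-alone example, not Netflix's implementation; `with_latency` and its parameters are made-up names. Cranking `max_delay` up far past the caller's timeout approximates the "simulated downtime" described above.

```python
import random
import time


def with_latency(call, min_delay=0.0, max_delay=0.05, rng=None):
    """Wrap a client call so it pauses a random interval first,
    simulating a degraded (or, with huge delays, a dead) dependency."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        time.sleep(rng.uniform(min_delay, max_delay))
        return call(*args, **kwargs)

    return wrapped


# A pretend downstream call, slowed by up to 10 ms of injected latency.
fetch_profile = with_latency(lambda: {"user": "alice"}, max_delay=0.01)
print(fetch_profile())
```

In practice this kind of injection lives in the RPC or service-mesh layer rather than in application code, so delays can be toggled per service without redeploying callers.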

Security Monkey is an extension of Conformity Monkey. It finds security violations and vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and not about to expire.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.

Conclusions

After running a chaos experiment, there are typically two outcomes: either you've verified that your system is resilient to the injected failure, or you've found a problem you need to fix. Both are good outcomes, provided the experiment was first run in a staging environment. In the first case, you've increased your confidence in the system and its behaviour; in the second, you've found a problem before it caused an outage.

Chaos Engineering is a tool to make your job easier. By proactively testing and validating your system’s failure modes you reduce your operational burden, increase your resiliency, and will sleep better at night.

