What's Chaos Engineering - Why is it so important?

Shrey Batra

CEO @ Cosmocloud | Ex-LinkedIn | Angel Investor | MongoDB Champion | Book Author | Patent Holder (Distributed Algorithms)

Published: January 28, 2022

With the rise of microservices, cloud architecture and distributed systems, every company now runs many services and components. Everywhere you look, people talk about building "microservices", breaking things down into small components, and using multiple cloud services, technologies and frameworks. It's getting more complicated day by day..!!

What happens when something fails in your systems? How can we predict failures beforehand? How costly are these failures? Let's see..!

Before going ahead, you can also check my Live Online Sessions on learning multiple tech stacks and System Design concepts and register for those..!

A Look Back at Major Downtimes in Huge Companies

In December 2021, the AWS us-east-1 region had a HUGE outage, taking down a large part of the internet with it..! Services like Netflix, Twitter, Coinbase, BitBucket and many more were affected.

Another famous incident is the IT systems failure that stranded tens of thousands of British Airways (BA) passengers in May 2017 and cost the company 80 million pounds ($102.19 million USD).

The most famous recent downtime was Facebook's in October 2021, when a faulty network configuration change took their services offline along with their internal tools, which reportedly even kept engineers from getting into the data centers to fix the problem. LOL..! xD

As you can see, no single component in this huge, complex stack gives you 100% reliability, and no matter how expensive your hardware is, it can eventually fail. And even a 1-hour downtime can hit a company's revenue hard..!
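To put rough numbers on this, here is a minimal sketch of how availability compounds when a request needs every component in a chain to be up. The 99.9% figure and the six-component stack are made-up illustration values, and the calculation assumes that components fail independently, which real systems often do not.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a request path that needs every component to be up.

    Assumes independent failures (an assumption, not a given).
    """
    return prod(availabilities)


def downtime_hours_per_year(availability, hours_per_year=365 * 24):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * hours_per_year


# Hypothetical stack: six components in series, each 99.9% available.
stack = [0.999] * 6
overall = composite_availability(stack)

print(f"overall availability: {overall:.2%}")
print(f"expected downtime: {downtime_hours_per_year(overall):.1f} hours/year")
```

Even with every component at "three nines", the chain as a whole lands at roughly 99.4%, which is about two days of downtime a year.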

Predicting Outages Even Before They Happen

Chaos Engineering is a way of finding failures before they become huge outages..! By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. How..? Let's see..!

Chaos Engineering and its Approach

Chaos Engineering is a practice in which we intentionally break parts of our systems and see how the system responds to and mitigates these failures. Let's take an example -

Imagine we are building a small application involving these components -

Web application, Android app and an iOS app.
Your API backend in any language.
An application database - let's say MySQL.
Deploying on AWS cloud - let's say on EC2.
Using API gateway and Application Load Balancers.
Using Kafka / AWS Queue services for some background tasks.
And sooooo many more techs...!

Let's just pick one of these components to keep the example simple -- our MySQL database. Imagine we have our MySQL database deployed like this -

We have our primary instance with 2 replicas in our AWS Region A, as well as a Pseudo Primary and Replicas in another AWS Region B.

In Chaos Engineering, we -

First plan our experiment - What to target, how to target and so on...
Contain the blast radius - Execute the smallest test that will teach you something - a bug, a failure, anything.
Scale or squash - Found an issue? Job well done..! Fix it. Otherwise, increase the blast radius until you are at full scale.
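The three steps above can be sketched as a tiny experiment runner. This is only an illustration: the `Experiment` class, `run_plan` and the hard-coded results are hypothetical names and values, not part of any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Experiment:
    name: str
    blast_radius: int          # e.g. number of instances affected
    run: Callable[[], bool]    # returns True if the system stayed healthy


def run_plan(experiments: List[Experiment]) -> Optional[Experiment]:
    """Run experiments from the smallest to the largest blast radius.

    Stops at the first experiment that exposes a failure and returns it.
    Returns None if every experiment passed at full scale.
    """
    for exp in sorted(experiments, key=lambda e: e.blast_radius):
        healthy = exp.run()
        status = "ok" if healthy else "FAILED"
        print(f"{exp.name} (radius={exp.blast_radius}): {status}")
        if not healthy:
            return exp  # squash: fix this before scaling any further
    return None


# Hypothetical plan; the lambdas stand in for real fault injections.
plan = [
    Experiment("stop whole Region A cluster", 3, lambda: False),
    Experiment("stop one replica", 1, lambda: True),
    Experiment("stop both replicas", 2, lambda: True),
]
first_failure = run_plan(plan)
```

The plan is sorted by blast radius before running, so the smallest test always goes first no matter how the list was written.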

Coming Back to Our Example

Let's say we plan to target our MySQL cluster. We start small. First, let's shut down one of our replicas and see how the system responds. Not a big deal: our cluster removes the old instance, brings up another replica, copies data from the primary, and the system is back live..!
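As a toy model of this first experiment, the sketch below kills one replica and lets the cluster bring itself back to the desired replica count. The `Cluster` class and its method names are invented for illustration; a real MySQL setup would rely on its own orchestration tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for one primary plus its replicas."""
    primary: str = "primary"
    replicas: List[str] = field(default_factory=lambda: ["replica-1", "replica-2"])
    desired_replicas: int = 2
    _next_id: int = 3

    def kill_replica(self, name: str) -> None:
        """The chaos action: remove one replica from the cluster."""
        self.replicas.remove(name)

    def reconcile(self) -> List[str]:
        """Clone new replicas off the primary until the desired count is met."""
        created = []
        while len(self.replicas) < self.desired_replicas:
            new_name = f"replica-{self._next_id}"
            self._next_id += 1
            self.replicas.append(new_name)
            created.append(new_name)
        return created


cluster = Cluster()
cluster.kill_replica("replica-1")
created = cluster.reconcile()
print("created:", created)
print("replicas now:", cluster.replicas)
```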

Scaling our experiment

Once we make sure our replica failover is consistent, we know that the clone will occur, but we don't know the mean time it takes from experiencing a failure to adding a clone back to the cluster. This is our second experiment.
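One way to get that mean time is to repeat the failure several times and time each recovery. The sketch below does this against a fake cluster and a fake clock so it runs instantly; `measure_recovery`, `FakeCluster` and `FakeClock` are hypothetical helpers, and against a real system you would pass real inject and health-check functions instead.

```python
import statistics
import time


def measure_recovery(inject_failure, is_recovered, clock=time.monotonic,
                     sleep=time.sleep, interval=1.0, timeout=600.0):
    """Inject a failure, poll until the system is healthy, return seconds taken."""
    inject_failure()
    start = clock()
    while not is_recovered():
        if clock() - start > timeout:
            raise TimeoutError("system did not recover in time")
        sleep(interval)
    return clock() - start


class FakeClock:
    """Simulated time, so the example does not really wait."""
    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class FakeCluster:
    """Hypothetical stand-in that reports unhealthy for a fixed number of polls."""
    def __init__(self, polls_to_recover):
        self.polls_to_recover = polls_to_recover
        self.remaining = 0

    def fail(self):
        self.remaining = self.polls_to_recover

    def healthy(self):
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True


clock = FakeClock()
recovery_times = []
for polls in (3, 5, 4):  # made-up recovery durations for three trials
    fake = FakeCluster(polls)
    recovery_times.append(
        measure_recovery(fake.fail, fake.healthy, clock=clock.time, sleep=clock.sleep)
    )

mean_recovery = statistics.mean(recovery_times)
print("recovery times (s):", recovery_times)
print("mean time to recovery (s):", mean_recovery)
```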

Once we have measured that, we know we will get an alert if the cluster still has only one replica after 5 minutes, but we don't know whether our alerting threshold should be adjusted to prevent incidents more effectively.
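To reason about that threshold, you can replay a recorded timeline of replica counts against different hold times and see when (or whether) the alert would fire. This is a rough sketch: `first_alert_time` and the sample timeline are assumptions for illustration, not the behaviour of any particular monitoring tool.

```python
def first_alert_time(samples, min_replicas=2, hold_seconds=300):
    """Return the timestamp at which the alert would fire, or None.

    samples: list of (timestamp_seconds, replica_count), sorted by time.
    The alert fires once the count has stayed below min_replicas
    for at least hold_seconds without a break.
    """
    degraded_since = None
    for timestamp, count in samples:
        if count < min_replicas:
            if degraded_since is None:
                degraded_since = timestamp
            if timestamp - degraded_since >= hold_seconds:
                return timestamp
        else:
            degraded_since = None  # recovered, so reset the timer
    return None


# Made-up timeline: one sample per minute, with a single replica
# from t=120s until the cluster recovers at t=480s.
timeline = [(t, 1 if 120 <= t < 480 else 2) for t in range(0, 601, 60)]

for hold in (120, 300, 600):
    print(f"hold={hold}s -> alert at {first_alert_time(timeline, hold_seconds=hold)}")
```

With this timeline, a 5-minute hold fires only one minute before the cluster recovers on its own, which is the kind of finding that suggests revisiting the threshold.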

Next, if we shut down both replicas of a cluster at the same time, we don't know exactly how long it would take, on a busy Monday morning, to clone two new replicas off the existing primary. But we do know we have a pseudo primary and two replicas in the other region which will also have the transactions.

Once we have tested our systems against all the failures above, we still don't know exactly what would happen if we shut down the entire cluster in our main region, or whether the pseudo primary's region could take over effectively, because we have not yet run this scenario.

Chaos Engineering is when you intentionally break your System components and see how effectively your system responds and manages the attack.

Other Scenarios

You can plan many kinds of attacks: on the infra side, increasing network latency, breaking the network between clusters and systems, failing storage systems (S3, Kafka, etc.), security-related attacks such as DDoS, and so much more..!
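As one small example of such an attack, here is a minimal sketch of latency injection at the application level: a decorator that delays calls to a function by a random amount. The `inject_latency` decorator and `fetch_user` function are hypothetical; dedicated chaos tools usually inject latency at the network layer instead.

```python
import functools
import random
import time


def inject_latency(min_seconds, max_seconds, probability=1.0,
                   rng=None, sleep=time.sleep):
    """Decorator that delays a call by a random amount with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(rng.uniform(min_seconds, max_seconds))
            return func(*args, **kwargs)
        return wrapper
    return decorator


# For the demo, record the delays instead of really sleeping.
delays = []


@inject_latency(0.1, 0.5, probability=1.0, rng=random.Random(42), sleep=delays.append)
def fetch_user(user_id):
    """Hypothetical backend call that we want to slow down."""
    return {"id": user_id}


result = fetch_user(7)
print("result:", result)
print("injected delay (s):", delays)
```

Passing `sleep` and `rng` in as parameters keeps the attack repeatable and lets you test the wrapper without waiting.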

Which Companies Do Chaos Engineering

Many big companies like Twilio, Netflix, LinkedIn, Facebook, Google, Microsoft, Amazon, and many others are already practising Chaos Engineering in their daily product development. The list is always growing with multiple new product companies and startups joining in..!

Conclusion

Never rely 100% on any one component in your system, no matter how large, fault tolerant or distributed it is. Always make sure that your systems actually stay consistent, actually respond to and mitigate failures, and do not affect end users much..! Predict, prepare and always be ready..!

P.S. To learn more about such engineering stacks and concepts, you can check my Live Online Sessions and register for them..!

Prakash Upadhyay

Full Stack Developer | React.js, Node.js, AWS | Patent Holder (USPTO) | Former Software Developer @ (Gojek, PayTM Insider, InkersAI, SAP Labs)

2 years ago

This is really interesting! What's better is, correct me if wrong, we can do chaos engineering while we're still building the App/Setting the infra.
