What's Chaos Engineering - Why is it so important?


With the rise of microservices, cloud architecture and distributed systems, every company now has multiple services and components in play. Everywhere you look, people talk about building "microservices", breaking things down into small components, and using multiple cloud services, technologies and frameworks. It's getting more complicated day by day..!

What happens when something fails in your systems? How can we predict failures beforehand? How costly are these failures? Let's see..!


Before going ahead, you can also check my Live Online Sessions on learning multiple tech stacks and System Design concepts and register for those..!

A Look Back at Major Downtimes at Huge Companies

In December 2021, the AWS us-east-1 region had a HUGE outage, taking a large chunk of the internet down with it..! As a result, services like Netflix, Twitter, Coinbase, Bitbucket and many more were disrupted.

Another famous incident is the single IT failure that stranded tens of thousands of British Airways (BA) passengers in May 2017 and cost the company 80 million pounds ($102.19 million USD).

The most famous recent downtime was Facebook's, when an outage also took down their own internal systems, including authentication (user login), which reportedly even kept engineers from getting into the datacenter to fix the problem..!


As you can see, no single component in this huge, complex stack gives you 100% reliability, and no matter how expensive your hardware is, it can eventually fail. And even a 1-hour downtime can hit a company's revenue hard..!

Predicting Outages Even Before They Happen

Chaos Engineering is a way of finding failures before they become huge outages..! By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. How..? Let's see..!

Chaos Engineering and its Approach

Chaos Engineering is a practice in which we intentionally break parts of our systems and see how the system responds to and mitigates these failures. Let's take an example -


Imagine we are building a small application involving these components -

  1. Web application, Android app and an iOS app.
  2. Your API backend in any language.
  3. An application database - let's say MySQL.
  4. Deploying on AWS cloud - let's say on EC2.
  5. Using API gateway and Application Load Balancers.
  6. Using Kafka / AWS Queue services for some background tasks.
  7. And sooooo many more techs...!

Let's pick just one of these components to keep the example simple: our MySQL database. Imagine it is deployed like this -


We have our primary instance with 2 replicas in our AWS Region A, as well as a Pseudo Primary and Replicas in another AWS Region B.

In Chaos Engineering, we -

  1. First plan our experiment - What to target, how to target it and so on...
  2. Contain the blast radius - Execute the smallest test that will teach us something - a bug, a failure, anything.
  3. Scale or squash - Found an issue? Job well done..! Otherwise, increase the blast radius until we are at full scale.
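
The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `Experiment`, `run_experiment` and `fake_attack` are hypothetical names, and the "attack" is a stand-in function rather than a real fault injected into real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """A planned chaos experiment (all names here are hypothetical)."""
    target: str                         # what to attack, e.g. "mysql-replica"
    blast_radii: List[int]              # instances to hit at each step, smallest first
    attack: Callable[[str, int], bool]  # returns True if the system stayed healthy


def run_experiment(exp: Experiment) -> dict:
    """Start with the smallest blast radius and grow it until an issue shows up."""
    for radius in exp.blast_radii:
        healthy = exp.attack(exp.target, radius)
        if not healthy:
            # Squash: we learned something, so stop here and fix it first.
            return {"found_issue": True, "radius": radius}
    # Scale: every step passed, including the largest one we planned.
    return {"found_issue": False, "radius": exp.blast_radii[-1]}


def fake_attack(target: str, instances_down: int) -> bool:
    # Stand-in for a real attack: pretend the cluster copes with one lost replica only.
    return instances_down <= 1


result = run_experiment(Experiment("mysql-replica", [1, 2, 3], fake_attack))
print(result)  # {'found_issue': True, 'radius': 2}
```

In a real setup the `attack` callable would trigger the fault and then check health signals; here it just returns a boolean so the control flow is easy to follow.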


Coming Back to Our Example

Let's say we plan to target our MySQL cluster. We start small. First, let's shut down one of our replicas and see how the system responds. Not a big deal: our cluster removes the old instance, brings up another replica, copies data from the primary, and the system is back live..!
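
Here is a minimal sketch of that first experiment, using a toy in-memory model of the cluster. The `Cluster` class and its methods are made up for illustration; they are not a MySQL or AWS API, and a real experiment would stop an actual replica instance instead.

```python
class Cluster:
    """Toy model of a primary with replicas (hypothetical, not a real MySQL API)."""

    def __init__(self, desired_replicas: int = 2):
        self.desired_replicas = desired_replicas
        self.replicas = [f"replica-{i}" for i in range(1, desired_replicas + 1)]
        self._next_id = desired_replicas + 1

    def kill_replica(self, name: str) -> None:
        # The "attack": remove one replica from the cluster.
        self.replicas.remove(name)

    def reconcile(self) -> None:
        # The self-healing step: clone new replicas until we are back at the desired count.
        while len(self.replicas) < self.desired_replicas:
            self.replicas.append(f"replica-{self._next_id}")
            self._next_id += 1


cluster = Cluster(desired_replicas=2)
cluster.kill_replica("replica-1")
print(cluster.replicas)  # ['replica-2']

cluster.reconcile()
print(cluster.replicas)  # ['replica-2', 'replica-3']
```

The check that matters is the last one: after the attack, the cluster should get back to its desired replica count without anyone stepping in.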

Scaling Our Experiment

Once we have confirmed that our replica failover works consistently, we know that a new clone will be created, but we don't know the mean time it takes from experiencing a failure to effectively adding a clone back to the cluster. Measuring that is our second experiment.
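
One way to get that number is to repeat the replica shutdown several times and record when the failure was detected and when the new clone joined the cluster. The sketch below assumes we already have those timestamps; the values in `trials` are made-up sample data, not real measurements.

```python
from statistics import mean

# (failure_detected_at, clone_back_in_cluster_at) in seconds.
# Made-up sample data for illustration only.
trials = [(0.0, 182.0), (0.0, 240.0), (0.0, 205.0)]


def recovery_times(trials):
    """Time from failure to the clone being back in the cluster, per trial."""
    return [end - start for start, end in trials]


def mean_time_to_recover(trials):
    """Average recovery time across all trials."""
    return mean(recovery_times(trials))


print(mean_time_to_recover(trials))  # 209.0
print(max(recovery_times(trials)))   # 240.0 (worst case seen so far)
```

Looking at the worst case next to the mean is useful here, because a single slow clone is exactly the kind of surprise the experiment is meant to surface.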

Once we have measured that, we know we will get an alert if the cluster still has only one replica after 5 minutes, but we don't know whether our alerting threshold should be adjusted to prevent incidents more effectively.
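
To reason about the threshold, it can help to replay a recorded timeline of replica counts against the alert rule. This is a rough sketch: `first_alert_time`, its parameters and the `samples` timeline are all hypothetical, and a real monitoring system would express the same rule in its own configuration.

```python
from typing import List, Optional, Tuple


def first_alert_time(
    samples: List[Tuple[int, int]],
    min_replicas: int = 2,
    hold_seconds: int = 300,
) -> Optional[int]:
    """Return the timestamp at which the alert would fire, or None if it never fires.

    samples: (timestamp_in_seconds, replica_count) pairs in time order.
    The alert fires once the count has stayed below min_replicas for hold_seconds.
    """
    degraded_since = None
    for ts, replicas in samples:
        if replicas < min_replicas:
            if degraded_since is None:
                degraded_since = ts
            if ts - degraded_since >= hold_seconds:
                return ts
        else:
            # Cluster recovered, so reset the timer.
            degraded_since = None
    return None


# Made-up timeline: a replica is lost at t=60 and comes back at t=420.
samples = [(0, 2), (60, 1), (120, 1), (180, 1), (240, 1), (300, 1), (360, 1), (420, 2)]

print(first_alert_time(samples))                    # 360
print(first_alert_time(samples, hold_seconds=120))  # 180
```

With the 5 minute threshold the alert only fires a minute before the replica is back; replaying the same timeline with a shorter hold shows how much earlier we would have known.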

Next, if we shut down both replicas of a cluster at the same time, we don't know exactly how long it would take, on a Monday morning, to clone two new replicas off the existing primary. But we do know we have a pseudo primary and two replicas in the other region, which will also have the transactions.

Once we have worked through all the failures above and tested our systems against them, we still don't know exactly what would happen if we shut down the entire cluster in our main region, or whether the pseudo primary's region would be able to fail over effectively, because we have not yet run this scenario.

Chaos Engineering is when you intentionally break your system's components and see how effectively your system responds to and handles the attack.

Other Scenarios

You can plan many kinds of attacks: on the infra side, increasing network latency, breaking the network between clusters and systems, failing storage and messaging systems (S3, Kafka, etc.), security-related attacks such as a DDoS attack, and so much more..!
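
As a small illustration of the latency and failure attacks, here is a sketch of a decorator that injects faults into a function call. Everything in it is an assumption made for the example: `inject_faults`, `fetch_order` and the chosen numbers are hypothetical, and dedicated chaos tooling would usually inject these faults at the network or infrastructure level rather than in application code.

```python
import random
import time
from functools import wraps


def inject_faults(latency_s: float = 0.0, failure_rate: float = 0.0, rng=None):
    """Wrap a function so each call is delayed and may fail (illustration only)."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # simulated network latency
            if rng.random() < failure_rate:
                raise ConnectionError("injected fault")  # simulated network break
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(latency_s=0.05, failure_rate=0.0)
def fetch_order(order_id: int) -> dict:
    # Stand-in for a call to a downstream service.
    return {"id": order_id, "status": "ok"}


start = time.monotonic()
order = fetch_order(7)
elapsed = time.monotonic() - start
print(order, f"took {elapsed:.2f}s")
```

Setting `failure_rate` above zero makes a share of calls raise `ConnectionError`, which is a quick way to see whether the caller's retries and timeouts behave as expected.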


Which Companies Do Chaos Engineering

Many big companies like Twilio, Netflix, LinkedIn, Facebook, Google, Microsoft, Amazon, and many others are already practising Chaos Engineering in their daily product development. The list keeps growing, with new product companies and startups joining in..!

Conclusion

Never ever rely 100% on any one component in your system, no matter how large, fault tolerant or distributed it is. Always make sure that your components are actually consistent, and that your systems actually respond to and mitigate failures without affecting end users much..! Predict, prepare and always be ready..!


P.S. To learn more such engineering stacks and concepts, you can check my Live Online Sessions and register for them..!


Prakash Upadhyay

Full Stack Developer | React.js, Node.js, AWS | Patent Holder (USPTO) | Former Software Developer @ (Gojek, PayTM Insider, InkersAI, SAP Labs)

2 years ago

This is really interesting! What's better is, correct me if wrong, we can do chaos engineering while we're still building the App/Setting the infra.


More articles by Shrey Batra
