What's Chaos Engineering - Why is it so important?
Shrey Batra
CEO @ Cosmocloud | Ex-LinkedIn | Angel Investor | MongoDB Champion | Book Author | Patent Holder (Distributed Algorithms)
With the rise of microservices, cloud architecture and distributed systems, we now have multiple services and components involved in every single company. Everywhere you look, people talk about building "microservices", breaking things down into small components, and using multiple cloud services, technologies and frameworks. It's getting more and more complicated day by day..!
What happens when something fails in your systems? How can we predict failures beforehand? How costly are these failures? Let's see..!
Before going ahead, you can also check my Live Online Sessions on learning multiple tech stacks and System Design concepts and register for those..!
Lookback at Major Downtimes in Huge Companies
In December 2021, the AWS us-east-1 region had a HUGE outage, reportedly taking down almost 30% of the internet with it..! As a result, services like Netflix, Twitter, Coinbase, BitBucket and so many more were affected..!
Another famous incident is the IT failure that stranded tens of thousands of British Airways (BA) passengers in May 2017 and cost the company 80 million pounds ($102.19 million USD).
The most famous recent downtime we saw was Facebook's in October 2021, when a faulty network configuration change cut its datacenters off from the internet. The outage also took down internal tools, which reportedly made it hard even for the engineers to get into the datacenters to fix the problem..!
As you can see, no single component in this huge, complex stack gives you 100% reliability, and no matter how expensive your hardware is, it can eventually fail. And even a 1 hour downtime can hit a company's revenue hard..!
Predicting Outages Even Before They Happen
Chaos Engineering is a way of predicting failures before they become huge outages..! By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. How..? Let's see..!
Chaos Engineering and its Approach
Chaos Engineering is a practice in which we intentionally break our systems and see how they respond to and mitigate these failures. Let's take an example -
Imagine we are building a small application involving these components -
Let's just pick one of these components to keep the example simple: our MySQL database. Imagine we have our MySQL database deployed like this -
We have our primary instance with 2 replicas in our AWS Region A, as well as a Pseudo Primary and Replicas in another AWS Region B.
In Chaos Engineering, we -
Coming Back to Our Example
Let's say we plan to target our MySQL cluster. We start small. First, let's shut down one of our replicas and see how the system responds. Not a big deal: our cluster removes the old instance, brings up another replica, copies data from the primary, and the system is back live..!
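To make this first experiment concrete, here is a minimal sketch in Python. It does not talk to a real MySQL cluster: `ToyCluster`, `kill_replica` and `reconcile` are hypothetical names for an in-memory stand-in, used only to show the shape of the experiment (check the system is healthy, inject one small failure, observe, verify recovery).

```python
import itertools


class ToyCluster:
    """In-memory stand-in for a replicated database cluster (hypothetical, not a MySQL API)."""

    def __init__(self, desired_replicas=2):
        self.desired_replicas = desired_replicas
        self.replicas = []
        self._ids = itertools.count(1)
        self.reconcile()  # start with the desired number of replicas

    def is_healthy(self):
        return len(self.replicas) == self.desired_replicas

    def kill_replica(self, name):
        """Failure injection: remove one replica from the cluster."""
        self.replicas.remove(name)

    def reconcile(self):
        """Self-healing step: add replicas until the desired count is reached."""
        added = []
        while len(self.replicas) < self.desired_replicas:
            name = f"replica-{next(self._ids)}"
            self.replicas.append(name)
            added.append(name)
        return added


def run_replica_experiment(cluster):
    """Kill one replica and report how the cluster reacted."""
    # Only inject failure into a system that is healthy to begin with.
    if not cluster.is_healthy():
        raise RuntimeError("cluster is not healthy, aborting experiment")
    victim = cluster.replicas[0]
    cluster.kill_replica(victim)
    degraded = not cluster.is_healthy()
    replaced_by = cluster.reconcile()
    return {
        "victim": victim,
        "degraded": degraded,
        "replaced_by": replaced_by,
        "recovered": cluster.is_healthy(),
    }


if __name__ == "__main__":
    print(run_replica_experiment(ToyCluster()))
```

In a real setup the `kill_replica` step would be replaced by whatever mechanism you actually use to stop an instance, and `reconcile` would be your cluster's own self-healing, not something the experiment calls itself.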
Scaling our experiment
Once we make sure our replica failover works consistently, we know that a clone will be created, but we don't know the mean time it takes from experiencing a failure to effectively adding a clone back to the cluster. Measuring this is our second experiment.
Once we have that number, we know we will get an alert if the cluster still has only one replica after 5 minutes, but we don't know whether our alerting threshold should be adjusted to prevent incidents more effectively.
Next, if we shut down both replicas of a cluster at the same time, we don't know exactly how long, on average, it would take on a Monday morning to clone two new replicas off the existing primary. But we do know we have a pseudo primary and two replicas in the other region, which will also have the transactions.
Once we have tested our systems against all the failures above, we still don't know exactly what would happen if we shut down an entire cluster in our main region, and we don't know if the pseudo region would be able to fail over effectively, because we have not yet run this scenario.
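One way to get at that mean time is to repeat the experiment and time each recovery. The sketch below is only an illustration: `inject_failure` and `is_healthy` are hypothetical callables you would supply for your own cluster, and the clock and sleep functions are parameters so the helper can be tried without waiting in real time.

```python
import statistics
import time


def time_to_recover(inject_failure, is_healthy, poll_interval=1.0,
                    timeout=600.0, clock=time.monotonic, sleep=time.sleep):
    """Inject one failure and return the seconds until is_healthy() is True again."""
    inject_failure()
    start = clock()
    while not is_healthy():
        if clock() - start > timeout:
            raise TimeoutError("system did not recover within the timeout")
        sleep(poll_interval)
    return clock() - start


def mean_time_to_recover(samples):
    """Mean of the recovery times (in seconds) collected over several runs."""
    return statistics.mean(samples)
```

Running `time_to_recover` several times, for example at different times of day, and passing the results to `mean_time_to_recover` gives the figure the experiment is after. The polling interval limits how precise each measurement can be.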
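A rough way to reason about the alerting threshold is to replay the recovery times you measured against different thresholds and see how often the alert would have fired. This is a sketch under assumptions: `alert_rate` is a hypothetical helper, the 300 second default mirrors the 5 minute alert mentioned above, and the durations in the usage lines are made-up illustrative numbers, not measurements.

```python
def alert_rate(degraded_durations, threshold_seconds=300):
    """Fraction of incidents that stayed degraded long enough to trigger the alert."""
    if not degraded_durations:
        return 0.0
    fired = sum(1 for d in degraded_durations if d >= threshold_seconds)
    return fired / len(degraded_durations)


if __name__ == "__main__":
    # Made-up durations (seconds) the cluster spent with only one replica.
    durations = [120, 240, 360, 600]
    for threshold in (180, 300, 480):
        print(threshold, alert_rate(durations, threshold))
```

If almost every experiment fires the alert, the threshold may be too tight for normal self-healing; if none do, a real incident could go unnoticed for too long.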
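The region level scenario can be sketched the same way. The toy model below is an assumption made for illustration, not how MySQL or AWS implement failover: `Region` and `pick_write_region` are hypothetical names, and "promotion" is just a flag flip. It only shows the question the experiment asks: when Region A goes away, does Region B take over writes?

```python
class Region:
    """Toy model of one region holding a primary or a pseudo primary (hypothetical)."""

    def __init__(self, name, is_primary):
        self.name = name
        self.is_primary = is_primary
        self.up = True


def pick_write_region(regions):
    """Return the region that should accept writes, promoting a standby if needed."""
    # A region that is down can no longer act as primary.
    for region in regions:
        if not region.up:
            region.is_primary = False
    live = [region for region in regions if region.up]
    if not live:
        raise RuntimeError("no region available")
    for region in live:
        if region.is_primary:
            return region
    # No live primary left: promote the first live standby (the pseudo primary).
    live[0].is_primary = True
    return live[0]


if __name__ == "__main__":
    region_a = Region("A", is_primary=True)
    region_b = Region("B", is_primary=False)
    region_a.up = False  # the chaos experiment: shut down the main region
    print(pick_write_region([region_a, region_b]).name)
```

In a real experiment the interesting parts are exactly what this toy skips: how long promotion takes, whether the pseudo primary has all the transactions, and whether clients actually get redirected.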
Chaos Engineering is when you intentionally break your System components and see how effectively your system responds and manages the attack.
Other Scenarios
You can plan many more attacks: on the infra side, increasing network latency, breaking the network between clusters and systems, failing storage systems (S3, Kafka, etc.), security related attacks such as a DDoS attack, and so much more..!
Which Companies Do Chaos Engineering
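As an example of one such attack, added latency can be simulated at the application level by wrapping a call with a delay. This is a minimal sketch, assuming you are experimenting inside your own Python code instead of at the network layer: `with_latency` is a hypothetical decorator, and `fetch_profile` is a made-up function standing in for any call to a downstream service.

```python
import functools
import random
import time


def with_latency(delay_seconds, probability=1.0, sleep=time.sleep, rng=random.random):
    """Decorator that delays a call by delay_seconds with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng() returns a float in [0.0, 1.0), so probability=1.0 always delays
            # and probability=0.0 never does.
            if rng() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_latency(0.2, probability=0.5)
def fetch_profile(user_id):
    """Made-up stand-in for a call to a downstream service."""
    return {"id": user_id}


if __name__ == "__main__":
    print(fetch_profile(42))
```

With the delay in place you can watch whether timeouts, retries and fallbacks in the calling code behave the way you expect.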
Many big companies like Twilio, Netflix, LinkedIn, Facebook, Google, Microsoft, Amazon, and many others are already practising Chaos Engineering in their daily product development. The list is always growing with multiple new product companies and startups joining in..!
Conclusion
Never ever rely 100% on any one component in your system, no matter how large, fault tolerant or distributed it is. Always make sure that your components are actually consistent, and that your systems actually respond to and mitigate failures without affecting the end users much..! Predict, prepare and always be ready..!
P.S for learning more such engineering stacks and concepts, you can check my Live Online Sessions and register for the same..!
Full Stack Developer | React.js, Node.js, AWS | Patent Holder (USPTO) | Former Software Developer @ (Gojek, PayTM Insider, InkersAI, SAP Labs)
2 years ago: This is really interesting! What's even better, correct me if I'm wrong, is that we can do chaos engineering while we're still building the app and setting up the infra.