Chaos with AWS
Sridhar C R
Senior Software Engineer | Python, JavaScript | AWS | Microservices | Design Patterns
Problem Statement
After spending a good amount of time feature-testing a large cloud application, most of us believe it will perform well in production. In practice, it may not perform as expected or as stated in the SLA. Under ideal conditions it will do fine, but what if conditions are not ideal?
Solution
We want to build applications that are resilient to failures, efficient, low in latency, sound in integrity, and self-sustaining. To build them that way, we need chaos engineering.
Bit of a background
What is Chaos Engineering / Testing?
"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."
Principles of Chaos Testing
The principles of chaos testing are as follows,
Origin Story
When Netflix was migrating to the AWS cloud, it came up with this intelligent testing strategy, which helped it resolve issues and build a strong system.
In 2010, development and operations teams at Netflix started the process of moving their entire infrastructure over to AWS (Amazon Web Services). At the time, the team at Netflix quickly realized their existing infrastructure would not allow for the scalability that they’d eventually need, so they made the intimidating decision to migrate everything to Amazon’s cloud-based AWS in a monolith-to-microservice transition.
During this time, Netflix established two principles learned from the process of moving over their entire infrastructure while minimizing the impact to its millions of users:
Chaos Monkey: A resiliency tool that helps applications tolerate random instance failures.
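To make the idea of random instance failures concrete, here is a minimal toy sketch in Python. It is not Netflix's Chaos Monkey implementation: the fleet dictionary, the function names, and the health rule (the service counts as up while at least one instance is running) are all assumptions made for illustration.

```python
import random


def pick_victim(instances, rng):
    """Pick one running instance at random, or None if nothing is running."""
    running = [name for name, state in instances.items() if state == "running"]
    return rng.choice(running) if running else None


def run_chaos_round(instances, rng):
    """Terminate one random running instance and report whether the service survives."""
    victim = pick_victim(instances, rng)
    if victim is not None:
        instances[victim] = "terminated"
    # Toy health rule (an assumption): the service is up while any instance runs.
    service_up = any(state == "running" for state in instances.values())
    return victim, service_up


if __name__ == "__main__":
    # Hypothetical three-instance fleet; names are illustrative only.
    fleet = {"web-1": "running", "web-2": "running", "web-3": "running"}
    rng = random.Random(42)  # seeded so repeated runs pick the same victims
    for round_no in range(1, 4):
        victim, up = run_chaos_round(fleet, rng)
        print(f"round {round_no}: terminated {victim}, service up: {up}")
```

In a real system the interesting part is what happens between rounds: whether load balancing, auto scaling, or retries keep the service healthy when an instance disappears without warning.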
AWS Fault Injection Simulator
As always, there is an AWS service for this. AWS Fault Injection Simulator (FIS) helps us run chaos tests on applications hosted on Amazon EC2, Amazon EKS, Amazon ECS, and Amazon RDS.
The process is streamlined as an experiment: the outcomes of each experiment are used to resolve problems, and the cycle repeats until the application becomes resilient.
In the experiment template, AWS provides many actions for inducing chaos in our application; some of them are shown in the attached screenshot. Running these simulations manually would be incredibly painful, so thanks to AWS.
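As a rough illustration of what an experiment template can look like in code, the sketch below builds the request for a single action, `aws:ec2:stop-instances`, aimed at one EC2 instance selected by tag. The helper names, the target and action labels (`oneInstance`, `stopInstance`), and the role ARN and tag used in the example are assumptions made for illustration. The stop condition is set to `none` only to keep the sketch short; a real experiment would normally point at a CloudWatch alarm. Creating the template needs `boto3`, AWS credentials, and an IAM role that FIS can assume, so that call is kept in a separate function.

```python
import uuid


def build_stop_instance_template(role_arn, tag_key, tag_value, restart_after="PT5M"):
    """Build the request for an FIS template that stops one tagged EC2 instance."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "description": "Stop one tagged EC2 instance, then start it again",
        "roleArn": role_arn,  # IAM role FIS assumes to run the experiment
        # No automatic stop condition here (an assumption made for brevity).
        "stopConditions": [{"source": "none"}],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration after which FIS starts the instance again
                "parameters": {"startInstancesAfterDuration": restart_after},
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


def create_template(template):
    """Send the template to FIS and return its id (requires boto3 and credentials)."""
    import boto3  # imported here so the builder above works without boto3

    fis = boto3.client("fis")
    response = fis.create_experiment_template(**template)
    return response["experimentTemplate"]["id"]


if __name__ == "__main__":
    # Hypothetical role ARN and tag; replace them with values from your account.
    template = build_stop_instance_template(
        "arn:aws:iam::123456789012:role/fis-experiment-role", "chaos-ready", "true"
    )
    print(sorted(template["actions"]))
```

Keeping the template as plain data makes it easy to review or version-control before anything is actually injected into a running environment.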
Benefits of FIS