Chaos with AWS

Problem Statement

After spending a good amount of time feature-testing a large cloud application, most of us believe it will perform well in production. Yet it might not perform as expected, or as promised in the SLA. Under ideal conditions it will do fine, but what happens when conditions are not ideal?

Solution

We want to build applications that are resilient to failures, efficient, low in latency, able to maintain integrity, and self-sustaining. Chaos engineering is the practice that helps us get there.

A Bit of Background

What is Chaos Engineering / Testing?

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

Principles of Chaos Testing

The principles of chaos testing are as follows:

  1. Ensure your system works and define a steady state: To start with, we need to define a steady state in which the system works as expected.
  2. Hypothesize the system’s steady state will hold: Once a steady state has been determined, hypothesize that it will continue to hold in both control and experimental conditions.
  3. Ensure minimal impact to your users: During chaos testing, the goal is to actively try to break or disrupt the system, but it’s important to do so in a way that minimizes the blast radius and any negative impact to your users.
  4. Introduce chaos: Once you are confident that your system is working, your team is prepared, and the blast radius is contained, you can start running your chaos experiments.
  5. Monitor and repeat: With chaos engineering, the key is to test consistently, introducing chaos to pinpoint any weaknesses within your system. Over time this becomes a self-sustaining feedback loop: each weakness that is found gets fixed, and the system grows stronger.
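
The five principles can be read as a small loop. The sketch below is only an illustration under stated assumptions: `ToyService`, `run_experiment`, and `experiment_on` are hypothetical names invented here, and the "system" is a list of in-memory replicas rather than real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    steady_after: bool

    @property
    def hypothesis_held(self) -> bool:
        # The hypothesis holds if the steady state survived the fault.
        return self.steady_before and self.steady_during


def run_experiment(steady_state: Callable[[], bool],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> ExperimentResult:
    # Principle 1: confirm the steady state before touching anything.
    before = steady_state()
    if not before:
        # Never inject faults into a system that is already unhealthy.
        return ExperimentResult(False, False, False)
    # Principle 4: introduce chaos.
    inject()
    try:
        # Principle 2: check whether the steady state still holds.
        during = steady_state()
    finally:
        # Principle 3: always undo the fault to limit user impact.
        rollback()
    # Principle 5: monitor again after recovery.
    after = steady_state()
    return ExperimentResult(before, during, after)


class ToyService:
    """Hypothetical service: available while at least one replica is up."""

    def __init__(self, replicas: int) -> None:
        self.healthy: List[bool] = [True] * replicas

    def is_available(self) -> bool:
        return any(self.healthy)

    def kill(self, index: int) -> None:
        self.healthy[index] = False

    def restore(self, index: int) -> None:
        self.healthy[index] = True


def experiment_on(replicas: int) -> ExperimentResult:
    service = ToyService(replicas)
    return run_experiment(
        steady_state=service.is_available,
        inject=lambda: service.kill(0),       # blast radius: one replica
        rollback=lambda: service.restore(0),
    )


print(experiment_on(3).hypothesis_held)  # True: two replicas remain
print(experiment_on(1).hypothesis_held)  # False: the only replica was killed
```

With three replicas the hypothesis holds; with one replica the experiment exposes a weakness, which is exactly the kind of finding step 5 feeds back into the next round.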

Origin Story

When Netflix was migrating to the AWS cloud, it came up with this testing strategy, which helped it resolve issues and build a robust system.

In 2010, development and operations teams at Netflix started the process of moving their entire infrastructure over to AWS (Amazon Web Services). At the time, the team at Netflix quickly realized their existing infrastructure would not allow for the scalability that they’d eventually need, so they made the intimidating decision to migrate everything to Amazon’s cloud-based AWS in a monolith-to-microservice transition.

During this time, Netflix established two principles learned from the process of moving over their entire infrastructure while minimizing the impact to its millions of users:

  1. No system should ever have a single point of failure. A single point of failure refers to the possibility that one error or failure could lead to hundreds of hours of unplanned downtime.
  2. Never be 100% confident that number one is true. Your team needs an effective way to consistently test and monitor your system to ensure point number one is true (Netflix created Chaos Monkey to help with this; more on that below).

Chaos Monkey: A resiliency tool that helps applications tolerate random instance failures.
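
To make the idea concrete, here is a toy simulation of random instance failure. It is not Netflix's implementation: the function names and the fleet layout are hypothetical, and nothing is actually terminated. It only shows how killing one random instance per group reveals which groups are single points of failure.

```python
import random
from typing import Dict, List, Optional


def pick_victim(instances: List[str], rng: random.Random) -> Optional[str]:
    """Pick one instance at random, or None if the group is empty."""
    return rng.choice(instances) if instances else None


def unleash_monkey(fleet: Dict[str, List[str]],
                   rng: random.Random) -> Dict[str, bool]:
    """Simulate losing one random instance per group.

    A group "survives" if at least one instance is still running afterwards.
    The fleet itself is not modified.
    """
    survived = {}
    for group, instances in fleet.items():
        victim = pick_victim(instances, rng)
        remaining = [name for name in instances if name != victim]
        survived[group] = len(remaining) > 0
    return survived


# Hypothetical fleet; the names are made up for illustration.
fleet = {
    "api": ["api-1", "api-2", "api-3"],
    "billing": ["billing-1"],  # only one instance: a single point of failure
}
report = unleash_monkey(fleet, random.Random(42))
print(report)  # {'api': True, 'billing': False}
```

Whichever instance is picked, the `api` group keeps two instances running, while `billing` has no redundancy and goes down.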


AWS Fault Injection Simulator

As always, there is an AWS service for this. AWS Fault Injection Simulator (FIS) helps us run chaos tests against applications hosted on Amazon EC2, Amazon EKS, Amazon ECS, and Amazon RDS.

The process is structured as an experiment: the outcomes of each experiment are used to fix the problems it uncovers, and the cycle repeats until the application is resilient.

In the experiment template, AWS provides many actions for inducing chaos in our application, such as stopping EC2 instances or failing over an RDS cluster. Running these simulations by hand would be incredibly painful, so thanks to AWS for automating them.
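
As a rough sketch of what an experiment template can look like, the snippet below builds one that stops a single tagged EC2 instance for five minutes and halts if a CloudWatch alarm fires. The ARNs, the `ChaosReady` tag, and the helper names are placeholders assumed for illustration; the boto3 call is kept inside a function so nothing touches AWS unless you invoke it with real credentials.

```python
import json
import uuid


def build_stop_instances_template(role_arn: str, alarm_arn: str,
                                  tag_key: str = "ChaosReady",
                                  tag_value: str = "true") -> dict:
    """Return the keyword arguments for an FIS experiment template."""
    return {
        "description": "Stop one tagged EC2 instance for 5 minutes",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the actions
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # keep the blast radius small
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instance after 5 minutes
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Safeguard: FIS stops the experiment if this alarm goes off.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "tags": {"Name": "stop-one-instance"},
    }


def create_and_start(template: dict) -> str:
    """Create the template in FIS and start one experiment from it.

    Requires boto3 and AWS credentials; returns the experiment ID.
    """
    import boto3  # imported lazily so the rest of the sketch runs offline

    fis = boto3.client("fis")
    created = fis.create_experiment_template(
        clientToken=str(uuid.uuid4()), **template
    )
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=created["experimentTemplate"]["id"],
    )
    return started["experiment"]["id"]


# Placeholder ARNs, not real resources.
template = build_stop_instances_template(
    role_arn="arn:aws:iam::123456789012:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-error-rate",
)
print(json.dumps(template, indent=2))
```

Selecting targets by tag and limiting the selection to `COUNT(1)` means only instances that were explicitly opted in can be affected, and only one at a time.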

Benefits of FIS

  • Improve application performance, resiliency, and observability
  • Validate how your application performs on AWS
  • Safeguard fault injection experiments
  • A fast and easy way to get started with fault injection experiments
  • Get superior insights by generating real-world failure conditions
