Chaos with AWS

Sridhar C R

Senior Software Engineer | Python, JavaScript | AWS | Microservices | Design Patterns

Published: September 4, 2022

Problem Statement

After spending a good amount of time feature-testing a large cloud application, most of us believe it will perform well in production. In practice, it may fall short of what is expected or promised in the SLA. Under ideal conditions it performs well, but what happens when conditions are not ideal?

Solution

We want to build applications that are resilient to failures, efficient, low-latency, self-sustaining, and that maintain integrity. Chaos engineering is one way to get there.

Bit of a background

What is Chaos Engineering / Testing?

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

Principles of Chaos Testing

The principles of chaos testing are as follows:

Ensure your system works and define a steady state: To start with, we need to define a steady state in which the system works as expected.
Hypothesize the system’s steady state will hold: Once a steady state has been determined, it must be hypothesized that it will continue in both control and experimental conditions.
Ensure minimal impact to your users: During chaos testing, the goal is to actively try to break or disrupt the system, but it’s important to do so in a way that minimizes the blast radius and any negative impact to your users.
Introduce chaos: Once you are confident that your system is working, your team is prepared, and the blast radius is contained, you can start running your chaos testing applications.
Monitor and repeat: With chaos engineering, the key is to test consistently, introducing chaos to pinpoint any weaknesses within your system. Over time the process becomes self-sustaining, acting like a negative feedback loop that steadily strengthens the system.
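The five principles above can be sketched as a small experiment loop. This is only an illustration under stated assumptions: `ToyService`, `in_steady_state`, `run_experiment`, and the "at least N healthy replicas" definition of steady state are hypothetical names and choices made for this example, not part of any chaos engineering tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: replica name -> healthy flag."""
    replicas: Dict[str, bool] = field(default_factory=dict)

    def healthy_count(self) -> int:
        return sum(1 for ok in self.replicas.values() if ok)


def in_steady_state(service: ToyService, min_healthy: int) -> bool:
    """Principle 1: a measurable definition of 'works as expected' (assumed here)."""
    return service.healthy_count() >= min_healthy


def run_experiment(
    service: ToyService,
    targets: List[str],
    min_healthy: int,
    max_blast_radius: int = 1,
) -> Dict[str, bool]:
    # Principle 3: refuse to disrupt more replicas than the agreed blast radius.
    if len(targets) > max_blast_radius:
        raise ValueError("experiment exceeds the agreed blast radius")

    # Principles 1 and 2: only experiment on a system that is already steady,
    # and hypothesize that it stays steady while the fault is active.
    if not in_steady_state(service, min_healthy):
        return {"ran": False, "hypothesis_held": False}

    # Principle 4: introduce chaos by marking the target replicas unhealthy.
    previous = {name: service.replicas[name] for name in targets}
    for name in targets:
        service.replicas[name] = False

    try:
        # Principle 5: monitor whether the steady state survived the fault.
        held = in_steady_state(service, min_healthy)
    finally:
        # Always roll back so the experiment leaves no lasting damage.
        service.replicas.update(previous)

    return {"ran": True, "hypothesis_held": held}


if __name__ == "__main__":
    service = ToyService({"a": True, "b": True, "c": True})
    print(run_experiment(service, ["a"], min_healthy=2))
```

A result with `hypothesis_held` set to `False` is the useful outcome: it points at a weakness to fix before repeating the experiment.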

Origin Story

When Netflix was migrating to the AWS cloud, it came up with this testing strategy, which helped it resolve issues and build a strong system.

In 2010, development and operations teams at Netflix started the process of moving their entire infrastructure over to AWS (Amazon Web Services). At the time, the team at Netflix quickly realized their existing infrastructure would not allow for the scalability that they’d eventually need, so they made the intimidating decision to migrate everything to Amazon’s cloud-based AWS in a monolith-to-microservice transition.

During this time, Netflix established two principles learned from the process of moving over their entire infrastructure while minimizing the impact to its millions of users:

1. No system should ever have a single point of failure. A single point of failure refers to the possibility that one error or failure could lead to hundreds of hours of unplanned downtime.
2. Never be 100% confident that number one is true. Your team needs an effective way to consistently test and monitor your system to ensure point number one is true (Netflix created Chaos Monkey to help handle this; more on that later).

Chaos Monkey: A resiliency tool that helps applications tolerate random instance failures.

AWS Fault Injection Simulator

As always, there is an AWS service for this: AWS Fault Injection Simulator (FIS) helps run chaos tests against applications hosted on Amazon EC2, Amazon EKS, Amazon ECS, and Amazon RDS.

The process is structured as an experiment: the outcomes of each experiment are used to fix the problems it exposes, and the cycle repeats until the application is resilient.

In the experiment template, AWS provides many actions for inducing chaos in our application; some of them are shown in the attached screenshot. Running these simulations manually would be incredibly painful, so thanks to AWS for automating them.
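As a rough illustration of what an experiment template can look like in code, the sketch below builds the request for one action, `aws:ec2:stop-instances`, and submits it through the boto3 `fis` client. The function names, the target and action names (`oneInstance`, `stopInstance`), and every ARN and tag passed in are placeholders invented for this example. The field names follow the FIS `CreateExperimentTemplate` API as I understand it and should be checked against the current AWS documentation before use.

```python
import uuid
from typing import Any, Dict


def build_stop_instances_template(
    role_arn: str,
    tag_key: str,
    tag_value: str,
    alarm_arn: str,
    restart_after: str = "PT5M",  # ISO 8601 duration: restart after 5 minutes
) -> Dict[str, Any]:
    """Build keyword arguments for fis.create_experiment_template (a sketch)."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Stop one tagged EC2 instance, then restart it",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the actions
        # Safeguard: FIS stops the experiment if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        # Small blast radius: exactly one instance carrying the given tag.
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": restart_after},
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


def create_and_start(template: Dict[str, Any]) -> str:
    """Create the template and start one experiment; returns the experiment id."""
    import boto3  # imported lazily so the builder above works without boto3

    fis = boto3.client("fis")
    created = fis.create_experiment_template(**template)
    template_id = created["experimentTemplate"]["id"]
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    return started["experiment"]["id"]
```

Keeping the template builder separate from the AWS call makes it possible to review or unit-test the experiment definition without credentials; `create_and_start` needs valid AWS credentials and a role that FIS is allowed to assume.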

Benefits of FIS

Improve application performance, resiliency, and observability
Validate how your application performs on AWS
Safeguard fault injection experiments
A fast and easy way to get started with fault injection experiments
Get superior insights by generating real-world failure conditions
