Chaos Engineering on AWS: How AWS FIS Enhances System Resilience

Emir Öztürk

Lead Cloud Engineer @Commencis

Published: March 3, 2025

Assume that you have an application, whether in the cloud or on-premises. You deploy the front end and back end and want to take them to production. You think everything is fine and the application will work as expected. But suddenly your app crashes with an error you didn't anticipate. So how can we prevent this kind of damage to our apps?

Today I want to talk about a topic that gets little attention but is actually critical: Chaos Engineering. So what is Chaos Engineering?

Chaos engineering is a formalized approach that uses fault injection experiments to create real-world conditions needed to understand how your system will react to unknowns and build confidence in the system’s resiliency and security.

Modern applications can have multiple components, including web, API, application, and data persistence layers. To respond to potential security events, you must understand the failure scenarios across each component and their downstream impacts. Running experiments shows you whether each component of your app actually behaves as expected and follows best practices, instead of leaving you to assume it does.

Let's clarify this with an example. You have an application deployed on EC2 with auto scaling. You want to ensure that every instance created by the auto scaling mechanism works correctly and that there will be no issues in production.

So if you want to test your auto scaling deployment, what is your scenario? The first option is to test manually: shut down one of the servers and then check whether the rest of the deployment (your launch template, user-data scripts, and so on) works as expected. But as you know, every manual operation has some margin of error. For this reason, AWS introduced a fully managed service for chaos engineering named FIS, or Fault Injection Service.

AWS FIS is a fully managed service for running fault injection experiments. FIS supports multiple fault injection actions, such as injecting API errors, restarting instances, running scripts on instances, disrupting network connectivity, EBS and EC2 actions, EKS pod I/O stress, pausing S3 bucket replication, and more.

And what is fault injection? Fault injection is the process of stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling. By observing how the system responds, you can then implement improvements.

How about the experiment? To inject faults and test your workload, you have to define an experiment template for it, for example one that stops or starts an EC2 instance identified by its ARN or instance ID. It's better to run this experiment template in your staging environment first, not in production. By running experiments in staging, you can see how your system will likely react in production while earning trust within your organization.
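To make the idea of an experiment template more concrete, here is a minimal sketch in Python of what such a template could look like for stopping a single EC2 instance by ARN. It only builds the request as a plain dictionary; the instance ARN, role ARN, account ID, and the names `oneInstance` and `stopInstance` are placeholders I made up for illustration.

```python
import json
import uuid


def build_stop_instance_template(instance_arn: str, role_arn: str) -> dict:
    """Build keyword arguments for the FIS CreateExperimentTemplate call.

    Nothing is sent to AWS here; the function only assembles the request.
    """
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop one EC2 instance and start it again after 5 minutes",
        "roleArn": role_arn,
        # No stop condition in this sketch; use a CloudWatch alarm for real runs.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "oneInstance": {  # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stopInstance": {  # hypothetical action name
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instance after 5 minutes
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


# Placeholder ARNs, not real resources.
template = build_stop_instance_template(
    "arn:aws:ec2:eu-west-1:111122223333:instance/i-0123456789abcdef0",
    "arn:aws:iam::111122223333:role/fis-experiment-role",
)
print(json.dumps(template, indent=2))
```

With valid credentials, a dictionary like this could be passed to `boto3.client("fis").create_experiment_template(**template)`. Try it against a staging instance first.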

OK, it's POC time. Here is my scenario for testing FIS on AWS. I have one VPC with two private and two public subnets, connected to the internet via an Internet Gateway (IGW) and a NAT Gateway. We have a CloudWatch alarm on CPU utilization: if CPU usage goes above 75%, the alarm fires and its action launches a new instance. We will run a CPU stress test on EC2 instances using FIS and AWS Systems Manager.

We have an EC2 instance running Ubuntu 22.04 on the t3.medium instance type. For auto scaling, I created a Launch Template with the details below.

Based on this Launch Template, we continue by creating an ASG with the details below.

For this scenario, I didn't choose any integration such as ALB or VPC Lattice. For automatic scaling, I chose the option based on CPU utilization.

Now we are ready to set up FIS. First, type FIS in the console search box, then click Create experiment from scenario.
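If you prefer to script the scaling setup, the CPU-based option can be approximated with a target tracking policy. The sketch below builds the arguments for the Auto Scaling `PutScalingPolicy` call. The ASG name and policy name are hypothetical, and I am assuming that a 75% target matches the threshold used in this POC; the console setup above may differ in detail.

```python
def build_cpu_scaling_policy(asg_name: str, target_cpu_percent: float = 75.0) -> dict:
    """Build keyword arguments for the Auto Scaling PutScalingPolicy call."""
    if not 0 < target_cpu_percent <= 100:
        raise ValueError("target_cpu_percent must be in the range (0, 100]")
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "cpu-target-tracking",  # hypothetical policy name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                # Average CPU utilization across the whole ASG
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu_percent,
        },
    }


policy = build_cpu_scaling_policy("fis-demo-asg")  # hypothetical ASG name
print(policy)
```

With credentials in place, this could be applied with `boto3.client("autoscaling").put_scaling_policy(**policy)`. Target tracking creates and manages the underlying CloudWatch alarms for you.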

You can create a custom experiment or select one from the scenario library. For our scenario, I will continue with EC2 Stress: CPU.

You can see all of the details about CPU stress below. As the prerequisites show, the SSM agent must be installed on the instances, stress-ng is used for the stress test, and the instances should have network access. Add a tag named EC2StressCPU with a value of Allowed to each instance you would like to be affected, or override this tag with one of your own.

EC2 tag -> Key: EC2StressCPU, Value: Allowed
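The tag can also be applied from a script. As a rough sketch, the helpers below build the arguments for the EC2 `CreateTags` call and check whether a given tag list would match the scenario's default tag. The instance ID is a placeholder and the function names are my own.

```python
REQUIRED_TAG = {"Key": "EC2StressCPU", "Value": "Allowed"}


def build_tag_request(instance_ids: list) -> dict:
    """Build keyword arguments for the EC2 CreateTags call."""
    return {"Resources": list(instance_ids), "Tags": [dict(REQUIRED_TAG)]}


def is_targeted(instance_tags: list) -> bool:
    """Return True if the tag list contains the exact key/value pair."""
    return any(
        tag.get("Key") == REQUIRED_TAG["Key"]
        and tag.get("Value") == REQUIRED_TAG["Value"]
        for tag in instance_tags
    )


request = build_tag_request(["i-0123456789abcdef0"])  # placeholder instance ID
print(request)
print(is_targeted(request["Tags"]))
print(is_targeted([{"Key": "Name", "Value": "web-1"}]))
```

The request could then be sent with `boto3.client("ec2").create_tags(**request)`. Tag keys and values are case-sensitive, so a typo in either one means the instance is silently left out of the experiment.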

As you can see, the SSM agent is installed on the Ubuntu server. You can check with this command:

sudo snap list amazon-ssm-agent
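Besides checking on the server itself, you may want to confirm that Systems Manager actually sees the instance as online, since FIS sends the stress command through SSM. The sketch below filters a response shaped like the output of the SSM `DescribeInstanceInformation` call. The sample response and instance IDs are invented for illustration.

```python
def online_instance_ids(response: dict) -> list:
    """Return IDs of managed instances whose SSM agent reports Online."""
    return [
        info["InstanceId"]
        for info in response.get("InstanceInformationList", [])
        if info.get("PingStatus") == "Online"
    ]


# Invented sample in the shape of a DescribeInstanceInformation response.
sample_response = {
    "InstanceInformationList": [
        {"InstanceId": "i-0aaaaaaaaaaaaaaa1", "PingStatus": "Online"},
        {"InstanceId": "i-0bbbbbbbbbbbbbbb2", "PingStatus": "ConnectionLost"},
    ]
}
print(online_instance_ids(sample_response))
```

In a real account, the response would come from `boto3.client("ssm").describe_instance_information()`. An instance that is missing from the list usually points to a missing instance profile or missing network access to the SSM endpoints.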

Select the scenario and click Create template with scenario. You can target the current account or multiple accounts. To grant multi-account permissions, you will need to set up roles in each target account. For details about permissions, you can check this link.

In the 2nd step, we can see the details of the actions and targets, which refer to EC2 instances with the specified tag values.

An experiment can have one or more actions. Above, the scenario selects 3 CPU stress actions with a 5-minute duration, but for testing purposes you can keep a single action. In the target section, you can see all of the details about the target that we defined earlier: EC2 instances are selected based on specific tags.

In the 3rd step, you can create a new role or choose an existing one for FIS to conduct experiments on your behalf.

In the 4th step, we have some options:
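For readers who want to see what the actions and targets might look like outside the console, here is a rough Python sketch that assembles a similar structure. It uses the `aws:ssm:send-command` action with the AWS-provided `AWSFIS-Run-CPU-Stress` document. The region, the action names, and the target name are assumptions of mine, and the template generated by the scenario library may contain more parameters than shown here.

```python
import json


def cpu_stress_action(region, minutes, target_name, start_after=None):
    """Build one FIS action that runs the CPU stress SSM document."""
    action = {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
            # Document parameters are passed as a JSON string
            "documentParameters": json.dumps(
                {"DurationSeconds": str(minutes * 60), "InstallDependencies": "True"}
            ),
            "duration": f"PT{minutes}M",  # ISO 8601 duration
        },
        "targets": {"Instances": target_name},
    }
    if start_after:
        action["startAfter"] = list(start_after)  # run after these actions finish
    return action


def build_actions_and_targets(region="eu-west-1", minutes=5):
    """Three chained CPU stress actions against instances selected by tag."""
    targets = {
        "taggedInstances": {  # hypothetical target name
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"EC2StressCPU": "Allowed"},
            "filters": [{"path": "State.Name", "values": ["running"]}],
            "selectionMode": "ALL",
        }
    }
    actions = {
        "stress1": cpu_stress_action(region, minutes, "taggedInstances"),
        "stress2": cpu_stress_action(region, minutes, "taggedInstances", ["stress1"]),
        "stress3": cpu_stress_action(region, minutes, "taggedInstances", ["stress2"]),
    }
    return {"targets": targets, "actions": actions}


spec = build_actions_and_targets()
print(json.dumps(spec, indent=2))
```

Chaining with `startAfter` makes the three actions run one after another instead of in parallel. For a first test, keeping only `stress1` is enough.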

Stop Condition: You set a limit, known as a stop condition, to end the experiment if it reaches the threshold defined by a CloudWatch alarm. If a stop condition is reached during an experiment, you can't resume the experiment.
Report Configuration: AWS FIS generates experiment reports as evidence of resilience testing. Configure report settings so that an FIS experiment report is delivered to an S3 bucket. You can download the PDF once the experiment is completed. FIS generates reports with an associated cost. This part is optional and you can set the details later.
Logs: You can send your experiment logs to S3 or CloudWatch Logs. AWS FIS doesn't charge for sending the logs. However, ingestion and storage charges apply based on the destination.
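The stop condition and logging options above map to two fields of the experiment template. As a minimal sketch, the function below builds both for a CloudWatch alarm and a CloudWatch Logs destination. The ARNs are placeholders, and report configuration is left out because it is optional.

```python
def build_safety_settings(alarm_arn: str, log_group_arn: str) -> dict:
    """Build the stopConditions and logConfiguration parts of a template."""
    return {
        "stopConditions": [
            # The experiment stops if this alarm goes into the ALARM state.
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "logConfiguration": {
            "logSchemaVersion": 2,
            "cloudWatchLogsConfiguration": {"logGroupArn": log_group_arn},
        },
    }


# Placeholder ARNs, not real resources.
settings = build_safety_settings(
    "arn:aws:cloudwatch:eu-west-1:111122223333:alarm:app-error-rate-high",
    "arn:aws:logs:eu-west-1:111122223333:log-group:/fis/experiments:*",
)
print(settings)
```

These two keys can be merged into the arguments of `create_experiment_template`. Pick a stop-condition alarm that reflects real user impact, such as an error rate, and not the CPU alarm that the experiment is supposed to trigger.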

Review all of the details and click Create experiment template.

OK, everything is ready and we can start the experiment.

As you can see below, the experiment started, and the command is running on the server.
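Starting the experiment and watching its state can also be scripted. The sketch below takes an FIS client object (for example `boto3.client("fis")`), starts an experiment from a template ID, and polls until it reaches a terminal state. The function name, polling interval, and the set of terminal states are my own choices for illustration.

```python
import time
import uuid

# States after which the experiment will not change any more (assumed set).
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id, poll_seconds=15.0, max_polls=120):
    """Start an FIS experiment and return its final status string."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    for _ in range(max_polls):
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)

    raise TimeoutError(f"Experiment {experiment_id} did not finish in time")
```

A call could look like `run_experiment(boto3.client("fis"), "EXT...")`, where the template ID is the one shown in the console after the template is created. A result of `stopped` typically means a stop condition was triggered or someone stopped the experiment manually.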

Once the experiment has run, the CloudWatch alarm action fires and, based on the predefined settings, a new instance is created by the Auto Scaling policy.

Conclusion:

AWS FIS has many features, such as CPU and memory stress tests, network latency tests, and so on. The important point is that you have to define the right experiment and build the template around your app's requirements. This demo and article are just a small sample of what AWS FIS can do. We tested how AWS FIS and Systems Manager apply CPU stress to selected EC2 instances and how Auto Scaling reacts during the stress test. You can review the details of all available actions by following this link.
