Chaos Engineering on AWS: How AWS FIS Enhances System Resilience
Suppose you have an application, whether in the cloud or on-premises. You deploy the front end and back end and promote it to the production environment. You believe everything is fine and the application will work as expected. Then your app suddenly crashes with an error you didn't anticipate. So how can we prevent this kind of damage to our apps?
Today I want to talk about a topic that doesn't get much attention but is actually critical: Chaos Engineering. So what is Chaos Engineering?
Chaos engineering is a formalized approach that uses fault injection experiments to create real-world conditions needed to understand how your system will react to unknowns and build confidence in the system’s resiliency and security.
Modern applications can have multiple components, including web, API, application, and data persistence layers. To respond to potential security events, you must understand the failure scenarios across each component and their downstream impacts. You run experiments, and the results show whether each component of your app really behaves as expected under failure, instead of you assuming that it does.
Let's clarify this with an example. You have an application deployed on EC2 with autoscaling. You want to ensure that every instance created by the autoscaling mechanism works correctly and that there will be no issues in production.
So how would you test your autoscaling deployment? The first option is to test manually: shut down one of the servers and check whether the whole deployment, including your launch template, user-data scripts, and so on, still works. But as you know, every manual operation has some margin of error. For this, AWS introduced a fully managed chaos engineering service named FIS, or Fault Injection Service.
AWS FIS is a fully managed service for running fault injection experiments. FIS supports multiple fault injection actions, such as injecting API errors, restarting instances, running scripts on instances, disrupting network connectivity, EBS and EC2 actions, EKS pod I/O stress, pausing S3 bucket replication, and more.
And what is fault injection? Fault injection is the process of stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling. From observing how the system responds, you can then implement improvements.
What about the experiment? To inject faults and test your workload, you first define an experiment template for it, for example one that stops or starts an EC2 instance identified by a given ARN or instance ID. It's better to run this experiment template in your staging environment first rather than in production. By running experiments in staging, you can see how your system will likely react in production while earning trust within your organization.
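To make the idea of an experiment template more concrete, here is a minimal Python sketch that builds the request body for a template that stops one EC2 instance and starts it again after 5 minutes. The function name, the target and action labels, and the role ARN are placeholders I made up for illustration. The field names follow the FIS CreateExperimentTemplate API as I understand it, so please check them against the current documentation before relying on them.

```python
import uuid


def build_stop_instance_template(instance_arn: str, role_arn: str) -> dict:
    """Build a request body for the FIS CreateExperimentTemplate API.

    The labels "stagingInstance" and "stopInstance" are arbitrary names
    chosen for this sketch; FIS only requires that they match each other.
    """
    return {
        "description": "Stop one staging instance, restart it after 5 minutes",
        "roleArn": role_arn,
        # No stop condition in this sketch; a CloudWatch alarm could be used.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "stagingInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "stagingInstance"},
            }
        },
    }


def create_template(template: dict) -> str:
    """Send the template to FIS (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    response = boto3.client("fis").create_experiment_template(
        clientToken=str(uuid.uuid4()), **template
    )
    return response["experimentTemplate"]["id"]
```

Keeping the builder separate from the API call means you can review or print the template in staging before anything is created in your account.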
OK, it's POC time. Here is my scenario for testing FIS on AWS. I have one VPC with 2 private and 2 public subnets, connected to the internet via an internet gateway (IGW) and a NAT gateway. We have a CloudWatch alarm on CPU utilization: if the CPU percentage goes above 75%, the alarm fires and its alarm action launches a new instance. We will run a CPU stress test on the EC2 instances using FIS and AWS Systems Manager.
We have an EC2 instance running Ubuntu 22.04 on the t3.medium instance type. For autoscaling, I have created a launch template with the details below.
Based on this launch template, we then create an ASG with the details below.
For this scenario, I didn't choose any integration such as ALB or VPC Lattice. For automatic scaling, I chose the option based on CPU utilization.
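The scale-out rule from the scenario (CPU above 75% launches a new instance) can be sketched in Python as two request bodies: one for a simple scaling policy and one for the CloudWatch alarm that triggers it. I set mine up in the console, so the policy name, alarm name, cooldown, and 5-minute period here are assumptions for illustration, not the exact values from my demo.

```python
def build_scale_out_policy(asg_name: str) -> dict:
    """Arguments for the Auto Scaling put_scaling_policy call: add 1 instance."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "scale-out-on-high-cpu",  # hypothetical name
        "PolicyType": "SimpleScaling",
        "AdjustmentType": "ChangeInCapacity",
        "ScalingAdjustment": 1,
        "Cooldown": 300,
    }


def build_cpu_alarm(asg_name: str, policy_arn: str, threshold: float = 75.0) -> dict:
    """Arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"{asg_name}-cpu-high",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 300,  # seconds; matches basic (5-minute) monitoring
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [policy_arn],
    }


def apply_scaling_rule(asg_name: str) -> None:
    """Create the policy and the alarm (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    policy = boto3.client("autoscaling").put_scaling_policy(
        **build_scale_out_policy(asg_name)
    )
    boto3.client("cloudwatch").put_metric_alarm(
        **build_cpu_alarm(asg_name, policy["PolicyARN"])
    )
```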
Now we are ready to set up FIS. First, type FIS in the console search box, open the service, and then click Create Experiment from Scenario.
You can create a custom experiment or select one from the existing scenario library. For our scenario, I will continue with EC2 Stress: CPU.
You can see all of the details about the CPU stress scenario below. As the prerequisites show, the SSM agent must be installed on the instances, stress-ng is used for the stress test, and the instances need network access. Add a tag named EC2StressCPU with a value of Allowed to each instance you would like to be affected, or override this tag with one of your own.
EC2 tag -> Key: EC2StressCPU and Value: Allowed
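If you prefer to add the tag from code instead of the console, a small Python sketch could look like this. The function names and the instance ID in the usage example are placeholders of mine; the call itself is the EC2 create_tags operation in boto3.

```python
def build_tag_request(instance_ids, key="EC2StressCPU", value="Allowed") -> dict:
    """Arguments for the EC2 create_tags call that opts instances in."""
    return {
        "Resources": list(instance_ids),
        "Tags": [{"Key": key, "Value": value}],
    }


def tag_instances(instance_ids) -> None:
    """Apply the tag (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("ec2").create_tags(**build_tag_request(instance_ids))


# Example with a made-up instance ID:
# tag_instances(["i-0123456789abcdef0"])
```

Keep in mind that tagging running instances this way does not tag instances the ASG launches later. For those, the tag would need to come from the launch template or from an ASG tag that is propagated at launch.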
The SSM agent is already installed on the Ubuntu server. You can check with this command.
sudo snap list amazon-ssm-agent
Select the scenario and click Create template with scenario. You can choose whether the experiment targets a single account or multiple accounts. To grant multi-account permissions, you need to set up roles in each target account. For details about the permissions, you can check this link.
In the 2nd step, we can see the details of the actions and targets, which refer to EC2 instances with the specified tag values.
An experiment can contain one or more actions. Above, the scenario selects 3 CPU stress actions, each for a 5-minute interval, but for testing purposes you can keep a single action. In the target section, you can see all of the details about the target we defined earlier: EC2 instances are selected based on specific tags.
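For readers who want to see roughly what the console builds here, this is a Python sketch of a template with a single CPU stress action and a tag-based target. It assumes the AWS-owned SSM document AWSFIS-Run-CPU-Stress and its DurationSeconds, LoadPercent, and InstallDependencies parameters. The labels "taggedInstances" and "cpuStress" are names I chose, and the real scenario template may differ in its details.

```python
import json


def build_cpu_stress_template(role_arn: str, region: str, minutes: int = 5) -> dict:
    """Build a one-action FIS template that runs CPU stress through SSM."""
    document_arn = f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress"
    document_parameters = {
        "DurationSeconds": str(minutes * 60),
        "LoadPercent": "100",
        "InstallDependencies": "True",  # lets the document install stress-ng
    }
    return {
        "description": f"CPU stress for {minutes} minutes on tagged instances",
        "roleArn": role_arn,
        "stopConditions": [{"source": "none"}],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"EC2StressCPU": "Allowed"},
                "filters": [{"path": "State.Name", "values": ["running"]}],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "cpuStress": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": document_arn,
                    # FIS expects the document parameters as a JSON string.
                    "documentParameters": json.dumps(document_parameters),
                    "duration": f"PT{minutes}M",
                },
                "targets": {"Instances": "taggedInstances"},
            }
        },
    }
```

The action duration and DurationSeconds are derived from the same value on purpose, so the FIS action does not end before the stress command finishes.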
In the 3rd step, you can create a new role or choose an existing one that FIS uses to conduct experiments on your behalf.
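Whichever option you choose, the role has to trust the FIS service so that FIS can assume it. A minimal Python sketch of that trust policy is below. The role name is a placeholder, and the permissions policy that allows the SSM and EC2 actions is deliberately left out because it depends on your experiment.

```python
import json


def build_fis_trust_policy() -> dict:
    """Trust policy that lets the FIS service assume the experiment role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_fis_role(role_name: str = "fis-demo-role") -> str:
    """Create the role (assumes boto3 and AWS credentials); returns its ARN."""
    import boto3  # imported lazily so the builder works without boto3

    response = boto3.client("iam").create_role(
        RoleName=role_name,  # hypothetical name
        AssumeRolePolicyDocument=json.dumps(build_fis_trust_policy()),
    )
    return response["Role"]["Arn"]
```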
In the 4th step, we have some options.
Review all of the deployment details and click on Create experiment template.
Ok, everything is ready and we can start the experiment.
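I started the experiment from the console, but the same step can be sketched in Python: start the experiment from its template ID and poll until it reaches a final state. The function name, the polling interval, and the set of final states are my assumptions. The client is passed in as a parameter, so in real use it would be boto3.client("fis").

```python
import time
import uuid

# States after which the experiment no longer changes (assumed set).
FINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id: str, poll_seconds: float = 15.0,
                   sleep=time.sleep) -> str:
    """Start an FIS experiment and wait for it; returns the final status."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    while True:
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in FINAL_STATES:
            return status
        sleep(poll_seconds)


# Example (requires boto3 and AWS credentials; the template ID is made up):
# import boto3
# print(run_experiment(boto3.client("fis"), "EXT1a2b3c4d5e6f7"))
```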
As you can see below, the experiment started, and the command is running on the server.
Once the experiment has run, the CloudWatch alarm takes action and, based on the predefined settings, a new instance is created by the Auto Scaling policy.
Conclusion:
AWS FIS has lots of features, such as stress tests for CPU or memory, network latency tests, and so on. But the important point is that you have to define the right experiment and build the template around your app's requirements. This demo and article are just a small sample of what AWS FIS can do. We tested how AWS FIS and Systems Manager put CPU stress on selected EC2 instances and how Auto Scaling takes action during the stress test. You can review the details of all the available actions by following this link.
References: