Debug ECS Fargate Memory Leak
Recently, I created a new service using ECS Fargate. Ever since we started dialing up the traffic, we began seeing a memory leak in the system: memory utilisation kept increasing continuously until the service crashed. [We had step scaling enabled, which replaced the ECS tasks after an xx utilisation threshold and thus prevented an actual crash.]
In this article, I will take you through the steps required to analyse a potential memory leak in your system. At a high level, we will cover the following, in order: logging in to the ECS Fargate container, taking a heap dump, and analysing the heap dump.
1) Login to ECS Fargate
To debug the memory issue, you first have to log in to the Fargate container. The simplest way to log in to an ECS container is by using Amazon ECS Exec (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html).
The documentation above covers the prerequisites for using ECS Exec and how to enable and use it. The key steps are:
a) Set up the IAM permissions required for ECS Exec - create a policy that allows the SSM Session Manager channels used by ECS Exec:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*"
        }
    ]
}
b) Create an admin role and attach the above policy to it. Also attach the policy to the ECS task role (ECSTaskInstanceRole).
c) Use the amazon-ecs-exec-checker script (https://github.com/aws-containers/amazon-ecs-exec-checker) to debug any issues with the permission setup.
d) Enable the execute command option on your ECS service (see the CLI sketch after this list).
e) ECS Exec cannot be added to tasks that are already running, so you will not be able to run commands on existing tasks; you will need to force a new deployment.
f) Finally, open an interactive shell into the container on one of the new tasks.
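A minimal sketch of steps c) to f) using the AWS CLI is shown below; the cluster, service, task, and container names are placeholders for your own values.

# c) verify the ECS Exec prerequisites for a given task (script from the amazon-ecs-exec-checker repo)
./check-ecs-exec.sh my-cluster <task-id>

# d) + e) enable execute command on the service and force a new deployment
aws ecs update-service \
    --cluster my-cluster \
    --service my-service \
    --enable-execute-command \
    --force-new-deployment

# f) open an interactive shell in the container of one of the new tasks
aws ecs execute-command \
    --cluster my-cluster \
    --task <task-id> \
    --container my-container \
    --interactive \
    --command "/bin/sh"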
2) Take Heap Dump
Now that we are logged in to the Fargate container, the next step is to take the heap dump.
a) Run the ps command to get the process ID of the Java process; in a container it is usually 1.
b) Install a JDK (jmap ships with the JDK, not the JRE).
c) Take the heap dump using the jmap command (see the commands sketched after this list).
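A rough sketch of these steps, assuming a Debian-based image and a process ID of 1; adjust the package manager, package name, and PID for your own image.

# a) find the Java process ID (usually 1 in a container)
ps aux | grep java

# b) install a JDK so jmap is available (package name/manager depends on the base image)
apt-get update && apt-get install -y openjdk-17-jdk-headless

# c) dump live objects in binary format to a file
jmap -dump:live,format=b,file=/tmp/heapdump.hprof 1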
Now we have the heap dump available on the ECS task. In order to analyse it, we need to download the dump to a local machine: we can either use the scp command or copy the dump to S3 and download it from there. To copy the heap dump to S3, attach an S3 access policy (for example, the AmazonS3FullAccess managed policy or a scoped-down equivalent) to the ECS task role so that the container has access to S3.
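For example, assuming the AWS CLI is available in the container and my-heapdump-bucket is a placeholder bucket name, the copy could look like this:

# inside the container: upload the dump to S3
aws s3 cp /tmp/heapdump.hprof s3://my-heapdump-bucket/heapdumps/heapdump-$(date +%s).hprof

# on your local machine: download it for analysis
aws s3 cp s3://my-heapdump-bucket/heapdumps/heapdump-<timestamp>.hprof ./heapdump.hprof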
3) Analyse the Heap Dump
Once we have the heap dump available locally, the last step is to analyse it using one of the memory analysis tools available.
We can use the Eclipse Memory Analyzer (MAT) tool (https://www.eclipse.org/mat/) to load the dump, look at the objects retaining the most heap (for example via the dominator tree and the Leak Suspects report), and backtrack to the root cause.
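If the dump is large, MAT can also be run headless to pre-generate the Leak Suspects report; a sketch, assuming the ParseHeapDump.sh script shipped with your MAT installation:

# parse the dump and generate the Leak Suspects report without opening the GUI
./ParseHeapDump.sh ./heapdump.hprof org.eclipse.mat.api:suspects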
Other alternatives are VisualVM (https://visualvm.github.io/) and JProfiler (https://www.ej-technologies.com/products/jprofiler/overview.html)
Lastly, you can also set up an automated path that takes a heap dump and copies it to S3 on an API request, which lets developers debug memory issues by taking incremental snapshots of the memory over time.
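As a sketch of that idea (assuming the Java PID is 1, the AWS CLI is installed in the image, and my-heapdump-bucket is a placeholder bucket), such an endpoint could simply invoke a script like the following inside the container:

#!/bin/sh
# take-heap-dump.sh: capture a heap dump and push it to S3 for offline analysis
set -e

PID=1                                 # Java process ID inside the container (assumption)
BUCKET=my-heapdump-bucket             # placeholder bucket name
FILE=/tmp/heapdump-$(date +%Y%m%d-%H%M%S).hprof

# dump live objects in binary format
jmap -dump:live,format=b,file="$FILE" "$PID"

# upload the snapshot and clean up local disk space
aws s3 cp "$FILE" "s3://$BUCKET/$(hostname)/$(basename "$FILE")"
rm -f "$FILE"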
Hope this helps!
Thanks,
Rohit