Debug ECS Fargate Memory Leak

Recently, I created a new service using ECS Fargate. Ever since we started dialing up the traffic, we saw signs of a memory leak: memory utilisation kept increasing continuously until the service crashed. (We had step scaling enabled, which replaced the ECS tasks once utilisation crossed a threshold, preventing an outright crash.)

In this article, I will take you through the steps required to analyse a potential memory leak in your system. At a high level, we will cover the following, in order:

  • Log in to the ECS container
  • Take a heap dump
  • Analyse the heap dump

1) Log in to ECS Fargate

To debug the memory issue, you first have to log in to the Fargate container. The simplest way to get a shell in an ECS container is Amazon ECS Exec (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html).

Prerequisites for using ECS Exec

  • Install and configure the AWS CLI
  • Install Session Manager plugin for the AWS CLI

Enabling and using ECS Exec

a) Set up the IAM permissions required for ECS Exec: create the following policy for the heap-dump workflow.

{
   "Version": "2012-10-17",
   "Statement": [
       {
       "Effect": "Allow",
       "Action": [
            "ssmmessages:CreateControlChannel",
            "ssmmessages:CreateDataChannel",
            "ssmmessages:OpenControlChannel",
            "ssmmessages:OpenDataChannel"
       ],
      "Resource": "*"
      }
   ]
}        
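The policy above can also be created from the CLI. This is a minimal sketch, assuming the JSON was saved as ecs-exec-policy.json; the policy name is illustrative, and the script only prints the command unless you set DRY_RUN=0:

```shell
#!/bin/sh
# Create the ECS Exec policy from the JSON document above.
# 'ecs-exec-heap-dump' is an illustrative name; use your own.
POLICY_NAME="${POLICY_NAME:-ecs-exec-heap-dump}"

# Prints commands by default; set DRY_RUN=0 to execute them for real.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# Assumes the policy JSON above was saved to ecs-exec-policy.json.
run aws iam create-policy \
    --policy-name "$POLICY_NAME" \
    --policy-document file://ecs-exec-policy.json
```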

b) Create an admin role and attach the above policy to it. Also, attach the ECSTaskInstanceRole to the role.

c) Use the amazon-ecs-exec-checker script (https://github.com/aws-containers/amazon-ecs-exec-checker) to troubleshoot any issues with the permission setup.

d) Enable execute command on your ECS Service

  • aws ecs update-service --cluster <YOUR CLUSTER> --enable-execute-command --service <YOUR SERVICE> --region <YOUR REGION e.g. us-east-1>

e) ECS Exec cannot be enabled on tasks that are already running, so you will not be able to run commands on the existing tasks. You will need to force a new deployment.

  • aws ecs update-service --force-new-deployment --cluster <YOUR CLUSTER> --service <YOUR SERVICE> --region <YOUR REGION e.g. us-east-1>

f) Finally, open an interactive shell in the container on one of the new tasks

  • aws ecs execute-command --cluster <YOUR CLUSTER> --task <TASK ID> --container web --interactive --command "/bin/sh" --region <YOUR REGION e.g. us-east-1>
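Steps d) to f) can be chained into one helper. This is a sketch, not the exact commands I ran: the cluster, service, and region values are placeholders, `aws ecs wait services-stable` blocks until the forced deployment settles, and the script only prints the commands unless DRY_RUN=0:

```shell
#!/bin/sh
# exec_into_task.sh -- enable ECS Exec, roll the service, then open a shell.
# CLUSTER, SERVICE, and REGION are placeholders for your own values.
CLUSTER="${CLUSTER:-my-cluster}"
SERVICE="${SERVICE:-my-service}"
REGION="${REGION:-us-east-1}"

# Prints commands by default; set DRY_RUN=0 to execute them for real.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# d) turn on execute-command for the service
run aws ecs update-service --cluster "$CLUSTER" --service "$SERVICE" \
    --enable-execute-command --region "$REGION"

# e) the flag only applies to tasks started afterwards, so force a redeploy
run aws ecs update-service --cluster "$CLUSTER" --service "$SERVICE" \
    --force-new-deployment --region "$REGION"

# block until the new tasks are up and the service is stable
run aws ecs wait services-stable --cluster "$CLUSTER" --services "$SERVICE" \
    --region "$REGION"

# f) pick the first task of the service and open an interactive shell
TASK=$(run aws ecs list-tasks --cluster "$CLUSTER" --service-name "$SERVICE" \
    --region "$REGION" --query 'taskArns[0]' --output text)
run aws ecs execute-command --cluster "$CLUSTER" --task "$TASK" \
    --container web --interactive --command "/bin/sh" --region "$REGION"
```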

2) Take a Heap Dump

Now that we are logged in to the Fargate container, the next step is to take the heap dump.

a) Run the ps command to get the Java process ID; in a container it is usually 1.

  • ps -ef | grep java

b) Install a JDK, which provides jmap (pick the package matching your application's Java version; the example below installs Java 7 on Amazon Linux)

  • sudo yum install java-1.7.0-openjdk-devel

c) Take the heap dump using the jmap command

  • jmap -dump:live,format=b,file=Task1.hprof 1
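A note on the flags, plus `jcmd` as a fallback (some JDK builds refuse jmap attachments). The PID and output path mirror the step above, and the sketch only prints the commands unless DRY_RUN=0 is set inside the container:

```shell
#!/bin/sh
# Assumes the Java PID is 1, as found via `ps -ef | grep java` above.
PID="${PID:-1}"
OUT="${OUT:-Task1.hprof}"

# Prints commands by default; set DRY_RUN=0 inside the container to execute.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# 'live' forces a full GC first, so the dump contains only reachable objects;
# drop it if you also want objects that are about to be collected.
run jmap -dump:live,format=b,file="$OUT" "$PID"

# jcmd, shipped with the same JDK, can produce an equivalent dump.
run jcmd "$PID" GC.heap_dump "$OUT"
```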

Now we have the heap dump available on the ECS task. To analyse it, we need to download it to a local machine: we can either use scp, or copy the dump to S3 and download it from there. To copy the heap dump to S3, attach an S3 access policy (such as AmazonS3FullAccess) to the ECS task role so that the container has write access to S3.
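The S3 route looks roughly like this; the bucket name is a placeholder, and the commands are printed rather than executed unless DRY_RUN=0:

```shell
#!/bin/sh
# Move the heap dump through S3; the bucket name is hypothetical.
BUCKET="${BUCKET:-s3://my-heap-dumps}"
DUMP="${DUMP:-Task1.hprof}"

# Prints commands by default; set DRY_RUN=0 to execute them for real.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# inside the container: push the dump to S3
run aws s3 cp "$DUMP" "$BUCKET/$DUMP"

# on your local machine: pull it back down for analysis
run aws s3 cp "$BUCKET/$DUMP" "./$DUMP"
```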

3) Analyse the Heap Dump

Once we have the heap dump available locally, the last step is to analyse it with one of the memory-analysis tools available.

We can use the Eclipse MAT tool (https://www.eclipse.org/mat/) to load the dump, inspect the objects retaining the most memory, and trace the references back to the root cause.

Other alternatives are VisualVM (https://visualvm.github.io/) and JProfiler (https://www.ej-technologies.com/products/jprofiler/overview.html).

Lastly, this can also be automated: a small endpoint or scheduled job that takes a heap dump and copies it to S3 on request lets developers debug memory issues from incremental snapshots of the memory over time.
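As a sketch of that idea (the bucket, interval, and snapshot count are all assumptions), a loop inside the container could capture and upload dumps periodically; by default it only prints what it would do:

```shell
#!/bin/sh
# snapshot_heap.sh -- sketch of periodic heap snapshots uploaded to S3.
# Bucket, interval, snapshot count, and PID are illustrative assumptions.
BUCKET="${BUCKET:-s3://my-heap-dumps}"   # hypothetical bucket
INTERVAL="${INTERVAL:-900}"              # seconds between snapshots
COUNT="${COUNT:-4}"                      # number of incremental snapshots
PID="${PID:-1}"

i=0
while [ "$i" -lt "$COUNT" ]; do
    FILE="/tmp/heap-$(date +%Y%m%dT%H%M%S).hprof"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # dry-run (default): just print what would happen
        echo "jmap -dump:live,format=b,file=$FILE $PID"
        echo "aws s3 cp $FILE $BUCKET/"
    else
        jmap -dump:live,format=b,file="$FILE" "$PID"
        aws s3 cp "$FILE" "$BUCKET/" && rm -f "$FILE"
        sleep "$INTERVAL"
    fi
    i=$((i + 1))
done
```

Comparing successive snapshots in MAT makes it much easier to see which object population is growing between them.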

Hope this helps!

Thanks,

Rohit
