Cutting Costs: Saving $30k+ Per Month with AWS Infrastructure Optimization

After much procrastination, I've finally motivated myself to write a tech blog—something that has been on my to-do list for quite some time.

In this blog, I'll take you through how we optimized cloud costs at my current company: the strategies we implemented, the challenges we faced, and the significant savings we achieved. So, let's dive right in.

Background:

Post-COVID, companies across industries were forced to optimize their operational costs to ensure long-term sustainability. Our approach involved a comprehensive analysis of our existing infrastructure to identify opportunities for cost savings. This included evaluating our cloud usage, optimizing resource allocation, and adopting policies and solutions that would help the company in the long run.


Optimization Strategies:

To develop effective optimization strategies, we held multiple sessions with our AWS Technical Account Manager. These sessions were instrumental in conducting a comprehensive AWS Well-Architected (WA) review, which gave us a clear picture of our current infrastructure. This analysis allowed us to identify gaps in our processes and offered valuable insights for improvement. One of the key pillars of the AWS Well-Architected Framework is Cost Optimization, and it is the strategies and best practices under that pillar that this article delves into.

Decommissioning Unused VMs:

  • This was a no-brainer. EC2 instances add up to a significant portion of our costs. We compiled a list of virtual machines (VMs) across different environments and identified those that were no longer in use. After finalizing the list, we communicated the decommissioning plan to all stakeholders and secured the necessary approvals before proceeding.
  • The unused VMs were stopped and monitored for 5 days, then terminated. This saved us around $4,000.

Note: Do not terminate instances immediately. The general advice is to keep them in the Stopped state for at least 5 days.
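The stop-then-terminate flow above can be sketched as a small decision helper. This is an illustrative sketch rather than our actual tooling; it simply encodes the 5-day grace period from the note:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Grace period from the note above: keep instances in the
# Stopped state for at least 5 days before terminating them.
GRACE = timedelta(days=5)

def next_action(state: str, stopped_at: Optional[datetime], now: datetime) -> str:
    """Decide the next step for an instance on the decommission list."""
    if state == "running":
        return "stop"           # step 1: stop it and start the clock
    if state == "stopped":
        if stopped_at is not None and now - stopped_at >= GRACE:
            return "terminate"  # grace period elapsed, safe to remove
        return "wait"           # still inside the 5-day window
    return "skip"               # already terminated or in transition

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
print(next_action("stopped", now - timedelta(days=6), now))  # terminate
print(next_action("stopped", now - timedelta(days=2), now))  # wait
```

A scheduled job running this decision against the decommission list keeps the process auditable and removes the temptation to terminate early.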

  • Thanks to our FinOps team, we significantly improved our infrastructure provisioning and approval workflows, making the processes more streamlined and efficient. We recommend a strong approval process around VM provisioning; each provisioning request should include the following:

  1. Reason for the VM provision
  2. Name and department of the requester (for tagging purposes)
  3. Instance type, with supporting analysis and reasoning
  4. Cost of the instance
  5. The expected duration for which the instance will be used
  6. If it is for testing or POC purposes, a VM decommission date
  7. Approvals from the necessary stakeholders



Deletion of Snapshots:

  • Every snapshot in your AWS account incurs a charge based on its size and how long it is stored. The longer a snapshot is retained, the more it costs, so managing and optimizing snapshot storage is crucial to avoid unnecessary expenses. See the AWS EBS snapshot pricing documentation to understand how snapshots are charged.
  • We established our Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for volumes and instances, and after securing approval from our business stakeholders, we proceeded with the cleanup process.
  • We deleted all outdated snapshots and implemented a policy to retain only two snapshots per instance or volume within a 7-day period. This approach ensures that we maintain the necessary backups while minimizing unnecessary storage costs.
  • These efforts saved us another $3,000 per month.
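A retention rule like ours (at most two snapshots per volume, and nothing older than 7 days) can be expressed as a small filter. This is a minimal sketch, assuming snapshot records are dicts shaped loosely like the EC2 DescribeSnapshots response, not our production cleanup job:

```python
from datetime import datetime, timedelta, timezone

def snapshots_to_delete(snapshots, now, keep=2, max_age_days=7):
    """Return snapshot IDs that violate the retention policy.

    Keeps the `keep` newest snapshots per volume, and only if they are
    younger than `max_age_days`; everything else is deletable.
    """
    by_volume = {}
    for snap in snapshots:
        by_volume.setdefault(snap["VolumeId"], []).append(snap)

    doomed = []
    cutoff = now - timedelta(days=max_age_days)
    for snaps in by_volume.values():
        snaps.sort(key=lambda s: s["StartTime"], reverse=True)  # newest first
        for i, snap in enumerate(snaps):
            if i >= keep or snap["StartTime"] < cutoff:
                doomed.append(snap["SnapshotId"])
    return doomed

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
snaps = [
    {"SnapshotId": "snap-1", "VolumeId": "vol-a", "StartTime": now - timedelta(days=1)},
    {"SnapshotId": "snap-2", "VolumeId": "vol-a", "StartTime": now - timedelta(days=3)},
    {"SnapshotId": "snap-3", "VolumeId": "vol-a", "StartTime": now - timedelta(days=10)},
]
print(snapshots_to_delete(snaps, now))  # ['snap-3']
```

In practice AWS Data Lifecycle Manager can enforce this kind of schedule natively; the point here is only to show how simple the policy itself is.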


Deleting S3 Objects and Setting a Retention Policy:

  • We initially lacked a policy for managing our S3 buckets, leading to the accumulation of outdated objects. Upon review, we identified buckets containing obsolete files, such as old log backups. We decided to delete these files and remove the unnecessary buckets. To prevent future clutter, we established an S3 lifecycle policy to automatically delete application log backups older than six months.
  • We also made sure that the lower environments don't have any system or application logs backed up.
  • This saved us another $3,000 per month.
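For reference, a 6-month (180-day) expiration rule can be expressed as a standard S3 lifecycle configuration. The bucket name and `logs/` prefix below are placeholders for illustration, not our real values:

```python
# Lifecycle rule expiring application log backups after ~6 months.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-log-backups",
            "Filter": {"Prefix": "logs/"},   # placeholder prefix
            "Status": "Enabled",
            "Expiration": {"Days": 180},
        }
    ]
}

# With boto3 the rule would be applied like this (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-app-backups",               # placeholder bucket name
#     LifecycleConfiguration=lifecycle_config,
# )
```

Once the rule is in place, S3 deletes the expired objects itself, so the cleanup never depends on someone remembering to run it.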

Deleting the Unattached Volumes:

  • This was low-hanging fruit.
  • We identified the EBS volumes that were unattached and deleted them. The AWS documentation explains how to check an EBS volume's state.
  • This helped us save $1,000.
  • The best practice is to ensure that when an EC2 instance is terminated, any associated EBS volumes that are no longer needed are deleted automatically. You can set this when launching instances by enabling the "Delete on Termination" option.

How to check whether a volume is unattached:

  1. Log in to the AWS Management Console.
  2. Navigate to the EC2 Dashboard.
  3. In the left-hand menu, under Elastic Block Store (EBS), click on Volumes.
  4. Check the State and Attachment Information columns:
     • Available: the volume is unattached.
     • In-use: the volume is attached to an instance.

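The same check can be automated across an account. A minimal sketch, assuming volume records shaped like the EC2 DescribeVolumes response:

```python
def unattached_volumes(volumes):
    """Return IDs of EBS volumes not attached to any instance.

    A volume whose State is 'available' is unattached; 'in-use'
    means it is attached to an instance.
    """
    return [v["VolumeId"] for v in volumes if v["State"] == "available"]

# With boto3 the same list can be fetched server-side
# (requires AWS credentials; shown for reference only):
# import boto3
# ec2 = boto3.client("ec2")
# resp = ec2.describe_volumes(
#     Filters=[{"Name": "status", "Values": ["available"]}]
# )

volumes = [
    {"VolumeId": "vol-1", "State": "in-use"},
    {"VolumeId": "vol-2", "State": "available"},
]
print(unattached_volumes(volumes))  # ['vol-2']
```

Running this on a schedule and reporting the result to the FinOps team turns a one-off cleanup into an ongoing control.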

Right-sizing the VMs:

  • With the support of the FinOps and SysOps teams, we identified several underutilized VMs. We analyzed the memory and CPU utilization of each VM to assess its performance, and it became clear that some VMs were significantly underutilized. Based on this analysis, we resized many VMs and databases to more appropriate instance types.
  • This optimization resulted in substantial cost savings of approximately $10,000.
  • AWS publishes a detailed whitepaper on right-sizing best practices, which is well worth reading.

Note: While this process can be time-consuming and tiring, the impact on cost efficiency is significant.
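A first-pass screen for right-sizing candidates can be as simple as flagging instances whose average CPU and memory utilization both sit well below capacity. The thresholds below are illustrative assumptions, not recommendations; real decisions should also consider peak load and burst headroom:

```python
def underutilized(instances, cpu_threshold=20.0, mem_threshold=30.0):
    """Flag instances whose average CPU *and* memory usage (percent)
    are both below the given thresholds over the observation window.

    Thresholds are placeholders; tune them to your own workloads.
    """
    return [
        i["id"]
        for i in instances
        if i["avg_cpu"] < cpu_threshold and i["avg_mem"] < mem_threshold
    ]

fleet = [
    {"id": "i-batch", "avg_cpu": 75.0, "avg_mem": 60.0},
    {"id": "i-idle",  "avg_cpu": 4.0,  "avg_mem": 12.0},
]
print(underutilized(fleet))  # ['i-idle']
```

The averages themselves would typically come from CloudWatch metrics; anything this screen flags still needs a human review before resizing.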


Decommissioning the Old Load Balancers:

  • After migrating to EKS, we began utilizing the internal EKS ingress load balancer.
  • The old Application Load Balancers (ALBs) in AWS were no longer in use.
  • We carefully reviewed all application configurations, migrated them to the new load balancer, and decommissioned the outdated ALBs.
  • Given that load balancers are charged for every hour they run, this eliminated unnecessary costs. See the AWS Elastic Load Balancing pricing page to understand how load balancers are charged.
  • This wasn't a big saving, but every penny makes a difference.


Migrating Volumes from gp2 to gp3:

  • This was another no-brainer.
  • AWS offers various types of EBS volumes, including gp2, gp3, io1, and io2. We decided to migrate all our EBS volumes from General Purpose 2 (gp2) to General Purpose 3 (gp3), which brought significant benefits.
  • The gp3 volumes offer a baseline of 3,000 IOPS and higher throughput, providing better performance at a lower cost. Migrating from gp2 to gp3 brought an immediate cost improvement of about 20%.
  • The launch volume for our EKS nodes was initially configured as gp2. We switched it to gp3 so that all new volumes created during node provisioning in the EKS cluster default to gp3.
  • As a result of this migration, we saw a substantial improvement in our batch processing performance. For example, the Bill Dispatch process, which initially took 8 hours, now completes in under 3.5 hours. I’ll be writing a detailed article on this performance improvement soon.
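The roughly 20% storage saving falls straight out of the per-GB prices. The prices below are assumed us-east-1 list prices at the time of writing; check the current EBS pricing page for your region before relying on them:

```python
# Assumed us-east-1 list prices (USD per GB-month) -- verify against
# the current EBS pricing page for your region.
GP2_PER_GB_MONTH = 0.10
GP3_PER_GB_MONTH = 0.08

def monthly_storage_cost(size_gib: int, price_per_gb: float) -> float:
    """Baseline storage cost, ignoring provisioned extra IOPS/throughput."""
    return size_gib * price_per_gb

size = 500  # GiB
gp2 = monthly_storage_cost(size, GP2_PER_GB_MONTH)
gp3 = monthly_storage_cost(size, GP3_PER_GB_MONTH)
print(f"gp2: ${gp2:.2f}, gp3: ${gp3:.2f}, saving {100 * (1 - gp3 / gp2):.0f}%")

# The migration itself is a single in-place call per volume, with no
# downtime (requires AWS credentials; volume ID is a placeholder):
# import boto3
# boto3.client("ec2").modify_volume(VolumeId="vol-placeholder", VolumeType="gp3")
```

Note that gp3 charges separately if you provision IOPS above the 3,000 baseline or extra throughput, so very IOPS-heavy volumes deserve a closer look before migrating.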


Reserved Instances for RDS:

We observed that the RDS instance class for many services had remained stable over time. For such steady-state workloads, AWS offers Reserved Instances for RDS (note that Savings Plans cover compute services such as EC2, Fargate, and Lambda, but not RDS). Committing to reserved capacity for all our RDS instances reduced those costs by a further 30 percent.

Note: If you intend to change your instance class in the near future, it is better to stay on On-Demand pricing.


Future Cost-Optimization Opportunities

EKS Node Right Sizing:

The services running in our EKS cluster seem to be over-provisioned, so there is scope to reduce workload sizes in the cluster.

The pod memory and CPU requests configured for some services are very high. We can compare them against actual usage and right-size them, which could let us drop a few nodes from the K8s cluster.


Implementing Auto Scaling in K8s:

Auto-scaling the nodes in the EKS cluster is another way to optimize costs. This is planned as a future improvement.

Horizontal Pod Autoscaling (HPA) automatically adjusts the number of pod replicas in a deployment based on CPU utilization or other metrics. This ensures that applications have enough resources during peak times while scaling down during periods of low demand.
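For intuition, the core HPA scaling rule documented by Kubernetes reduces to a single line; the real controller layers tolerances, stabilization windows, and min/max replica bounds on top of it:

```python
from math import ceil

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """Core HPA formula from the Kubernetes docs:
    desired = ceil(current_replicas * current_metric / target_metric).
    """
    return ceil(current_replicas * current_metric / target_metric)

# 4 replicas averaging 90% CPU against a 60% target scale out to 6:
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
# 6 replicas averaging 30% CPU against a 60% target scale in to 3:
print(desired_replicas(6, current_metric=30, target_metric=60))  # 3
```

Worth noting: HPA computes the ratio against the pods' *requested* resources, which is why the pod right-sizing work above matters before turning autoscaling on.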

The Cluster Autoscaler works at the node level, scaling the number of nodes in your cluster up or down based on the resource requirements of your pods. When the demand increases, the Cluster Autoscaler adds nodes to handle the load; when demand decreases, it removes underutilized nodes.

Implementing these two mechanisms helps optimize costs by ensuring that you only pay for the compute resources you actually need.


Conclusion

In conclusion, implementing effective optimization strategies and policies can lead to significant long-term cost savings for the company.

It's advisable to include cloud cost optimization as part of the company's OKRs to continuously monitor and manage resource usage efficiently.

