AWS Cost Optimization Checklist
Matias Undurraga Breitling
Enterprise Technologist @ AWS | Transformation, Strategic Tech Planning
Managing costs in the cloud is more than just a financial exercise—it’s about creating a sustainable, scalable foundation for your business. Whether you’re running a small startup or a large enterprise, AWS offers a vast array of services and pricing models, making it easy to scale your infrastructure. But with great flexibility comes the challenge of keeping your cloud spending in check.
In this blog I’ve included a comprehensive checklist that you can print, copy, and use as a hands-on tool for your cost optimization journey. Each checklist item is designed to help you evaluate critical aspects of your cloud usage, identify inefficiencies, and implement changes that can result in significant savings.
AWS Cost Optimization Checklist
I. Foundational Cost Management
[ ] Cloud Financial Management: Establish processes, programs, and knowledge to effectively manage cloud costs.
[ ] Consumption-Based Model: Pay only for consumed resources, adjusting usage based on business needs.
[ ] Efficiency Measurement: Track the business value derived from AWS spending to ensure maximum return on investment.
[ ] Right-Sizing: Match resource capacity to actual workload requirements to avoid overspending.
[ ] AWS Discounts and Savings Plans: Utilize Savings Plans and Reserved Instances for discounts on predictable workloads.
[ ] Budgeting and Forecasting: Implement detailed cost tracking, analysis, and forecasting for better budget allocation.
[ ] Regular Usage Reviews: Conduct periodic reviews of resource usage to identify inefficiencies and unused resources.
[ ] Cost Awareness Training: Educate teams to design cost-efficient architectures and promote a culture of cost optimization.
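The Savings Plans and Reserved Instances item comes down to simple arithmetic: a commitment is billed for every hour of the term, used or not, so it only pays off above a certain utilization. Here is a minimal sketch of that break-even calculation, assuming a no-upfront commitment. The function names and the hourly rates are hypothetical placeholders, not AWS prices.

```python
def break_even_utilization(on_demand_rate: float, committed_rate: float) -> float:
    """Fraction of hours a workload must run before a commitment beats on-demand."""
    if on_demand_rate <= 0 or committed_rate < 0:
        raise ValueError("on_demand_rate must be > 0 and committed_rate >= 0")
    return committed_rate / on_demand_rate


def monthly_costs(on_demand_rate: float, committed_rate: float,
                  hours_used: float, hours_in_month: float = 730.0):
    """Return (on_demand_cost, committed_cost) for one month.

    On-demand is billed only for the hours used; the commitment is billed
    for every hour in the month regardless of usage.
    """
    return on_demand_rate * hours_used, committed_rate * hours_in_month


if __name__ == "__main__":
    # Hypothetical rates in USD per hour, for illustration only.
    on_demand, committed = 0.10, 0.06
    print("break-even utilization:", break_even_utilization(on_demand, committed))
    print("costs at 300 h/month:", monthly_costs(on_demand, committed, 300))
```

If a workload runs for a smaller share of the month than the break-even fraction, staying on-demand (or scheduling it off) is likely the cheaper choice.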
II. Compute Optimization
[ ] EC2 Instance Right-Sizing: Analyze instance performance and adjust types/sizes to match workload needs.
[ ] Auto-Scaling: Implement auto-scaling groups to dynamically adjust EC2 instances based on demand.
[ ] Spot Instances: Leverage Spot Instances for fault-tolerant and flexible workloads for significant cost savings.
[ ] Reserved Instances: Purchase Reserved Instances for predictable, steady-state workloads to reduce hourly rates.
[ ] Scheduled Shutdowns: Automate stopping/starting instances in non-production environments during off-hours.
[ ] Burstable Instances: Use T-family instances for workloads with low baseline CPU usage and occasional bursts.
[ ] Load Balancer Optimization: Identify and delete unused or underutilized load balancers.
[ ] AMI Optimization: Deregister unused AMIs and delete associated snapshots.
[ ] Lambda Memory Optimization: Allocate appropriate memory to Lambda functions based on performance needs. See https://github.com/alexcasalboni/aws-lambda-power-tuning and https://aws.amazon.com/blogs/compute/optimizing-your-aws-lambda-costs-part-1/
[ ] Lambda Dependency Management: Remove unnecessary libraries and dependencies from Lambda functions.
[ ] ECS Service Shutdowns: Stop non-essential ECS services during off-peak hours in dev/test environments.
[ ] Fargate for Short-Lived Tasks: Use Fargate for lightweight, short-duration containerized tasks.
[ ] Elastic Beanstalk Instance Optimization: Choose lower-cost instance types in non-critical environments.
[ ] Elastic Beanstalk Auto-Scaling: Utilize auto-scaling to adjust resources based on application load.
[ ] Redshift Cluster Resizing: Use Elastic Resize or Concurrency Scaling to adjust cluster size based on demand.
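[ ] Redshift/OpenSearch Non-Prod Pausing: Schedule non-production Redshift clusters to pause during off-hours. Amazon OpenSearch Service domains have no pause action, so for non-production domains consider taking a snapshot, deleting the domain, and restoring it when needed.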
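[ ] OpenSearch Spot Instances: For self-managed OpenSearch clusters on EC2, run interruption-tolerant data nodes on Spot Instances. Managed Amazon OpenSearch Service domains do not offer Spot pricing.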
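To make the Lambda memory item more concrete: the compute charge scales with memory multiplied by duration, so more memory can be cheaper if the function finishes fast enough. The sketch below picks the cheapest setting from a set of measured durations. The durations and the price per GB-second are hypothetical values chosen for illustration (check current pricing for your region and architecture), and the per-request charge is ignored.

```python
def cost_per_invocation(memory_mb: int, duration_ms: float,
                        price_per_gb_second: float) -> float:
    """Compute-only cost of one invocation: GB x seconds x price."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


def cheapest_memory_setting(measured_ms: dict, price_per_gb_second: float) -> int:
    """Return the memory size (MB) with the lowest compute cost per invocation.

    measured_ms maps memory size in MB to the average duration in ms
    observed at that size.
    """
    return min(
        measured_ms,
        key=lambda mb: cost_per_invocation(mb, measured_ms[mb], price_per_gb_second),
    )


# Hypothetical measurements: average duration (ms) at each memory size (MB).
MEASURED_MS = {128: 1000.0, 256: 450.0, 512: 240.0, 1024: 130.0}
# Hypothetical price in USD per GB-second; not an authoritative figure.
PRICE_PER_GB_SECOND = 0.0000166667

if __name__ == "__main__":
    best = cheapest_memory_setting(MEASURED_MS, PRICE_PER_GB_SECOND)
    print("cheapest memory setting (MB):", best)
```

In practice the durations would come from real measurements, which is what the aws-lambda-power-tuning project linked in the checklist automates.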
III. Storage Optimization
[ ] S3 Lifecycle Policies: Transition objects to cost-effective storage classes (IA, Glacier) based on age/access patterns.
[ ] S3 Versioning Management: Delete old versions of S3 objects while maintaining necessary version history.
[ ] S3 Intelligent-Tiering: Automatically move data between frequent/infrequent access tiers based on usage.
[ ] Data Compression: Compress data in S3 and EFS to reduce storage space.
[ ] S3 Usage Monitoring: Use S3 Storage Lens to identify and delete unused/redundant data.
[ ] S3 Object Expiration: Configure objects to expire and delete automatically when no longer needed.
[ ] S3 Storage Class Selection: Choose the right storage class (Standard, IA, Glacier) based on access patterns.
[ ] gp2 to gp3 Migration: Migrate older gp2 EBS volumes to gp3 for a lower per-GB price and independently configurable IOPS and throughput.
[ ] EBS Volume Cleanup: Delete unattached, orphaned EBS volumes.
[ ] EBS Snapshot Management: Use Amazon Data Lifecycle Manager policies to automate snapshot creation, retention, and deletion so old snapshots do not accumulate.
[ ] EBS Volume Right-Sizing: Adjust EBS volume sizes to meet actual requirements.
[ ] EBS Snapshot Archiving: Utilize the EBS Snapshot Archive feature for rarely accessed snapshots.
[ ] S3 File Consolidation: Combine smaller files into larger ones to reduce the number of S3 PUT requests, which are billed per request.
[ ] ECR Lifecycle Policies: Automatically remove old/unused container images in ECR.
[ ] EFS Lifecycle Management: Transition infrequently accessed files to lower-cost EFS storage tiers.
[ ] EFS File System Cleanup: Regularly audit and remove unused EFS file systems.
[ ] OpenSearch UltraWarm: Store infrequently accessed log data in the UltraWarm tier.
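The S3 lifecycle, versioning, and expiration items can all live in a single lifecycle rule. The sketch below only builds the rule as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; it does not call AWS. The function name, the `logs/` prefix, and every day count are assumptions chosen for illustration.

```python
import json


def build_lifecycle_configuration(prefix: str,
                                  ia_after_days: int = 30,
                                  glacier_after_days: int = 90,
                                  expire_after_days: int = 365,
                                  noncurrent_days: int = 30) -> dict:
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier -> delete."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("day counts must increase: IA < Glacier < expiration")
    return {
        "Rules": [
            {
                "ID": f"tiering-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to cheaper storage classes as they age.
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Delete current versions once they are no longer needed.
                "Expiration": {"Days": expire_after_days},
                # In versioned buckets, also clean up old versions.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Drop abandoned multipart uploads, which are billed as storage.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_lifecycle_configuration("logs/"), indent=2))
```

Reviewing the generated JSON before applying it is a cheap safeguard, since an expiration rule with the wrong prefix deletes data.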
IV. Database Optimization
[ ] RDS Instance Right-Sizing: Adjust RDS instance sizes to match workload requirements.
[ ] RDS Read Replicas: Offload read traffic to read replicas to improve performance and reduce primary instance load.
[ ] RDS Automated Backups: Configure backups with appropriate retention periods.
[ ] RDS Multi-AZ for Production Only: Use Multi-AZ deployments only where high availability is critical.
[ ] Database Engine Tuning: Optimize database parameters to enhance performance and reduce resource use.
[ ] Aurora Serverless: Consider Aurora Serverless for variable or unpredictable workloads.
[ ] Database Connection Management: Use connection pooling or proxy services to efficiently manage connections.
[ ] RDS Cold Data Archiving: Move infrequently accessed data to S3 or Glacier.
[ ] Database Engine Updates: Keep database engines up-to-date for performance improvements.
[ ] RDS Reserved Instances: Purchase Reserved Instances for steady-state RDS workloads.
[ ] DynamoDB On-Demand Capacity: Use on-demand mode for tables with unpredictable traffic.
[ ] DynamoDB Auto-Scaling: Enable auto-scaling to adjust provisioned capacity based on traffic.
[ ] Aurora Auto-Scaling: Configure Aurora Auto Scaling to add or remove Aurora Replicas based on load.
[ ] RDS Non-Prod Instance Shutdown: Schedule non-production instances to shut down during off-hours.
[ ] Aurora Backtrack: Use Backtrack (Aurora MySQL) to rewind a cluster to a recent point in time without restoring from a backup. It complements backups rather than replacing them.
[ ] Aurora Global Database (When Necessary): Use only when low-latency global replication is required.
[ ] Redshift Query Optimization: Rewrite complex queries to reduce data scanned and compute time.
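For the RDS right-sizing item, one common heuristic is to look at a high percentile of CPU utilization rather than the average, so that short peaks are not hidden. The sketch below flags an instance as a downsizing candidate when its 95th-percentile CPU stays under a threshold. The 40% threshold and the sample values are assumptions, and a real review would also consider memory, IOPS, and connection counts.

```python
import math


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_downsize_candidate(cpu_samples, p95_threshold: float = 40.0) -> bool:
    """True when 95th-percentile CPU utilization (%) is below the threshold."""
    return percentile(cpu_samples, 95) < p95_threshold


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (%), e.g. hourly averages.
    quiet_instance = [10, 12, 15, 11, 14]
    spiky_instance = [10, 12, 15, 11, 80]
    print("quiet instance candidate:", is_downsize_candidate(quiet_instance))
    print("spiky instance candidate:", is_downsize_candidate(spiky_instance))
```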
V. Networking Optimization
Read about IPv6 vs IPv4: https://blog.2minutestreaming.com/p/basic-aws-networking-costs
Read about keeping data transfers local: https://www.dhirubhai.net/pulse/keeping-data-transfers-local-matias-undurraga-breitling-dmlye/
[ ] Elastic IP Cleanup: Release unassociated Elastic IPs.
[ ] NAT Gateway Cleanup: Identify and delete unused NAT Gateways.
[ ] Minimize Cross-AZ Traffic: Design architecture to reduce data transfer between Availability Zones.
[ ] NAT Instances (Dev/Test): Consider using NAT Instances instead of NAT Gateways in non-critical environments.
[ ] Single-AZ Dev/Test Environments: Configure dev/test environments in a single Availability Zone unless multi-AZ is needed.
[ ] VPC Endpoint Optimization: Review and remove underused VPC Interface Endpoints.
[ ] S3/DynamoDB Gateway Endpoints: Use Gateway Endpoints to avoid data transfer costs through NAT Gateways.
[ ] NAT Gateway Sharing (Non-Prod): Share a single NAT Gateway across multiple VPCs in non-production via Transit Gateways.
[ ] Route 53 TTL Optimization: Set higher TTL values for stable DNS records to reduce the number of billable DNS queries.
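[ ] CloudFront or S3 Transfer Acceleration: Use CloudFront to lower the cost of data going out of AWS. Use S3 Transfer Acceleration for faster long-distance uploads, keeping in mind that it carries an additional per-GB charge.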
[ ] Application Load Balancers: Prefer ALBs over Classic Load Balancers for HTTP/HTTPS traffic.
[ ] ELB Idle Connection Optimization: Adjust idle timeout settings to avoid unnecessary charges.
[ ] Load Balancer Audit: Ensure no idle load balancers are incurring costs.
[ ] Target Group Auto-Scaling: Register Auto Scaling groups with load balancer target groups so capacity follows demand.
[ ] CloudFront Cache Optimization: Use techniques like caching dynamic content and adjusting TTLs.
[ ] CloudFront Regional Edge Caches: Extend caching closer to users with regional edge caches.
[ ] CloudFront Log Analysis: Identify traffic patterns and optimize distribution settings.
[ ] CloudFront Distribution Cleanup: Delete unused or outdated CloudFront distributions.
[ ] IoT Device Communication: Use efficient protocols like MQTT to reduce data transfer.
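The Elastic IP cleanup item is easy to automate as a report. The sketch below is a pure function that takes a response shaped like the output of the EC2 `describe_addresses` call and returns the allocation IDs that have no association. It does not call AWS itself, and the sample response is made up for illustration.

```python
def unassociated_allocation_ids(describe_addresses_response: dict) -> list:
    """Return AllocationIds of Elastic IPs that are not attached to anything.

    An address that is in use carries an "AssociationId" key; an idle one
    does not. Entries without an "AllocationId" are skipped.
    """
    return [
        address["AllocationId"]
        for address in describe_addresses_response.get("Addresses", [])
        if "AllocationId" in address and "AssociationId" not in address
    ]


# Hypothetical response, trimmed to the fields this sketch reads.
SAMPLE_RESPONSE = {
    "Addresses": [
        {"PublicIp": "203.0.113.10", "AllocationId": "eipalloc-aaa",
         "AssociationId": "eipassoc-111"},
        {"PublicIp": "203.0.113.11", "AllocationId": "eipalloc-bbb"},
        {"PublicIp": "203.0.113.12", "AllocationId": "eipalloc-ccc"},
    ]
}

if __name__ == "__main__":
    for allocation_id in unassociated_allocation_ids(SAMPLE_RESPONSE):
        # With a boto3 EC2 client, each ID could be released via
        # release_address(AllocationId=allocation_id) after review.
        print("idle Elastic IP:", allocation_id)
```

Keeping the detection step separate from the release step leaves room for a human review before anything is deleted.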
VI. Monitoring and Logging Optimization
[ ] CloudWatch Log Filtering: Send only essential logs to CloudWatch to reduce ingestion/storage costs.
[ ] CloudWatch Logs Insights: Analyze logs efficiently and identify unnecessary data.
[ ] CloudWatch Log Retention: Define retention policies to automatically delete old logs.
[ ] CloudWatch to S3 Archiving: Export logs to S3 for long-term storage and use lifecycle policies.
[ ] Lambda Logging: Configure Lambda functions to log only warnings and errors.
[ ] CloudWatch Metrics Filters: Create metrics filters only for essential log events.
[ ] AWS Config Exclusions: Record only necessary resources using the "Record with Exclusions" feature.
[ ] AWS Config in Dev Environments: Limit Config Rules and use periodic recording in dev environments.
[ ] Targeted Detailed Monitoring: Use detailed monitoring only where needed to reduce CloudWatch costs.
[ ] Lambda Metrics Analysis: Use CloudWatch to monitor Lambda function performance and optimize configurations.
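For the Lambda logging item, a simple approach is to set the logger level from an environment variable that defaults to warnings, so verbose logging can be switched on temporarily without a code change. This is a minimal sketch using Python's standard `logging` module. The `LOG_LEVEL` variable name, the logger name, and the `order_id` field are hypothetical.

```python
import logging
import os

logger = logging.getLogger("app")
# Default to WARNING so INFO and DEBUG lines are never emitted (or ingested).
logger.setLevel(os.environ.get("LOG_LEVEL", "WARNING").upper())


def handler(event, context):
    """Example Lambda-style handler that logs sparingly."""
    # Dropped at the default level, so it adds no log ingestion cost.
    logger.info("received event: %s", event)
    if "order_id" not in event:
        # Problems are still logged at the default level.
        logger.warning("event has no order_id")
        return {"status": "rejected"}
    return {"status": "ok"}


if __name__ == "__main__":
    print(handler({"order_id": 42}, None))
    print(handler({}, None))
```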
VII. Backup and Recovery Optimization
[ ] Backup Retention Periods: Define appropriate retention periods to avoid storing data longer than needed.
[ ] Avoid Cross-Region Backup Transfers: Keep backups in the same region as source resources unless required.
[ ] Optimized Backup Frequency: Align backup frequency with RTOs and RPOs.
[ ] Incremental Backups: Back up only changes since the last backup.
[ ] Automated Backup Deletion: Delete backups that exceed retention requirements.
[ ] Backup Archiving: Move older backups to lower-cost storage like S3 Glacier.
[ ] Regular Backup Audits: Identify and delete unused or redundant backups.
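The retention and automated deletion items boil down to comparing each backup's creation time with a cutoff. The sketch below shows that selection logic on plain dictionaries. The record shape (`id`, `created`) and the 35-day retention period are assumptions for illustration, and the actual deletion step is left out.

```python
from datetime import datetime, timedelta, timezone


def expired_backups(backups, retention_days: int, now=None) -> list:
    """Return the ids of backups older than the retention period.

    Each backup is a dict with an "id" and a timezone-aware "created" datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [backup["id"] for backup in backups if backup["created"] < cutoff]


# Hypothetical backup inventory.
BACKUPS = [
    {"id": "backup-recent", "created": datetime(2024, 6, 25, tzinfo=timezone.utc)},
    {"id": "backup-may", "created": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "backup-january", "created": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

if __name__ == "__main__":
    reference_time = datetime(2024, 6, 30, tzinfo=timezone.utc)
    print(expired_backups(BACKUPS, retention_days=35, now=reference_time))
```

Passing `now` explicitly keeps the function easy to test and makes dry runs reproducible.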
VIII. Data Analytics Optimization
[ ] Glue Data Partitioning: Partition datasets in Glue to process only relevant data.
[ ] Glue Job Optimization: Analyze job metrics to improve performance and reduce runtime.
[ ] Glue Job Auto-Scaling: Scale resources dynamically based on workload demand.
[ ] Glue Dev Endpoint Cleanup: Remove unused development endpoints.
[ ] Athena Data Partitioning: Partition data in S3 to reduce query costs by limiting data scanned.
[ ] Athena Compressed File Formats: Store data in compressed columnar formats like Parquet or ORC.
[ ] Athena Query Optimization: Select only needed columns/rows and use WHERE clauses effectively.
[ ] Athena Query Result Reuse: Leverage query result reuse to avoid re-running queries.
[ ] Athena Query Cost Monitoring: Track query costs and set usage limits with budget alerts.
[ ] Redshift Spectrum for S3: Query data in S3 directly to avoid loading unnecessary data into clusters.
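Partitioning matters for Athena because queries are billed by data scanned, and a date filter on a partitioned table only reads the matching prefixes. The sketch below builds Hive-style partition prefixes and compares the scan cost of one week against a full year. The bucket name, the 10 GB-per-day partition size, and the price per TB are all assumptions for illustration.

```python
from datetime import date, timedelta


def partition_prefix(table_root: str, day: date) -> str:
    """Hive-style partition path for one day of data."""
    return f"{table_root}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def prefixes_for_range(table_root: str, start: date, end: date) -> list:
    """Partition paths for every day from start to end, inclusive."""
    days = (end - start).days + 1
    return [partition_prefix(table_root, start + timedelta(days=i)) for i in range(days)]


def scan_cost(gb_scanned: float, price_per_tb: float) -> float:
    """Cost of scanning gb_scanned gigabytes at a given price per terabyte."""
    return gb_scanned / 1024 * price_per_tb


if __name__ == "__main__":
    root = "s3://example-bucket/events"   # hypothetical location
    gb_per_day = 10.0                     # hypothetical partition size
    price_per_tb = 5.0                    # hypothetical price in USD
    week = prefixes_for_range(root, date(2024, 1, 1), date(2024, 1, 7))
    print("first partition:", week[0])
    print("one week:", scan_cost(len(week) * gb_per_day, price_per_tb))
    print("full year:", scan_cost(365 * gb_per_day, price_per_tb))
```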
IX. Account Structure and Governance
[ ] Cost-Optimized Region Selection (Dev/Test): Deploy dev/test environments in regions with lower costs.
[ ] Region Blocking: Disable unused AWS regions to prevent accidental deployment and security risks.
[ ] Directory Service Cleanup: Audit and delete unneeded Directory Service instances.
[ ] Least Privilege Access: Limit IAM permissions to the minimum required for tasks.
[ ] IAM User/Role Audits: Regularly review and remove inactive users and roles.
[ ] Multi-Factor Authentication (MFA): Secure access to prevent unauthorized actions and resource misuse.
[ ] Consolidated Billing: Aggregate billing for multiple accounts in an Organization.
[ ] Service Control Policies (SCPs): Limit resource usage and enforce cost policies across accounts.
[ ] Environment Isolation: Separate production, dev, and test into different accounts.
[ ] Per-Account Cost Monitoring: Track and optimize spending for each account using Cost Explorer.
[ ] Private ECR Image Use: Store and use container images privately in ECR unless public distribution is needed.
[ ] S3 Bucket ACL/Permission Review: Ensure proper access controls to prevent unauthorized access and costs.
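Region blocking is often enforced with a Service Control Policy that denies requests outside an allow-list, using the `aws:RequestedRegion` condition key. The sketch below only generates the policy document as JSON. The allowed regions and the list of exempted global services are illustrative assumptions and are not a complete exemption list for a real organization.

```python
import json


def region_restriction_scp(allowed_regions, exempt_actions) -> dict:
    """Build an SCP that denies actions outside the allowed regions.

    exempt_actions lists global services (for example IAM) that must keep
    working even though their requests are not tied to an allowed region.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                "NotAction": list(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": list(allowed_regions)
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = region_restriction_scp(
        allowed_regions=["us-east-1", "eu-west-1"],           # hypothetical choice
        exempt_actions=["iam:*", "organizations:*", "sts:*",  # illustrative subset
                        "route53:*", "cloudfront:*", "support:*"],
    )
    print(json.dumps(policy, indent=2))
```

Testing such a policy on a sandbox organizational unit first is a sensible precaution, since a missing exemption can lock out global services.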
X. Automation and Tooling
[ ] AWS Cost Explorer: Regularly use Cost Explorer to identify high-cost areas and trends.
[ ] AWS Trusted Advisor: Check for cost optimization recommendations.
[ ] Detailed Billing Reports: Use Cost and Usage Reports for in-depth cost analysis.
[ ] AWS Compute Optimizer: Get recommendations for EC2, Lambda, and EBS resource right-sizing.
[ ] AWS Budget Alerts: Set budget thresholds and receive notifications for proactive cost control.
[ ] Savings Plans Recommendations: Explore recommendations in the Cost Management Console.
[ ] Automated Instance Scheduling: Use Systems Manager or other tools to start/stop instances based on schedules.
[ ] Automated Shutdowns (Non-Prod): Schedule non-production resources to shut down during off-hours.
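The scheduling items are worth quantifying, because a non-production environment that only runs during working hours is off for most of the week. The sketch below shows the decision a scheduler would make at a given moment and the share of weekly hours saved. The 08:00 to 20:00 weekday window is an assumption, times are treated as local, and the actual stop/start calls are left to whichever tool you use.

```python
from datetime import datetime

WEEKDAYS = (0, 1, 2, 3, 4)  # Monday=0 ... Friday=4, as in datetime.weekday()


def should_be_running(now: datetime, start_hour: int = 8, stop_hour: int = 20,
                      workdays=WEEKDAYS) -> bool:
    """True when 'now' falls inside the working window on a working day."""
    return now.weekday() in workdays and start_hour <= now.hour < stop_hour


def weekly_savings_fraction(start_hour: int = 8, stop_hour: int = 20,
                            workdays=WEEKDAYS) -> float:
    """Share of the 168 weekly hours during which resources are stopped."""
    running_hours = (stop_hour - start_hour) * len(workdays)
    return 1 - running_hours / 168


if __name__ == "__main__":
    print("Monday 10:00:", should_be_running(datetime(2024, 1, 15, 10, 0)))
    print("Monday 22:00:", should_be_running(datetime(2024, 1, 15, 22, 0)))
    print("Saturday 10:00:", should_be_running(datetime(2024, 1, 13, 10, 0)))
    print("hours saved per week:", round(weekly_savings_fraction() * 168))
```

The savings fraction applies to hourly compute charges only; attached storage such as EBS volumes is still billed while instances are stopped.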