AWS Compute Optimizer and the art & science of optimization
Before and during re:Invent 2019, Amazon unveiled over 77 new products and capabilities. One of the offerings, quietly announced in a blog post on December 3rd, the first day of re:Invent, was AWS Compute Optimizer.
AWS Compute Optimizer is a new addition to AWS's expanding set of native tools focused on helping customers regain control of their cloud bills. The others are AWS Trusted Advisor and the Amazon EC2 Resource Optimization recommendations in Cost Explorer, which were introduced in July 2019.
In this post, we will take an in-depth look at AWS Compute Optimizer and its capabilities, compare it to the existing offerings from AWS, and answer the question: is it enough?
AWS Compute Optimizer Overview
Compute Optimizer is a Machine Learning (ML) based tool that analyzes the CloudWatch metrics of EC2 instances and Auto Scaling groups and generates recommendations to help users choose the optimal instance types for their workloads.
Analyzed Metrics
By default, Compute Optimizer analyzes CPU, storage I/O, and network I/O utilization (ingress and egress across all NICs) collected from CloudWatch. Users can enable OS-level memory metrics by installing and configuring the CloudWatch Agent.
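For reference, a minimal CloudWatch Agent configuration for collecting memory utilization on Linux might look like the following (an illustrative fragment, not a complete agent configuration; Windows uses a different metric set):

```json
{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      }
    }
  }
}
```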
If memory metrics are not collected, AWS promises that the tool will try not to reduce the memory capacity assigned to EC2 instances. This is an improvement over the Cost Explorer rightsizing recommendations.
Amazon recommends enabling detailed monitoring in CloudWatch, which reduces the data-collection interval from five minutes to one minute (note that detailed monitoring in CloudWatch results in extra charges).
Observation Period
Recommendations are based on the last 14 days of data; the observation period is not configurable. Instances must be running for at least 30 hours (in some cases up to 60 hours) before recommendations are generated.
Recommendations
The first recommendations show up for eligible EC2 workloads within 12 hours of opting in and enabling the service on your account; after that, recommendations refresh daily. The recommendations draw on five instance families: M (General Purpose), C (Compute Optimized), R (Memory Optimized), T (General Purpose with burstable capabilities), and X (Memory Optimized, ideal for in-memory DBs). Instances running unsupported instance types, such as Accelerated Computing (P, G, and F) or Storage Optimized (D, I, and H), will not receive recommendations.
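To make this concrete, here is a short sketch of what opting in and pre-filtering by family could look like with boto3. The opt-in call uses the real Compute Optimizer `UpdateEnrollmentStatus` API; the family filter is our own illustrative helper, not part of any AWS SDK:

```python
# Families Compute Optimizer generates recommendations for (per the post above).
SUPPORTED_FAMILIES = {"m", "c", "r", "t", "x"}


def is_supported(instance_type: str) -> bool:
    """Illustrative check: does this instance type belong to a supported family?

    e.g. 'm5.large' -> family 'm5' -> leading letter 'm' -> supported,
         'p3.2xlarge' -> leading letter 'p' -> not supported.
    """
    family = instance_type.split(".")[0]  # 'm5.large' -> 'm5'
    return family[0].lower() in SUPPORTED_FAMILIES


def opt_in():
    """Opt the current account in to Compute Optimizer.

    boto3 is imported lazily; calling this requires AWS credentials.
    """
    import boto3
    client = boto3.client("compute-optimizer")
    return client.update_enrollment_status(status="Active")
```

Once enrolled, recommendations can be pulled programmatically with the client's `get_ec2_instance_recommendations` call rather than through the console.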
The service analyzes the usage of EC2 instances, categorizes each one as follows (using AWS's own language), and generates sizing recommendations as needed:
- Under-provisioned: An EC2 instance is considered under-provisioned when at least one specification of your instance, such as CPU, memory, or network, does not meet the performance requirements of your workload.
- Over-provisioned: An EC2 instance is considered over-provisioned when at least one specification of your instance, such as CPU, memory, or network, can be sized down while still meeting the performance requirements of your workload, and when no specification is under-provisioned.
- Optimized: An EC2 instance is considered optimized when all specifications of your instance, such as CPU, memory, and network, meet the performance requirements of your workload, and the instance is not over-provisioned. An optimized EC2 instance runs your workloads with optimal performance and infrastructure cost. For optimized resources, Compute Optimizer might sometimes recommend a new generation instance type.
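The precedence in these definitions can be sketched as a simple rule: any under-provisioned dimension wins, otherwise any over-provisioned dimension, otherwise optimized. The sketch below is purely illustrative; AWS's actual classification is driven by its ML analysis, not a lookup like this:

```python
def classify(specs: dict) -> str:
    """Classify an instance from per-specification findings.

    `specs` maps each dimension ('cpu', 'memory', 'network') to one of
    'under', 'over', or 'ok' -- an illustrative stand-in for the real
    per-metric analysis.
    """
    findings = set(specs.values())
    if "under" in findings:
        return "Under-provisioned"  # any unmet requirement takes priority
    if "over" in findings:
        return "Over-provisioned"   # can size down while meeting requirements
    return "Optimized"
```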
When the tool generates a recommendation for an EC2 instance, it presents the user with three instance-type options to choose from.
Auto Scaling group recommendations are simpler, classifying groups as either optimized or not optimized.
Executing a recommendation is manual. The tool provides a direct link to the respective instance page, but users still need to stop the instance, select the new instance type, and start the instance again by hand.
Analytics Engine
AWS didn't share a lot of details on the rightsizing analytics engine powering its new tool. What we do know is that it is based on Machine Learning. It analyzes the EC2 workloads' utilization across the supported metrics and uses insights from millions of workloads running on AWS to make its recommendations on which instance size to use.
Amazon did not provide details on how observed peaks and average utilization are weighted when determining what, precisely, an optimized, under-provisioned, or over-provisioned workload is, but it is safe to assume these weights are not configurable.
Lastly, it is not clear whether the tool considers existing Reserved Instances (RIs) or Savings Plans when it generates recommendations, or whether it accounts for the pending RI or Savings Plans purchase recommendations offered in Cost Explorer. This is important because sizing workloads without assessing the current RI inventory may result in higher on-demand charges, especially when doing cross-family sizing.
AWS Compute Optimizer vs. existing offerings from AWS
Compute Optimizer provides additional functionality compared to the other tools from AWS, namely AWS Trusted Advisor and the Cost Explorer EC2 rightsizing recommendations.
The common elements among all three tools are the non-configurable 14-day observation period and the lack of any user customization of the rightsizing analytics.
AWS Trusted Advisor identifies and alerts when an instance's CPU utilization crosses a threshold, flagging efficiency opportunities or performance risks. For example, if an instance's CPU utilization was 10% or less and its network I/O was 5 MB or less for four or more days, it is marked as under-utilized; if CPU utilization was more than 90% for four or more days, it is marked as over-utilized. These thresholds cannot be modified, and users still have to determine which new instance type to use.
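Those fixed thresholds are simple enough to express in a few lines. The following is an illustrative approximation of the rule, not Trusted Advisor's actual implementation:

```python
def trusted_advisor_flags(daily_cpu, daily_net_mb):
    """Approximate Trusted Advisor's fixed EC2 utilization thresholds.

    daily_cpu: list of daily CPU utilization percentages
    daily_net_mb: list of daily network I/O totals in MB (same length)
    """
    # Days where CPU <= 10% AND network I/O <= 5 MB count toward under-utilized.
    low_days = sum(1 for cpu, net in zip(daily_cpu, daily_net_mb)
                   if cpu <= 10 and net <= 5)
    # Days where CPU > 90% count toward over-utilized.
    high_days = sum(1 for cpu in daily_cpu if cpu > 90)
    return {
        "under_utilized": low_days >= 4,  # four or more low days
        "over_utilized": high_days >= 4,  # four or more high days
    }
```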
The relatively new EC2 rightsizing recommendation service under Cost Explorer is also threshold-based. Any instance with a CPU peak below 1% over the last 14 days is marked as idle, with a recommendation to terminate it (assuming it is not needed). Instances with a CPU peak between 1% and 40% receive a downsize recommendation, but only within the same instance family (for example, from m4.xlarge to m4.large).
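Again, the decision rule is a simple threshold ladder. The sketch below is illustrative only (the exact boundary handling at 1% and 40% is our assumption):

```python
def cost_explorer_recommendation(cpu_peak_pct: float) -> str:
    """Illustrative version of Cost Explorer's 14-day CPU-peak thresholds."""
    if cpu_peak_pct < 1:
        return "terminate (idle)"
    if cpu_peak_pct <= 40:
        # Downsize stays within the same family, e.g. m4.xlarge -> m4.large.
        return "downsize within family"
    return "no action"
```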
With Compute Optimizer, AWS now offers cross-family rightsizing recommendations (within the five instance families it supports). Compute Optimizer also extends recommendations beyond EC2 to Auto Scaling groups.
Is Compute Optimizer enough?
No. Compute Optimizer is designed to help AWS users select the right instance types for their workloads. However, when it comes to the "last mile" of choosing the optimal instance and executing the rightsizing, the onus is on the user.
Furthermore, rightsizing is only one of several methods of reducing cloud costs, and it should never be done in isolation from the other available options, such as purchasing Reserved Instances (or the new AWS Savings Plans) or stopping instances when they are not needed.
The cloud is complex, applications are growing more complex, and customers keep introducing more of them. This presents a challenge far beyond human scale. Leveraging CloudWatch is a great start, and collecting OS-level metrics via the CloudWatch Agent or a third-party tool is better. However, the most effective approach is to let the application drive resource allocation, and this is where Turbonomic Application Resource Management comes in.
Turbonomic Application Resource Management offers the most powerful and advanced optimization functionality of any solution in the market. Here are a few examples to support this bold claim:
- Top-down, application-aware approach - Turbonomic is designed first and foremost to assure application performance. Turbonomic offers multiple agentless options to connect to, ingest, and act upon application-level metrics such as heap utilization, database memory, and application response times. The best way to reduce application costs is to focus on application performance first: when you allocate exactly the resources your application needs, when it needs them, you naturally achieve cost reduction without introducing performance risk. Simple as that!
- You need trustworthy actions, not recommendations - as stated, when using the native cloud offerings, users still need to decide exactly which instance type to use, confirm it meets any constraints (such as driver support for enhanced networking (ENA) or NVMe storage, and the impact on storage and network), and then execute the change manually during the next maintenance window. This is not sustainable or scalable! Turbonomic generates trustworthy actions that can (and should) be scrutinized by users before executing them from the Turbonomic UI, and, once comfortable, fully automated or even integrated with tools like ServiceNow for an approval workflow and audit trail.
- Different workloads require different optimization approaches - although Compute Optimizer uses ML to determine users' workload types and match them with a few instance families, users told us they still want to configure the rightsizing engine to match their workload types. They also want to control the observation period, the aggressiveness toward peaks (with percentiles), and the target utilization of resources, and to be able to apply these different policies to specific apps, accounts, or any group of resources created based on tags, for example.
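To illustrate what "aggressiveness toward peaks (with percentiles)" means in practice: sizing to the 95th percentile instead of the absolute maximum ignores rare spikes, so a latency-tolerant batch workload can be sized more aggressively than a customer-facing one. A hypothetical helper (our own sketch, not any vendor's API) using the nearest-rank percentile method:

```python
import math


def utilization_target(samples, percentile=95.0):
    """Pick the utilization level to size for, using a configurable
    percentile of the observed samples instead of the absolute peak.

    samples: utilization percentages, one per collection interval
    """
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest value covering `percentile`%
    # of the samples (percentile=100 reduces to the absolute peak).
    rank = max(0, math.ceil(percentile / 100 * len(ordered)) - 1)
    return ordered[rank]
```

With samples containing one brief spike to 100%, a p95 target sizes for typical load while a p100 (peak-based) target sizes for the spike, which is exactly the tradeoff users want to control per workload.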
It is also worth mentioning that unlike Compute Optimizer, which generates and refreshes recommendations daily, Turbonomic's actions are produced continuously in real-time as demand fluctuates.
Furthermore, Turbonomic generates sizing actions after considering the user's Reserved Instance inventory. For example, if an instance is using on-demand pricing and a matching RI sits unused, Turbonomic will generate a sizing action to move the workload onto that RI, even if the workload itself does not need to be resized. By taking this action, you eliminate on-demand costs and avoid losses on unutilized RIs for which you have already pre-paid.
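The underlying idea can be sketched in a few lines: soak up unused RI capacity with resize actions for workloads currently billed on-demand. This is a toy illustration of the concept, not Turbonomic's algorithm, and it omits the performance checks a real decision engine would apply before resizing:

```python
def ri_aware_actions(workloads, unused_ris):
    """Sketch: generate resize actions that soak up unused RI capacity.

    workloads: list of (name, current_instance_type) running on-demand
    unused_ris: dict mapping instance_type -> count of unused RIs
    """
    actions = []
    remaining = dict(unused_ris)
    for name, current in workloads:
        for ri_type, count in remaining.items():
            if count > 0:
                remaining[ri_type] -= 1
                if ri_type != current:
                    # Resize onto the RI-covered type (performance
                    # suitability check omitted in this sketch).
                    actions.append((name, f"resize {current} -> {ri_type}"))
                break
    return actions
```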
Lastly, Turbonomic is a hybrid and multicloud system: it supports on-premises workloads running on any hypervisor, private cloud, or container platform, such as Pivotal Cloud Foundry or Kubernetes. Turbonomic also supports multicloud management, including AWS and Azure IaaS and PaaS services (including managed Kubernetes and databases), as well as Google Cloud Platform (GCP) for GKE optimization.
In summary, we should expect more native cost-optimization solutions from cloud providers as customers continue to voice their frustration with rising cloud bills. However, we should remember that cloud cost overruns are rooted in the fact that many operations teams default to the old (on-premises) habit of overprovisioning resources to assure performance (it is easier), and the only way to manage the tradeoff between performance and cost is with Application Resource Management.