
Evolving FinOps: From Manual Monitoring to ML-Powered Optimization

Hassan Shuman

CTO | GenAI Pioneer | AWS & Azure Expert Transforming Enterprises with GenAI, Cloud Migration, and Innovation | CIO/CTO Advisor | ex-IBM, Accenture,

Published: August 18, 2023

What is FinOps?

FinOps is a set of practices focused on cloud cost management and optimization. It involves cross-functional collaboration between finance, engineering, and cloud teams to provide visibility into spending, allocate costs, and continuously optimize the cloud environment.

In many organizations, FinOps still relies heavily on manual processes:

Finance teams download detailed cost and usage reports from the cloud provider to analyze spending trends across services. This provides basic visibility.
Engineers may review cloud architecture diagrams and configs during development for cost optimization best practices. But this is often an afterthought.
Cloud admins monitor dashboards and alerts for overspending symptoms like spikes. But the analysis to find root causes is manual.
Unused resources get identified through manually scanning for low utilization. Right-sizing happens reactively.
Cost allocation requires engineering teams to manually add tags and attributes to resources for finance to report on.

There are clear challenges with the manual approach:

Analysing huge amounts of granular cloud spend data is time-consuming and error-prone, so insights arrive late.
Reviewing configs and architectures for cost optimization rarely happens proactively. Issues get flagged late.
Optimizing spend reactively means costs have already been incurred before action is taken.
Lack of standards and a reactive culture make cost efficiency difficult to scale across environments.

How Machine Learning will help

There are many ways to automate FinOps activities. Here are some examples:

Predicting Resource Requirements

Build time series forecasting models on historical usage data for resources like EC2, RDS, and Lambda.
Models can detect trends, seasonal patterns, and cycles in usage over time.
Forecasts provide expected resource requirements for coming weeks and months.
Cloud providers offer ML services like Amazon Forecast and Azure ML to build models.
Models enable proactive planning of instance sizes, RIs, and resource needs vs reactive guesses.
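The forecasting idea above can be illustrated with a minimal sketch. It is not what Amazon Forecast or Azure ML run internally; it is a simple hand-rolled baseline that assumes usage has a fixed seasonal period (for example, 7 for daily data with a weekly cycle) and a roughly linear trend. The function name `seasonal_forecast` and its parameters are hypothetical.

```python
from statistics import mean


def seasonal_forecast(history, season_length, horizon):
    """Forecast `horizon` future points from a list of past usage values.

    Assumption: the series is a repeating seasonal profile plus a linear
    trend. This is a baseline sketch, not a production forecasting model.
    """
    if len(history) < 2 * season_length:
        raise ValueError("need at least two full seasons of history")

    # Keep only whole seasons, taken from the end of the series.
    n_seasons = len(history) // season_length
    data = history[-n_seasons * season_length:]
    seasons = [data[i * season_length:(i + 1) * season_length]
               for i in range(n_seasons)]

    # Trend: average change in the seasonal mean from one season to the next.
    season_means = [mean(s) for s in seasons]
    trend_per_season = (season_means[-1] - season_means[0]) / (n_seasons - 1)

    # Seasonal profile: average deviation of each position from its season mean.
    profile = [mean(s[pos] - m for s, m in zip(seasons, season_means))
               for pos in range(season_length)]

    forecast = []
    for step in range(horizon):
        seasons_ahead = step // season_length + 1
        level = season_means[-1] + trend_per_season * seasons_ahead
        forecast.append(level + profile[step % season_length])
    return forecast
```

With two or more weeks of daily usage figures, `seasonal_forecast(history, 7, 7)` returns the expected values for the following week, which can then feed sizing or RI decisions.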

Detecting Optimization Opportunities

Apply unsupervised ML techniques like clustering and anomaly detection on usage data.
Identify low utilization resources by detecting outliers in usage volume and patterns.
Detect usage or spend spikes that deviate significantly from baselines.
Build regression models to estimate expected costs for workloads. Compare to actuals.
Cloud provider tools can assist, like AWS Cost Anomaly Detection.
Models identify optimization opportunities proactively vs analysing historically.
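As a rough illustration of the spike and low-utilization checks listed above, the sketch below uses a robust statistical rule (the modified z-score based on the median absolute deviation) rather than a trained model. It is unrelated to how AWS Cost Anomaly Detection works internally. The function names, the 3.5 threshold, and the 5% CPU cut-off are assumptions chosen for the example.

```python
from statistics import median


def find_spend_anomalies(daily_costs, threshold=3.5):
    """Return (index, cost) pairs whose modified z-score exceeds `threshold`.

    The baseline is the median, and spread is the median absolute
    deviation (MAD), so a single large spike does not distort the baseline.
    """
    med = median(daily_costs)
    mad = median(abs(c - med) for c in daily_costs)
    if mad == 0:
        # No spread at all: this rule cannot score deviations.
        return []
    anomalies = []
    for i, cost in enumerate(daily_costs):
        score = 0.6745 * (cost - med) / mad
        if abs(score) > threshold:
            anomalies.append((i, cost))
    return anomalies


def find_idle_resources(avg_cpu_by_resource, max_cpu_percent=5.0):
    """Return resource IDs whose average CPU utilization is below the cut-off."""
    return sorted(resource for resource, cpu in avg_cpu_by_resource.items()
                  if cpu < max_cpu_percent)
```

For example, `find_spend_anomalies([100, 102, 98, 101, 99, 100, 250, 103])` flags only the day with a cost of 250.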

Recommending Optimal Configurations

Train models on historical data of past resource sizing decisions and resulting utilization.
For new apps or use cases, the model recommends the optimal instance type, DB size, etc. based on predicted usage.
The model can optimize for cost, performance, or a balance of both.
Continuously improve models by incorporating new data on size choices and impact.
Helps eliminate manual guesswork in configuring resources.
Cloud providers offer ML-powered recommendation services like AWS Compute Optimizer.
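The final selection step of such a recommender can be sketched as follows. This is not a trained model and not how AWS Compute Optimizer works; it only shows how a predicted usage figure might be turned into a sizing choice by picking the cheapest option that covers the prediction plus some headroom. The `InstanceOption` catalog, its names, and its prices are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceOption:
    name: str
    vcpus: int
    memory_gib: float
    hourly_cost: float


def recommend_instance(options, predicted_vcpus, predicted_memory_gib,
                       headroom=0.2):
    """Pick the cheapest option that covers predicted usage plus headroom.

    Returns None when no option in the catalog is large enough.
    """
    need_cpu = predicted_vcpus * (1 + headroom)
    need_mem = predicted_memory_gib * (1 + headroom)
    candidates = [o for o in options
                  if o.vcpus >= need_cpu and o.memory_gib >= need_mem]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.hourly_cost)


# Hypothetical catalog; names and prices are illustrative only.
CATALOG = [
    InstanceOption("small", 2, 8, 0.10),
    InstanceOption("medium", 4, 16, 0.20),
    InstanceOption("large", 8, 32, 0.40),
]
```

Raising `headroom` biases the choice toward performance, while lowering it biases toward cost.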

Intelligent Reserved Instance Planning

Analyse usage trends, cyclical demand patterns, and growth rates using time series models.
Project future resource needs based on product roadmaps and timelines.
Recommend RI purchase quantity, term length, and renewal timing based on forecasts.
Optimize the trade-off between RI discount savings and unused capacity risk.
Services like AWS Cost Explorer provide RI recommendations powered by ML algorithms.
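The trade-off between discount savings and unused capacity can be made concrete with a small sketch. It assumes a simplified pricing model in which a reserved instance is billed at a flat hourly rate whether or not it is used, and any demand above the reserved count is billed on demand. The rates and function names are hypothetical, and real RI pricing (upfront payments, term lengths, convertibility) is not modeled.

```python
def ri_plan_cost(hourly_demand, reserved_count, on_demand_rate, reserved_rate):
    """Total cost of serving `hourly_demand` with `reserved_count` RIs.

    Reserved capacity is paid for every hour, used or not; demand above
    the reserved count overflows to on-demand pricing.
    """
    cost = 0.0
    for demand in hourly_demand:
        cost += reserved_count * reserved_rate
        cost += max(0, demand - reserved_count) * on_demand_rate
    return cost


def best_reserved_count(hourly_demand, on_demand_rate, reserved_rate):
    """Return the RI quantity that minimizes total cost for the given demand."""
    if not hourly_demand:
        raise ValueError("hourly_demand must not be empty")
    candidates = range(0, max(hourly_demand) + 1)
    return min(candidates,
               key=lambda n: ri_plan_cost(hourly_demand, n,
                                          on_demand_rate, reserved_rate))
```

In practice `hourly_demand` would be a forecast rather than history, so the recommended quantity reflects expected future usage.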

How this can be deployed

Many of the native ML services on AWS and Azure can reduce manual intervention. They make it possible to delete unused servers, schedule shutdowns, and right-size virtual machines automatically, without sending every change through design reviews and the change management process.

AWS

Use Amazon Forecast's pre-built algorithms for forecasting future EC2 and RDS capacity needs. Input relevant time series data like past CPU utilization, network I/O, and memory consumption as predictors. Amazon Forecast will analyze trends and seasonal patterns in the data to generate forecasts.
In AWS Cost Explorer, enable the anomaly detection feature. This will automatically monitor your spend data across services and resources to identify unusual variances or spikes compared to past trends. Review the anomaly detection notifications and drill into specifics on which services or resources are contributing to the unusual spend.
Build custom forecasting models in Amazon SageMaker using algorithms like DeepAR (an autoregressive recurrent neural network; DeepAR+ is the variant offered in Amazon Forecast) or Prophet (Facebook's open source time series forecasting model, run as custom code) on granular time series billing, usage and metrics data from AWS Cost and Usage Reports. Train the models to predict expected resource requirements for EC2, RDS, Lambda etc. for future periods.
Create Lambda functions that are triggered by CloudWatch alarms when utilization metrics like CPU usage breach certain thresholds. The Lambda function can then take actions like automatically right-sizing EC2 instances or stopping/starting RDS databases based on the usage predictions and recommendations generated by your machine learning models.
Use Amazon Athena to query and analyze granular cost and usage data stored in S3, filtering and aggregating by tags. Visualize the Athena query results in QuickSight dashboards showing spend trends and breakdowns by resource types, applications, departments etc. This provides allocation and showback of costs.
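The alarm-driven right-sizing step described above might look roughly like the sketch below. The EC2 client is passed in as a parameter so the logic can be exercised without AWS access; in a real Lambda function it would typically be created with `boto3.client("ec2")`. The event shape (`{"instance_id": ...}`) and the `recommendations` mapping are simplifying assumptions, not the actual CloudWatch alarm payload or the output format of any AWS service.

```python
def resize_instance(ec2_client, instance_id, new_type):
    """Stop an EC2 instance, change its instance type, and start it again."""
    ec2_client.stop_instances(InstanceIds=[instance_id])
    # The instance type can only be changed while the instance is stopped.
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
    ec2_client.start_instances(InstanceIds=[instance_id])


def handle_alarm(event, ec2_client, recommendations):
    """Apply a right-sizing recommendation for the instance named in `event`.

    `event` uses a hypothetical simplified shape: {"instance_id": "..."}.
    `recommendations` maps instance IDs to target instance types and is
    assumed to come from your own forecasting or recommendation model.
    """
    instance_id = event["instance_id"]
    new_type = recommendations.get(instance_id)
    if new_type is None:
        return {"instance_id": instance_id, "action": "none"}
    resize_instance(ec2_client, instance_id, new_type)
    return {"instance_id": instance_id, "action": "resized",
            "new_type": new_type}
```

Because resizing restarts the instance, a sketch like this suits non-production or fault-tolerant workloads unless additional safeguards are added.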

Azure

Apply Azure Machine Learning's AutoML time series forecasting capabilities using the TCNForecaster algorithm on resource usage telemetry and metrics from Azure Monitor logs. This will automatically analyze trends, cycles, and seasonal patterns in the historical data to predict future capacity requirements for VMs, SQL DBs etc.
Ingest granular Azure billing data into Azure Log Analytics workspaces. Build anomaly detection machine learning models using the Log Analytics API to analyze the data and identify outliers or unusual spikes in spend compared to historical trends.
Use the Anomaly Detector API from Azure Cognitive Services on time series cost and usage data. The API will detect anomalies across multiple dimensions like resource types, environments, departments etc. flagged by tags and metadata.
Review and consume Azure Advisor recommendations on optimizing spend through actions like right-sizing underutilized resources, purchasing reserved instances, and adjusting resource configurations. Retrieve the recommendations programmatically through the Azure Advisor REST API and apply them at scale with your own automation.
Create Azure Policy definitions with effects such as "Deny", "Audit", or "DeployIfNotExists" that enforce machine learning recommended configurations like VM SKU sizes and SQL database sizing for cost optimization when new resources are deployed.
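To illustrate the policy-based enforcement mentioned above, the sketch below builds an Azure Policy definition as a Python dictionary that restricts new VMs to an approved list of sizes. It uses the "deny" effect, which is a common way to restrict VM sizes, rather than "DeployIfNotExists". The function name, display name, and the idea that the SKU list comes from an ML recommendation step are assumptions for the example.

```python
import json


def allowed_vm_sku_policy(allowed_skus):
    """Build a policy definition that denies VMs outside `allowed_skus`."""
    return {
        "properties": {
            "displayName": "Allow only cost-approved VM sizes (example)",
            "mode": "Indexed",
            "parameters": {
                "allowedSkus": {
                    "type": "Array",
                    "defaultValue": list(allowed_skus),
                }
            },
            "policyRule": {
                "if": {
                    "allOf": [
                        {"field": "type",
                         "equals": "Microsoft.Compute/virtualMachines"},
                        # Match VMs whose size is NOT in the approved list.
                        {"not": {
                            "field": "Microsoft.Compute/virtualMachines/sku.name",
                            "in": "[parameters('allowedSkus')]",
                        }},
                    ]
                },
                "then": {"effect": "deny"},
            },
        }
    }


if __name__ == "__main__":
    policy = allowed_vm_sku_policy(["Standard_B2s", "Standard_D2s_v5"])
    print(json.dumps(policy, indent=2))
```

The resulting JSON could then be reviewed and registered through whichever deployment tooling the organization already uses for policy.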

The Future with Generative AI

The capabilities of ML and automation for FinOps will expand even further with future advancements in generative AI. Some potential areas where generative models could provide additional benefits:

Natural language processing of product requirements documents, code comments, and architecture designs could allow AI systems to automatically extract cost optimization considerations and suggest improvements early in the development lifecycle.
Large language models like GPT-3 could potentially generate optimized cloud infrastructure configurations and architecture designs for new applications and use cases based on a few prompts describing required performance, scalability, and cost guardrails.
Chatbots and digital assistants leveraging generative AI could provide real-time FinOps advice and recommendations customized to an engineer's specific workload and context when provisioning resources or diagnosing issues.
Generative AI could help rapidly analyze huge volumes of granular cost data, draw causal insights, identify optimization hypotheses, and generate reports in natural language.
Testing and simulating potential optimization strategies to model their impact could be accelerated by orders of magnitude using AI simulation systems.

In summary, generative AI has the potential to take cloud cost management and FinOps practices to new levels in terms of automation, customization, and intelligence augmentation. Organizations should follow developments in this emerging space closely to identify new opportunities for efficiency gains. But as always, balancing innovation with rigorous testing and governance will be key.

Harry Dove

Helping customers with their cyber security

1 year ago

This is all good stuff. I believe that there are three main stages when looking at optimisation of cloud costs:
1. Cloud Native optimisation: This includes using the features and recommendations available natively in the specified cloud. This should deliver a level of optimisation, but will likely leave a large scope for further optimisation.
2. Cloud Additive optimisation: Using tools available in the market to provide additional optimisation; this could include solutions focussed on specific areas, e.g. compute or storage. At this stage it is wise to sense-check the methodologies that you are following and explore whether there are alternative ways of provisioning in the cloud that deliver cost savings whilst balancing and minimising any administrative or operational overhead that the saving incurs.
3. Inefficient optimisation: This is realising the point at which further optimisation creates additional cost elsewhere and so any gains achieved are likely negated.
I think that using ML or GenAI models at this stage ventures into risky territory when you consider the impact that an incorrect recommendation could have. I feel a lot safer using industry tools that are independent of the cloud in their place for the time being.

Dharmesh Mistry

CIO Advisory Cloud Consultant

1 year ago

Great article summarising a very large topic that spans across organisations and has the potential, with the right approach from the best teams, to reap great benefits. A couple of the biggest challenges for organisations are defining an achievable scope and how to change in this digital revolution.


Hassan Shuman

CTO | GenAI Pioneer | AWS & Azure Expert Transforming Enterprises with GenAI, Cloud Migration, and Innovation | CIO/CTO Advisor | ex-IBM, Accenture,

1 year ago

Rakhi Gupta who is also leading the FinOps capability globally
