What is FinOps?
FinOps is a set of practices focused on cloud cost management and optimization. It involves cross-functional collaboration between finance, engineering, and cloud teams to provide visibility into spending, allocate costs, and continuously optimize the cloud environment.
In many organizations, FinOps still relies heavily on manual processes:
- Finance teams download detailed cost and usage reports from the cloud provider to analyze spending trends across services. This provides basic visibility.
- Engineers may review cloud architecture diagrams and configs during development for cost optimization best practices. But this is often an afterthought.
- Cloud admins monitor dashboards and alerts for overspending symptoms like spikes. But the analysis to find root causes is manual.
- Unused resources get identified through manually scanning for low utilization. Right-sizing happens reactively.
- Cost allocation requires engineering teams to manually add tags and attributes to resources for finance to report on.
There are clear challenges with the manual approach:
- Analyzing huge volumes of granular cloud spend data is time-consuming and error-prone, so insights arrive late.
- Reviewing configs and architectures for cost optimization rarely happens proactively. Issues get flagged late.
- Optimizing spend reactively means costs have already been incurred before action is taken.
- A lack of standards and a reactive culture make cost efficiency difficult to scale across environments.
How Machine Learning will help
There are many ways to automate FinOps activities with machine learning. Here are a few examples:
Predicting Resource Requirements
- Build time series forecasting models on historical usage data for resources like EC2, RDS, and Lambda.
- Models can detect trends, seasonal patterns, and cycles in usage over time.
- Forecasts provide expected resource requirements for coming weeks and months.
- Cloud providers offer ML services like Amazon Forecast and Azure Machine Learning to build these models.
- Models enable proactive planning of instance sizes, RIs, and resource needs vs reactive guesses.
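As a rough illustration of the forecasting idea above, the sketch below fits a straight-line trend and adds an average weekly offset to project usage forward. It is a deliberately simple stand-in for the managed services and models mentioned in this section, not how any of them work internally. The function names, the 7-day period, and the shape of the input (one usage number per day) are all assumptions made for the example.

```python
from statistics import mean


def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast(values, horizon, period=7):
    """Project `horizon` future points as linear trend plus mean seasonal offset.

    `period` is the assumed season length (7 = weekly pattern on daily data).
    """
    if len(values) < 2 * period:
        raise ValueError("need at least two full seasons of history")
    slope, intercept = fit_trend(values)
    # What is left after removing the trend is treated as the seasonal signal.
    residuals = [y - (intercept + slope * x) for x, y in enumerate(values)]
    seasonal = [mean(residuals[p::period]) for p in range(period)]
    n = len(values)
    return [
        intercept + slope * x + seasonal[x % period]
        for x in range(n, n + horizon)
    ]
```

Calling `forecast(daily_cpu_hours, horizon=14)` on a list of daily totals would return fourteen projected values. A production model would also need to handle holidays, changes in level, and uncertainty bounds, which this sketch ignores.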
Detecting Optimization Opportunities
- Apply unsupervised ML techniques like clustering and anomaly detection on usage data.
- Identify low utilization resources by detecting outliers in usage volume and patterns.
- Detect usage or spend spikes that deviate significantly from baselines.
- Build regression models to estimate expected costs for workloads. Compare to actuals.
- Cloud provider tools can assist, like AWS Cost Anomaly Detection.
- Models surface optimization opportunities proactively, rather than only after historical analysis.
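To make the spike detection idea concrete, here is a minimal sketch that compares each day's cost with a trailing baseline using the modified z-score (median and median absolute deviation), which is less distorted by earlier spikes than a mean and standard deviation. The function name, the 14-day window, the 3.5 threshold, and the `min_mad` floor are assumptions for illustration and would need tuning on real billing data.

```python
from statistics import median


def find_spikes(daily_cost, window=14, threshold=3.5, min_mad=1.0):
    """Return indices of days that deviate strongly from the trailing window.

    Score used: 0.6745 * (value - median) / MAD, the modified z-score.
    `min_mad` (an assumed floor, in currency units) avoids dividing by zero
    when the baseline is perfectly flat.
    """
    spikes = []
    for i in range(window, len(daily_cost)):
        baseline = daily_cost[i - window:i]
        med = median(baseline)
        mad = median([abs(v - med) for v in baseline])
        score = 0.6745 * (daily_cost[i] - med) / max(mad, min_mad)
        # abs() flags both sudden jumps and sudden drops in spend or usage.
        if abs(score) > threshold:
            spikes.append(i)
    return spikes
```

The same function can be pointed at utilization metrics instead of cost, where a sustained drop is often the more interesting signal because it hints at an idle resource.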
Recommending Optimal Configurations
- Train models on historical data of past resource sizing decisions and resulting utilization.
- For new apps or use cases, the model recommends an instance type, database size, etc. based on predicted usage.
- Recommendations can optimize for cost, performance, or a balance of both.
- Continuously improve models by incorporating new data on size choices and impact.
- Helps eliminate manual guesswork in configuring resources.
- Cloud providers offer ML-powered recommendation services like AWS Compute Optimizer.
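The final selection step in a recommendation flow like the one above can be sketched as a simple rule: given predicted peak demand, pick the cheapest option that still fits with some headroom. In this sketch the predicted peaks are assumed to come from a separate model, and the catalog, instance names, and prices are entirely hypothetical placeholders rather than real cloud offerings or quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceOption:
    name: str
    vcpus: int
    memory_gib: float
    hourly_usd: float  # illustrative number, not a real price


# Hypothetical catalog used only for this example.
CATALOG = [
    InstanceOption("small", 2, 4, 0.05),
    InstanceOption("medium", 2, 8, 0.10),
    InstanceOption("large", 4, 16, 0.20),
    InstanceOption("xlarge", 8, 32, 0.40),
]


def recommend(peak_vcpus, peak_memory_gib, headroom=0.2, catalog=CATALOG):
    """Return the cheapest option covering predicted peaks plus headroom.

    Returns None when nothing in the catalog is large enough.
    """
    need_cpu = peak_vcpus * (1 + headroom)
    need_mem = peak_memory_gib * (1 + headroom)
    fits = [
        option for option in catalog
        if option.vcpus >= need_cpu and option.memory_gib >= need_mem
    ]
    return min(fits, key=lambda option: option.hourly_usd) if fits else None
```

This version optimizes for cost only. Balancing cost against performance would mean replacing the `min` key with a score that also rewards spare capacity.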
Intelligent Reserved Instance Planning
- Analyze usage trends, cyclical demand patterns, and growth rates using time series models.
- Project future resource needs based on product roadmaps and timelines.
- Recommend RI purchase quantity, term length, and renewal timing based on forecasts.
- Optimize the trade-off between RI discount savings and unused capacity risk.
- Services like AWS Cost Explorer provide RI recommendations powered by ML algorithms.
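The trade-off between discount savings and unused capacity can be worked through with simple arithmetic. The sketch below assumes a forecast of how many instances are needed in each hour, a flat on-demand rate, and a flat effective hourly rate for reserved capacity that is paid whether or not it is used. The function names and rates are assumptions for the example, and real RI pricing has more dimensions (term, payment option, instance family).

```python
def total_cost(demand, reserved, on_demand_rate, reserved_rate):
    """Cost of covering `demand` (instances needed per hour) with `reserved` RIs.

    Reserved capacity is paid for every hour; demand above it is bought on demand.
    """
    committed = reserved * reserved_rate * len(demand)
    overflow = sum(max(0, d - reserved) for d in demand) * on_demand_rate
    return committed + overflow


def best_reservation(demand, on_demand_rate, reserved_rate):
    """Try every reservation count from 0 to peak demand and keep the cheapest."""
    candidates = range(0, max(demand) + 1)
    return min(
        candidates,
        key=lambda k: total_cost(demand, k, on_demand_rate, reserved_rate),
    )
```

Under these assumptions, each additional reservation pays off only if it would be in use for a larger share of hours than the ratio `reserved_rate / on_demand_rate`. With a 40% discount, for example, a reservation needs to be busy more than 60% of the time to beat on-demand pricing.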
How this can be deployed
We can use many of the native ML services on AWS and Azure to reduce manual intervention: automatically deleting unused servers, scheduling shutdowns, and right-sizing virtual machines, without sending each routine change through design reviews and the change management process.
AWS
- Use Amazon Forecast's pre-built algorithms for forecasting future EC2 and RDS capacity needs. Input relevant time series data like past CPU utilization, network I/O, and memory consumption as predictors. Amazon Forecast will analyze trends and seasonal patterns in the data to generate forecasts.
- In AWS Cost Explorer, enable the anomaly detection feature. This will automatically monitor your spend data across services and resources to identify unusual variances or spikes compared to past trends. Review the anomaly detection notifications and drill into specifics on which services or resources are contributing to the unusual spend.
- Build custom forecasting models in Amazon SageMaker using algorithms like DeepAR (an autoregressive recurrent neural network; DeepAR+ is the variant offered in Amazon Forecast) or Prophet (the open source time series forecasting library from Meta, formerly Facebook) on granular time series billing, usage, and metrics data from AWS Cost and Usage Reports. Train the models to predict expected resource requirements for EC2, RDS, Lambda, etc. for future periods.
- Create Lambda functions that are triggered by CloudWatch alarms when utilization metrics like CPU usage breach certain thresholds. The Lambda function can then take actions like automatically right-sizing EC2 instances or stopping/starting RDS databases based on the usage predictions and recommendations generated by your machine learning models.
- Use Amazon Athena to query and analyze granular cost and usage data stored in S3, filtering and aggregating by tags. Visualize the Athena query results in QuickSight dashboards showing spend trends and breakdowns by resource types, applications, departments etc. This provides allocation and showback of costs.
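The allocation and showback step in the last bullet boils down to grouping cost line items by a tag and keeping untagged spend visible. The sketch below shows that aggregation in plain Python on a handful of records. The record layout (`cost_usd`, `tags`) and the `team` tag key are assumptions for illustration and do not match the actual Cost and Usage Report schema; in practice the same grouping would be written as an Athena query.

```python
from collections import defaultdict


def showback(line_items, tag_key="team"):
    """Sum cost per value of `tag_key`, largest first.

    Items without the tag are grouped under "untagged" so the gap in
    tagging coverage shows up in the report instead of disappearing.
    """
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key) or "untagged"
        totals[owner] += item["cost_usd"]
    # Dicts preserve insertion order, so the result stays sorted by cost.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

A large "untagged" bucket is usually the first finding from a report like this, and it gives a measurable target for improving tagging standards.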
Azure
- Apply Azure Machine Learning's AutoML time series forecasting capabilities using the TCNForecaster algorithm on resource usage telemetry and metrics from Azure Monitor logs. This will automatically analyze trends, cycles, and seasonal patterns in the historical data to predict future capacity requirements for VMs, SQL DBs etc.
- Ingest granular Azure billing data into Azure Log Analytics workspaces. Build anomaly detection machine learning models using the Log Analytics API to analyze the data and identify outliers or unusual spikes in spend compared to historical trends.
- Use the Anomaly Detector API from Azure Cognitive Services on time series cost and usage data. The API will detect anomalies across multiple dimensions like resource types, environments, departments etc. flagged by tags and metadata.
- Review and consume Azure Advisor recommendations on optimizing spend through actions like right-sizing underutilized resources, purchasing reserved instances, and adjusting resource configurations. Implement the recommendations programmatically at scale using the Azure Advisor REST API.
- Create Azure Policy definitions with "DeployIfNotExists" effects that enforce machine learning recommended configurations like VM SKU sizes and SQL database sizing for cost optimization when new resources are deployed.
The Future with Generative AI
The capabilities of ML and automation for FinOps will expand even further with future advancements in generative AI. Some potential areas where generative models could provide additional benefits:
- Natural language processing of product requirements documents, code comments, and architecture designs could allow AI systems to automatically extract cost optimization considerations and suggest improvements early in the development lifecycle.
- Large language models like GPT-3 could potentially generate optimized cloud infrastructure configurations and architecture designs for new applications and use cases based on a few prompts describing required performance, scalability, and cost guardrails.
- Chatbots and digital assistants leveraging generative AI could provide real-time FinOps advice and recommendations customized to an engineer's specific workload and context when provisioning resources or diagnosing issues.
- Generative AI could help rapidly analyze huge volumes of granular cost data, draw causal insights, identify optimization hypotheses, and generate reports in natural language.
- Testing and simulating potential optimization strategies to model their impact could be accelerated by orders of magnitude using AI simulation systems.
In summary, generative AI has the potential to take cloud cost management and FinOps practices to new levels in terms of automation, customization, and intelligence augmentation. Organizations should follow developments in this emerging space closely to identify new opportunities for efficiency gains. But as always, balancing innovation with rigorous testing and governance will be key.
Helping customers with their cyber security
1 year ago
This is all good stuff. I believe that there are three main stages when looking at optimisation of cloud costs:
1. Cloud Native optimisation: This includes using the features and recommendations available natively in the specified cloud. This should deliver a level of optimisation, but will likely leave a large scope for further optimisation.
2. Cloud Additive optimisation: Using tools available in the market to provide additional optimisation; this could include solutions focussed on specific areas, e.g. compute or storage. At this stage it is wise to sense check the methodologies that you are following and explore whether there are alternative ways of provisioning in the cloud that deliver cost savings whilst balancing and minimising any administrative or operational overhead that the saving incurs.
3. Inefficient optimisation: This is realising the point at which further optimisation creates additional cost elsewhere, so any gains achieved are likely negated.
I think that using ML or GenAI models at this stage ventures into risky territory when you consider the impact that an incorrect recommendation could have. I feel a lot safer using industry tools that are independent of the cloud in their place for the time being.
CIO Advisory Cloud Consultant
1 year ago
Great article summarising a very large topic that spans across organisations and has the potential, with the right approach from the best teams, to reap great benefits. A couple of the biggest challenges for organisations are defining an achievable scope and how to change in this digital revolution.
CTO | GenAI Pioneer | AWS & Azure Expert Transforming Enterprises with GenAI, Cloud Migration, and Innovation | CIO/CTO Advisor | ex-IBM, Accenture,
1 year ago
Rakhi Gupta, who is also leading the FinOps capability globally