From Thresholds to Impact: The Future of Observability with AI and ML

In today’s complex engineering landscapes, observability is not merely about tracking metrics but about understanding the larger patterns that affect system performance. Traditional methods such as static rules and thresholds are increasingly inefficient, especially as systems scale and require constant tuning. Machine learning (ML) offers a more dynamic alternative, yet both approaches share the same underlying weakness: they require continuous tuning, a process that is time-consuming, reactive, and often fails to prioritize the real impact of system anomalies.

This article explores how shifting from granular, metric-level monitoring to a 10,000-foot view using multivariate models can provide deeper insights. We’ll also consider the cost implications of these methods, and how automation can significantly scale the capacity of engineering teams without a proportional increase in cost.

The Static Rules Approach: Quick Fix, Long-Term Pain

Historically, static thresholds have been the go-to solution for observability. Teams set pre-defined thresholds—such as CPU usage, memory consumption, or response times—and trigger alerts whenever these thresholds are crossed. This straightforward system offers clear advantages in simple, stable environments but quickly becomes unmanageable as system complexity increases.

For example, an organization might set a static threshold where an alert is triggered if CPU usage exceeds 80%. The problem arises when this threshold doesn’t account for variable conditions, such as peak versus off-peak traffic. This results in:

  1. False Positives: Triggering alerts even when no action is necessary.
  2. Manual Effort: Teams spend countless hours fine-tuning static thresholds to reflect new system behaviors, especially as environments evolve.
  3. Siloed Insights: Each metric is treated in isolation, which doesn't give you the full picture of the system’s overall health.
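
To make the problem concrete, here is a minimal sketch of a static-threshold check (the metric names and limits are illustrative, not a recommendation). The rule is trivially simple, which is exactly the issue: it has no notion of context, so the same reading fires an alert at peak and off-peak alike.

```python
# Minimal static-threshold check. The metric names and limits are illustrative;
# a real system would load them from configuration.
STATIC_THRESHOLDS = {
    "cpu_percent": 80.0,
    "memory_percent": 90.0,
    "p95_latency_ms": 500.0,
}

def check_static_thresholds(sample: dict) -> list:
    """Return an alert message for every metric that crosses its fixed limit."""
    alerts = []
    for metric, limit in STATIC_THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value:.1f} exceeded static threshold {limit:.1f}")
    return alerts

# The same 85% CPU reading fires during a planned batch job and during a real
# outage alike; the rule cannot tell the two apart.
print(check_static_thresholds({"cpu_percent": 85.0, "memory_percent": 40.0}))
```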

Machine Learning: The Promise of Dynamism, But Not Without Challenges

Machine learning models offer a more flexible and adaptive way of managing observability. Instead of rigidly defining thresholds, these models learn from historical data to establish baselines and dynamically adjust alerting rules.

For instance, an ML model might learn that CPU usage fluctuates between 60% and 90% depending on traffic patterns, and it will alert only when usage deviates significantly from this learned behavior. However, ML models bring their own set of challenges:

  1. Continuous Tuning: Like static thresholds, ML models also need retraining and adjustments to stay relevant. This process, while more intelligent, is still resource-intensive.
  2. Noisy Signals: ML models can still generate a lot of noise—alerts that don’t point to critical issues, but rather minor anomalies that don’t need immediate attention.
  3. Complexity and Trust: The "black box" nature of many ML models means engineers often struggle to understand why a specific alert was triggered, making it hard to trust the system.
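
Staying with the CPU example above, here is a minimal sketch of a learned, dynamic baseline. It uses a simple rolling mean and standard deviation rather than any particular vendor’s model; production ML detectors are far more sophisticated, but the contrast with a fixed 80% line is the same.

```python
import numpy as np

def dynamic_baseline_alert(history: np.ndarray, current: float, z_limit: float = 3.0) -> bool:
    """Flag 'current' only when it deviates strongly from the learned baseline.

    'history' is a window of recent observations for a single metric; the
    baseline is its mean and standard deviation rather than a hand-set limit.
    """
    mean, std = history.mean(), history.std()
    if std == 0:
        return False  # flat history: nothing meaningful to compare against
    return abs(current - mean) / std > z_limit

# CPU that normally hovers around 75% should not alert at 85%,
# but a sudden reading near 99% should.
rng = np.random.default_rng(0)
cpu_history = rng.normal(75, 5, size=500)          # learned "normal" behavior
print(dynamic_baseline_alert(cpu_history, 85.0))   # False: within learned range
print(dynamic_baseline_alert(cpu_history, 99.0))   # True: significant deviation
```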

Why Constant Tuning Is Inefficient

Whether you’re using static thresholds or machine learning, the common thread is the constant need for manual tuning. Both approaches rely on reacting to individual metric deviations, without considering the larger impact or relationships between metrics. This process is inefficient because it’s:

  • Time-consuming: Engineers spend countless hours adjusting thresholds or retraining ML models.
  • Reactive: Teams often find themselves in a cycle of reacting to anomalies without addressing the root cause.
  • Micro-focused: The focus remains on individual metrics rather than a broader, more meaningful view of system health.

The 10,000-Foot Solution: Correlating and Clustering Impact

A more effective approach involves moving beyond the constant fine-tuning of thresholds and instead using a 10,000-foot view to focus on systemic impact. This method takes threshold breaches as inputs, measures their impact, and clusters similar breaches for deeper analysis.

  1. Extract Impact Level: Rather than treating every breach as critical, the system should first assess the broader impact—is this breach causing real user degradation, financial loss, or downtime?
  2. Correlate Signals: Instead of isolating CPU usage, memory, or latency, correlate multiple metrics to understand how they interact. For instance, high CPU usage might coincide with increased latency and a drop in transactions, revealing a more significant issue.
  3. Cluster Based on Impact: Use machine learning to cluster correlated anomalies, as sketched below. This helps engineering teams focus on groups of related issues, minimizing alert fatigue and prioritizing critical events.
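
Here is a hedged sketch of steps 2 and 3, using DBSCAN from scikit-learn to group threshold breaches that occur close together in time and carry similar impact. The feature choice and the crude impact score are assumptions made for illustration, not a prescribed pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Each threshold breach is described by when it happened and how much
# user-facing impact it carried. The exact features are an assumption.
breaches = np.array([
    # [minutes_since_midnight, error_rate_delta, p95_latency_delta_ms]
    [600,  0.020, 120],   # morning incident: errors and latency move together
    [602,  0.030, 140],
    [605,  0.025, 130],
    [1200, 0.000,   5],   # isolated, low-impact blip in the evening
])

# Scale features to roughly comparable ranges before clustering.
scaled = breaches / np.array([60.0, 0.01, 50.0])
labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(scaled)

for label in sorted(set(labels)):
    group = breaches[labels == label]
    impact = group[:, 1].sum()  # crude impact score: total error-rate delta
    tag = "isolated breach(es)" if label == -1 else f"cluster {label}"
    print(f"{tag}: {len(group)} breach(es), impact score {impact:.3f}")
```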

Multivariate Models: Correlating and Explaining Signals

Here’s where multivariate models shine. Unlike traditional observability tools that focus on individual metrics, multivariate models analyze multiple metrics simultaneously to understand their relationships. For example:

  • A sudden spike in network traffic might trigger a rise in CPU usage, which in turn affects response times. A multivariate model would detect these relationships, triggering an alert only when the combined deviation from expected behavior signals a real issue. A minimal sketch of this idea follows below.
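
One simple way to express this in code (an illustrative sketch using the Mahalanobis distance, not any specific product’s method) is to treat several metrics as a single vector and score how unusual their combination is, rather than checking each metric on its own.

```python
import numpy as np

def multivariate_anomaly_score(history: np.ndarray, current: np.ndarray) -> float:
    """Mahalanobis distance of 'current' from the joint behavior in 'history'.

    history has shape (n_samples, n_metrics), e.g. columns for network traffic,
    CPU usage and response time. A large score means the combination of values
    is unusual, even if each metric on its own looks acceptable.
    """
    mean = history.mean(axis=0)
    cov = np.cov(history, rowvar=False)
    inv_cov = np.linalg.pinv(cov)   # pseudo-inverse guards against singular covariance
    delta = current - mean
    return float(np.sqrt(delta @ inv_cov @ delta))

# Synthetic history in which traffic, CPU and latency normally move together.
rng = np.random.default_rng(1)
traffic = rng.normal(100, 10, 1000)
cpu = 0.6 * traffic + rng.normal(0, 3, 1000)
latency = 0.5 * cpu + rng.normal(0, 2, 1000)
history = np.column_stack([traffic, cpu, latency])

# High CPU alongside high traffic is expected; high CPU with low traffic is not.
print(multivariate_anomaly_score(history, np.array([130.0, 80.0, 42.0])))  # modest score
print(multivariate_anomaly_score(history, np.array([80.0, 80.0, 42.0])))   # much larger score
```

The second probe scores far higher than the first even though the CPU value is identical, because high CPU without the traffic that usually explains it is what breaks the learned relationship.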

The benefits of using multivariate models include:

  • Rich Insights: By analyzing multiple metrics, these models provide deeper, more contextual alerts. This reduces noise and highlights critical, system-wide issues.
  • Clustering: Multivariate models can automatically cluster related anomalies, making it easier for teams to identify the root cause of systemic problems.
  • Explainability: When an anomaly is detected, multivariate models can explain why it was triggered by showing the relationships between metrics. This builds trust in the system and speeds up resolution.

The Cost of Observability: Scaling Without Scaling Costs

Observability doesn’t come without costs. Let’s break down a typical scenario:

  • A single engineer, at a cost of $50/hour, can eyeball 4-5 dashboards, each with 20-30 panels. These dashboards might represent key performance metrics such as CPU, memory, latency, and network traffic. By scanning these panels, an experienced engineer can spot anomalies, but it is a time-consuming process that limits scalability.

Now, let’s compare that with an automated system leveraging thresholds and ML anomaly detection:

  • Scalability: The machine can monitor 10x more metrics in real time than a human engineer. This means one machine can effectively replace multiple engineers in terms of monitoring capacity.
  • Efficiency: The system automatically flags anomalies, which are then reviewed by engineers only when necessary. Instead of continuously scanning metrics, engineers are only alerted when there is something worth investigating, drastically reducing manual effort.
  • Real-World Example: If an engineer working at $50/hour can effectively monitor 5 dashboards with 30 panels, that’s about 150 metrics per hour. With automation, the system could monitor 10x that (about 1,500 metrics per hour) while ensuring every anomaly is flagged and analyzed. Engineers only need to step in when a flagged anomaly correlates with real degradation, saving both time and costs. A back-of-the-envelope version of this calculation follows below.
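
The $50/hour rate and the 10x multiplier come from the scenario above; the fleet-wide metric count and the triage-time estimate in the sketch below are assumptions for illustration only.

```python
# Back-of-the-envelope comparison using the figures from the scenario above.
# The fleet-wide metric count and the triage-time estimate are assumptions.
HOURLY_RATE = 50.0                          # engineer cost per hour
MANUAL_METRICS_PER_HOUR = 5 * 30            # 5 dashboards x 30 panels = 150 metrics
AUTOMATED_METRICS_PER_HOUR = MANUAL_METRICS_PER_HOUR * 10   # roughly 1,500 with automation

target_metrics = 3000                       # assumed number of metrics to keep watched

# Manual monitoring: engineer-hours of eyeballing needed per hour of wall-clock time.
manual_hours = target_metrics / MANUAL_METRICS_PER_HOUR     # 20 engineer-hours
manual_cost = manual_hours * HOURLY_RATE                    # $1,000 per hour

# Automated monitoring: the system watches everything; engineers only triage
# flagged anomalies (assume roughly half an engineer-hour of review per hour).
triage_hours = 0.5
automated_cost = triage_hours * HOURLY_RATE                 # $25 per hour

print(f"Coverage per automated pipeline: {AUTOMATED_METRICS_PER_HOUR} metrics/hour")
print(f"Manual:    {manual_hours:.0f} engineer-hours/hour -> ${manual_cost:.0f}/hour")
print(f"Automated: {triage_hours} engineer-hours/hour of triage -> ${automated_cost:.0f}/hour")
```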

In this model, automation amplifies the engineer’s capacity, cutting costs while improving system performance and reducing the chances of missing critical incidents.

Reinforcement Learning and Feedback Loops

The next evolution of observability involves integrating feedback loops and reinforcement learning models into the monitoring system. Feedback from engineers is invaluable in improving the accuracy of alerts and optimizing monitoring systems over time. This feedback can take several forms:

  1. False Positive: The system flagged an anomaly, but there was no real impact. By marking these cases, engineers can train the system to avoid similar false positives in the future.
  2. Anomaly but No Impact: While the anomaly is real, it didn’t lead to any significant degradation or system failure. This helps the system understand which types of anomalies are worth flagging and which are less critical.
  3. Anomaly with Impact: In this case, the anomaly detected by the system did lead to significant degradation or downtime. This feedback reinforces the system’s decision-making, helping it improve its ability to detect critical issues.

Over time, this feedback loop allows the monitoring system to auto-tune itself, adjusting thresholds and anomaly detection mechanisms based on real-world outcomes. This reduces the need for manual tuning and allows the system to become smarter and more efficient with each iteration.
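
Here is a minimal sketch of what such a self-tuning loop might look like. The three feedback labels mirror the list above, while the update rule itself is a deliberately simple stand-in for a real reinforcement-learning or bandit formulation.

```python
from enum import Enum

class Feedback(Enum):
    FALSE_POSITIVE = "false_positive"            # flagged, but no real anomaly
    ANOMALY_NO_IMPACT = "anomaly_no_impact"      # real anomaly, no degradation
    ANOMALY_WITH_IMPACT = "anomaly_with_impact"  # real anomaly, real degradation

class AutoTuningDetector:
    """Toy detector whose sensitivity drifts with engineer feedback."""

    def __init__(self, z_limit: float = 3.0):
        self.z_limit = z_limit  # how many standard deviations count as anomalous

    def record_feedback(self, feedback: Feedback) -> None:
        # Simplified update rule: loosen after noise, tighten after confirmed impact.
        if feedback is Feedback.FALSE_POSITIVE:
            self.z_limit = min(self.z_limit + 0.10, 6.0)   # become less sensitive
        elif feedback is Feedback.ANOMALY_NO_IMPACT:
            self.z_limit = min(self.z_limit + 0.05, 6.0)   # slightly less sensitive
        elif feedback is Feedback.ANOMALY_WITH_IMPACT:
            self.z_limit = max(self.z_limit - 0.20, 1.0)   # become more sensitive

detector = AutoTuningDetector()
for fb in [Feedback.FALSE_POSITIVE, Feedback.FALSE_POSITIVE, Feedback.ANOMALY_WITH_IMPACT]:
    detector.record_feedback(fb)
print(f"Adjusted sensitivity (z-limit): {detector.z_limit:.2f}")
```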

The Role of False Negatives in Post-Mortems

False negatives—cases where an anomaly was not detected but did lead to system degradation—are just as important as false positives. These often reveal blind spots in the monitoring system and provide critical data during post-mortems. By analyzing false negatives, engineers can:

  • Identify Gaps in Coverage: Understand which metrics or patterns were missed by the system and why.
  • Improve Model Training: Use the missed data points to retrain machine learning models, improving their ability to detect similar issues in the future (see the sketch after this list).
  • Enhance System Reliability: Learning from false negatives reduces the likelihood of repeat issues, strengthening overall system reliability.
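
As a small, assumed example of the second point, every false negative surfaced in a post-mortem can be written back into a labeled dataset so the next training run treats it as a known miss. The file format and field names below are illustrative.

```python
import json
from datetime import datetime, timezone

def log_false_negative(store_path: str, metric_window: list, incident_id: str) -> None:
    """Append a missed incident's metric window to a labeled dataset for retraining.

    The JSON-lines format and field names are illustrative; the point is that
    every false negative from a post-mortem becomes a labeled positive example
    the next time the detection model is trained.
    """
    record = {
        "incident_id": incident_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "label": "anomaly_with_impact",   # ground truth established in the post-mortem
        "metrics": metric_window,
    }
    with open(store_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Example: the CPU readings around an incident the system failed to flag.
log_false_negative("false_negatives.jsonl", [72.0, 74.5, 91.0, 97.5, 96.0], "INC-2042")
```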

Final Thoughts: A Holistic Approach to Observability

The future of observability lies in integrating static thresholds, machine learning, multivariate models, and reinforcement learning into a cohesive, impact-driven system. By continuously learning from feedback—whether it’s false positives, anomalies with no impact, or critical system failures—observability systems can become smarter, more efficient, and more aligned with real-world needs.

Incorporating feedback loops and reinforcement learning models into observability tools allows teams to focus on what truly matters: reducing noise, improving accuracy, and enhancing system reliability. At the same time, automation dramatically reduces the costs associated with manual monitoring, allowing teams to scale without scaling costs.

#Observability #MachineLearning #ReinforcementLearning #MultivariateModels #FeedbackLoops #FalsePositives #FalseNegatives #PostMortems #ImpactDriven #Automation #CostEfficiency #Clustering #Explainability #OperationalExcellence #DevOps #SRE #TechInnovation

