Our eBook on ‘Rethinking Anomaly Detection’

Anomaly detection has become a critical part of observability for modern cloud-native and microservice applications. Unfortunately, existing legacy approaches create a lot of false alerts, distracting DevOps/SRE teams and increasing MTTR. We at OpsCruise believed a more effective, application-aware approach was needed, so we built one and put it to the test. While a more detailed paper on our study is now available, here is a summary of our anomaly detection approach and results on its efficacy from the field.

You can get the full eBook here.

Limitations of Current Anomaly Detection

Anomaly detection has been in use since long before cloud and microservices. So it is not surprising that existing detection approaches fall short of the challenges posed by the scale, complex dependencies, and dynamic nature of cloud-native applications [1].

Today, anomaly detection falls into two broad categories: manually setting thresholds on a metric, or using a statistical or ML-based technique to detect outliers. Unfortunately, both have significant drawbacks.

Take manual setting of thresholds, for example. One guesses an upper limit for an application's response time or CPU utilization based on past history, without knowing the maximum expected request rates. In cloud applications, because workloads are not known ahead of time, when a threshold is repeatedly breached, Ops typically loosens the threshold to reduce the false positives, but then risks false negatives.

Using outlier detection on a set of metrics can reduce alert noise and manual tuning effort, but it faces other challenges: how do you know which metrics capture the correct baseline for the application across different load conditions? And an unusual metric value, such as higher response latency caused by a higher request rate, does not necessarily mean the application is misbehaving.
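To make the problem concrete, here is a small synthetic sketch (the numbers and the simple z-score check are purely illustrative, not part of any product): an outlier check on latency alone flags a legitimate high-load window, while a load-aware view of the same data does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-minute telemetry: latency grows roughly linearly with load,
# and the last 20 minutes see a legitimate surge in request rate.
request_rate = np.concatenate([rng.normal(100, 10, 500), rng.normal(600, 20, 20)])
latency_ms = 0.4 * request_rate + rng.normal(0, 3, request_rate.size)

# Outlier check on latency alone: the high-load window is flagged as anomalous...
z = (latency_ms - latency_ms.mean()) / latency_ms.std()
print("latency-only alerts:", int((np.abs(z) > 3).sum()))

# ...but latency per unit of load stays in its usual range, so a load-aware
# view of the very same data raises few or no alerts.
per_request = latency_ms / request_rate
z_aware = (per_request - per_request.mean()) / per_request.std()
print("load-aware alerts:", int((np.abs(z_aware) > 3).sum()))
```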

We realized that application-agnostic anomaly detection, with no understanding of the application and how it is supposed to work, throws a very high volume of false alerts. More importantly, when the detection process does not indicate how the anomaly relates to the problem source, isolating the cause is much more difficult. Remediation gets pushed off to a later war room, with skilled, expensive DevOps resources manually resolving the problem.

The OpsCruise Approach

We believe a microservices architecture requires an application-aware, model-based approach that combines operational knowledge of the application stack with ML. This means:

  • Embedding Application Knowledge: using a curated object model template for any service within the application. Adding this context provides an understanding of the why and what behind the model’s predictions.
  • Applying Heuristics and Common Sense: this is embedded in rules to check and analyze problems, such as detecting “noisy neighbors” or suppressing alerts that result from very small changes (see the sketch below).
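
As a simple illustration of the second point, a rule for suppressing immaterial changes might look like the hypothetical sketch below (the `suppress_small_changes` helper and the 5% cutoff are our own illustrative choices, not OpsCruise code):

```python
# Illustrative heuristic: drop alerts whose metric moved by less than 5%
# relative to its learned baseline, since tiny deviations rarely indicate
# real problems even when a statistical model scores them as anomalous.
MIN_RELATIVE_CHANGE = 0.05

def suppress_small_changes(alerts, baselines):
    """Keep only alerts whose metric deviates materially from its baseline.

    alerts    -- list of dicts like {"metric": "cpu_util", "value": 0.81}
    baselines -- dict of learned baseline values, e.g. {"cpu_util": 0.78}
    """
    kept = []
    for alert in alerts:
        baseline = baselines.get(alert["metric"])
        if baseline is None or baseline == 0:
            kept.append(alert)            # no baseline available: don't suppress
            continue
        change = abs(alert["value"] - baseline) / abs(baseline)
        if change >= MIN_RELATIVE_CHANGE:
            kept.append(alert)
    return kept

print(suppress_small_changes(
    [{"metric": "cpu_util", "value": 0.80}, {"metric": "latency_ms", "value": 95}],
    {"cpu_util": 0.79, "latency_ms": 60},
))
# -> only the latency alert survives; the ~1% CPU change is suppressed
```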

Anomalies are detected as deviations from a model that has learned the service’s correct or normal behavior from continually collected data; the model is updated periodically, e.g., daily or at a shorter interval.
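
As a rough sketch of this loop, and only a sketch — our behavior models are knowledge-augmented rather than the generic detector used here — the pattern of periodically refitting on recent data and scoring new samples looks roughly like this (scikit-learn’s `IsolationForest` is a stand-in):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def train_behavior_model(history):
    """Fit a detector on a recent window of presumed-normal metric samples.

    history -- array of shape (n_samples, n_metrics)
    """
    return IsolationForest(contamination=0.01, random_state=0).fit(history)

def is_anomalous(model, sample):
    """Return True if a metric vector deviates from the learned behavior."""
    return model.predict(np.asarray(sample).reshape(1, -1))[0] == -1

# Periodic retraining loop, sketched for one iteration: refit on the latest
# window (e.g. one day of per-minute samples), then score incoming samples.
history = np.random.default_rng(0).normal(size=(1440, 12))
model = train_behavior_model(history)
print(is_anomalous(model, history.mean(axis=0)))        # typical sample -> False
print(is_anomalous(model, history.mean(axis=0) + 8.0))  # large deviation -> True
```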


Figure 1: Anomaly detection process using learned behavior models

Here are some key features of the model and how it is used:

  • Curated Templates: we include most available metrics for the service, dropping those we know won’t affect operation (e.g., the file location of data), so there is no predefined bias in the model’s metrics
  • Find ‘unknown unknowns’: use ML to discover the metrics that drive the behavior of the application (see the sketch after this list)
  • Learn continuously: as a service faces different operating conditions, we keep learning; we also give Ops an optional binary feedback mechanism to speed up learning when an alert is a false positive
  • Explanations: generate contextual explanations to help the subsequent root cause analysis step [2]
  • Scale: ensure it works at scale, i.e., models for 1000s of containers can be updated in minutes
  • Reducing False Alerts: this covers both false positives and false negatives; while we suppress low-level alerts using heuristics, our predictive model errs on the side of not missing a lurking problem
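
To illustrate the ‘unknown unknowns’ item above, here is a minimal sketch of one generic way to rank which metrics drive a service’s behavior; mutual information via scikit-learn is a stand-in here and not necessarily the technique we use, and the metric names and data are synthetic:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)

# Hypothetical service telemetry: 6 candidate metrics, one target (latency).
n = 2000
metrics = {
    "request_rate": rng.normal(100, 15, n),
    "queue_depth":  rng.normal(20, 5, n),
    "gc_pause_ms":  rng.normal(5, 1, n),
    "cache_hits":   rng.normal(0.9, 0.02, n),
    "disk_iops":    rng.normal(300, 40, n),
    "thread_count": rng.normal(50, 3, n),
}
# In this synthetic example latency is driven mainly by request rate and queue depth.
latency = 0.5 * metrics["request_rate"] + 2.0 * metrics["queue_depth"] + rng.normal(0, 2, n)

X = np.column_stack(list(metrics.values()))
scores = mutual_info_regression(X, latency, random_state=1)
for name, score in sorted(zip(metrics, scores), key=lambda kv: -kv[1]):
    print(f"{name:>13s}: {score:.3f}")
# request_rate and queue_depth should rank at the top, surfacing the metrics
# that actually drive behavior without predefining them.
```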

Results

To validate our approach, we collected empirical data from a number of deployed production environments where we had no control over how the applications were designed or run. We took this route because there is no standard benchmark for anomaly detection, even though there are open-source sample microservices applications that can serve as a sandbox for experimenting with monitoring.

While there are more examples and details in the eBook, here we present a summary from a serverless microservice application: detecting anomalies in a Kinesis-Lambda subsystem.


Figure 2: Number of anomalies detected by the Behavior Model versus Dynamic Thresholding for a Serverless pipeline (AWS Kinesis - Lambda) over 8 days

The model included 12 metrics, including Execution Time and Number of Invocations (from Lambda). For the dynamic-thresholding comparison, we chose 5 metrics and applied the well-known Tukey 1.5×IQR rule. As Figure 2 shows, the behavior model detected 89% fewer anomalies than the threshold-based approach over the 8-day period, and the number of alerts dropped significantly after the second day.
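
For reference, the Tukey 1.5×IQR rule flags any point outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch of a rolling (“dynamic”) version of that baseline might look like this; the window size and data are illustrative rather than the exact configuration used in the study:

```python
import numpy as np

def tukey_bounds(window):
    """Return the Tukey 1.5*IQR fences for a window of metric samples."""
    q1, q3 = np.percentile(window, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def dynamic_threshold_alerts(series, window_size=60):
    """Flag points that fall outside the fences of the preceding window."""
    alerts = []
    for i in range(window_size, len(series)):
        lo, hi = tukey_bounds(series[i - window_size:i])
        if not lo <= series[i] <= hi:
            alerts.append(i)
    return alerts

# Illustrative series: mostly stable invocation counts with one final spike.
series = np.r_[np.random.default_rng(2).normal(200, 10, 200), [400]]
print(dynamic_threshold_alerts(series))
# -> the spike (index 200) is flagged, possibly along with a stray noise point
```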

To test how well this generalizes, we also evaluated a number of Kubernetes application containers. Here the number of model metrics is much larger, nearly 30, while dynamic thresholds were applied to just five metrics such as Response Time (latency) and CPU utilization. The model-based approach generated 55% fewer alerts than the threshold-based approach over a 6-day period, and the number of alerts again decreased rapidly over time.

We have mentioned that avoiding false negatives is not easy, especially when previously unseen data ranges are encountered. It is in these cases that our approach is most effective. In one container, the behavior model flagged 100 data points as anomalies. On closer inspection it became clear that over a 100-sample interval, request counts and response times were not being received and were recorded as 0, even though bytes and packet-level metrics showed there was incoming traffic. Since response times of 0 cross no high-latency threshold, the threshold-based approach detected no anomalies. The behavior model detected this inconsistency between demand and response metrics and correctly marked those points as valid anomalies, avoiding 100 false negatives.
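
A simplified way to picture that kind of demand/response consistency check is sketched below; the function name, metric keys, and byte threshold are illustrative, and in practice the behavior model learns this relationship from data rather than applying an explicit rule:

```python
def demand_response_inconsistent(sample, min_bytes_in=1024):
    """Flag samples where the network shows incoming traffic but the
    application-level metrics report no requests and no response time.

    sample -- dict with keys 'bytes_in', 'request_count', 'response_time_ms'
    """
    traffic_present = sample["bytes_in"] >= min_bytes_in
    no_app_activity = sample["request_count"] == 0 and sample["response_time_ms"] == 0
    return traffic_present and no_app_activity

# A static latency threshold sees response_time_ms == 0 and stays silent,
# but the demand/response inconsistency is flagged here.
sample = {"bytes_in": 250_000, "request_count": 0, "response_time_ms": 0}
print(demand_response_inconsistent(sample))   # True -> anomaly
```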

Conclusions

Modern cloud-native applications require application-aware context: a knowledge-augmented ML approach that learns the application’s behavior profile to detect, and even predict, problems more reliably. We have applied this approach to real-time anomaly detection on microservice metrics at scale; it has been shown to significantly reduce false alerts across different applications deployed in the field, and it helps with causal isolation and time to problem resolution.

Get the full eBook Here

References

  1. Microservices: An explosion of metrics and few insights?, February 2020.
  2. Of Causality and Reasoning . . . OpsCruise’s Automated Root Cause Analysis, June 2021.
  3. Methods and systems for autonomous cloud application operations, US Patent 11,126,493, OpsCruise Inc., Issued September 21, 2021.


