Our eBook on ‘Rethinking Anomaly Detection’

Anomaly detection has become a critical part of observability for modern cloud-native and microservice applications. Unfortunately, existing legacy approaches create a lot of false alerts, distracting DevOps/SRE teams and increasing MTTR. We at OpsCruise believed a more effective, application-aware approach was needed, so we built one and put it to the test. While a more detailed paper on our study is now available, here is a summary of our anomaly detection approach and results on its efficacy from the field.

You can get the full eBook here.

Limitations of Current Anomaly Detection

Anomaly detection has been in use since long before cloud and microservices. So it is not surprising that existing detection approaches fall short of the challenges posed by the scale, complex dependencies, and dynamic nature of cloud-native applications [1].

Today, anomaly detection falls into two broad categories: manually setting thresholds on a metric, or using a statistical or ML-based technique to detect outliers. Unfortunately, both have significant drawbacks.

Take manual setting of thresholds, for example. One guesses an upper limit for an application's response time or CPU utilization based on past history, without knowing the maximum expected request rates. In cloud applications, because workloads are not known ahead of time, when a threshold is repeatedly breached, Ops typically loosens the threshold to reduce the false positives, but then risks false negatives.

Using outlier detection on a set of metrics can reduce alert noise and manual tuning effort, but it faces other challenges: how do you know which metrics capture the correct baseline for the application across different load conditions? And an unusual metric value, such as higher response latency caused by a higher request rate, does not necessarily mean the application is misbehaving.
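To make the problem concrete, here is a small synthetic sketch (the numbers and the simple z-score check are purely illustrative, not part of any product): an outlier check on latency alone flags a legitimate high-load window, while a load-aware view of the same data does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-minute telemetry: latency grows roughly linearly with load,
# and the last 20 minutes see a legitimate surge in request rate.
request_rate = np.concatenate([rng.normal(100, 10, 500), rng.normal(600, 20, 20)])
latency_ms = 0.4 * request_rate + rng.normal(0, 3, request_rate.size)

# Outlier check on latency alone: the high-load window is flagged as anomalous...
z = (latency_ms - latency_ms.mean()) / latency_ms.std()
print("latency-only alerts:", int((np.abs(z) > 3).sum()))

# ...but latency per unit of load stays in its usual range, so a load-aware
# view of the very same data raises few or no alerts.
per_request = latency_ms / request_rate
z_aware = (per_request - per_request.mean()) / per_request.std()
print("load-aware alerts:", int((np.abs(z_aware) > 3).sum()))
```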

We realized that application-agnostic anomaly detection, with no understanding of the application and how it is supposed to work, throws a very high volume of false alerts. More importantly, when the detection process does not indicate how the anomaly relates to the problem source, isolating the cause is much more difficult. Remediation gets pushed off to a later war room, with skilled, expensive DevOps resources manually resolving the problem.

The OpsCruise Approach

We believe a microservices architecture requires an application-aware, model-based approach that combines operational knowledge of the application stack with ML. This means:

  • Embedding Application Knowledge: using a curated object model template for any service within the application. Adding this context provides an understanding of the why and what behind the model’s predictions.
  • Applying Heuristics and Common Sense: this is embedded in rules to check and analyze problems, such as detecting “noisy neighbors” or suppressing alerts that result from very small changes (see the sketch below).
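
As a simple illustration of the second point, a rule for suppressing immaterial changes might look like the hypothetical sketch below (the `suppress_small_changes` helper and the 5% cutoff are our own illustrative choices, not OpsCruise code):

```python
# Illustrative heuristic: drop alerts whose metric moved by less than 5%
# relative to its learned baseline, since tiny deviations rarely indicate
# real problems even when a statistical model scores them as anomalous.
MIN_RELATIVE_CHANGE = 0.05

def suppress_small_changes(alerts, baselines):
    """Keep only alerts whose metric deviates materially from its baseline.

    alerts    -- list of dicts like {"metric": "cpu_util", "value": 0.81}
    baselines -- dict of learned baseline values, e.g. {"cpu_util": 0.78}
    """
    kept = []
    for alert in alerts:
        baseline = baselines.get(alert["metric"])
        if baseline is None or baseline == 0:
            kept.append(alert)            # no baseline available: don't suppress
            continue
        change = abs(alert["value"] - baseline) / abs(baseline)
        if change >= MIN_RELATIVE_CHANGE:
            kept.append(alert)
    return kept

print(suppress_small_changes(
    [{"metric": "cpu_util", "value": 0.80}, {"metric": "latency_ms", "value": 95}],
    {"cpu_util": 0.79, "latency_ms": 60},
))
# -> only the latency alert survives; the ~1% CPU change is suppressed
```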

Anomalies are detected as deviations from a model that has learned the service’s correct or normal behavior from continually collected data; the model is updated periodically, e.g., daily or at a shorter interval.
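
As a rough sketch of this loop, and only a sketch — our behavior models are knowledge-augmented rather than the generic detector used here — the pattern of periodically refitting on recent data and scoring new samples looks roughly like this (scikit-learn’s `IsolationForest` is a stand-in):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def train_behavior_model(history):
    """Fit a detector on a recent window of presumed-normal metric samples.

    history -- array of shape (n_samples, n_metrics)
    """
    return IsolationForest(contamination=0.01, random_state=0).fit(history)

def is_anomalous(model, sample):
    """Return True if a metric vector deviates from the learned behavior."""
    return model.predict(np.asarray(sample).reshape(1, -1))[0] == -1

# Periodic retraining loop, sketched for one iteration: refit on the latest
# window (e.g. one day of per-minute samples), then score incoming samples.
history = np.random.default_rng(0).normal(size=(1440, 12))
model = train_behavior_model(history)
print(is_anomalous(model, history.mean(axis=0)))        # typical sample -> False
print(is_anomalous(model, history.mean(axis=0) + 8.0))  # large deviation -> True
```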


Figure 1: Anomaly detection process using learned behavior models

Here are some key features of the model and how it is used:

  • Curated Templates: we include most available metrics for the service, dropping those we know won’t affect operation (e.g., the file location of data), so there is no predefined bias in the model’s metrics
  • Find ‘unknown unknowns’: use ML to discover the metrics that drive the behavior of the application (see the sketch after this list)
  • Learn continuously: as a service faces different operating conditions, we keep learning; we also give Ops an optional binary feedback mechanism to speed up learning when an alert is a false positive
  • Explanations: generate contextual explanations to help the subsequent root cause analysis step [2]
  • Scale: ensure it works at scale, i.e., models for 1000s of containers can be updated in minutes
  • Reducing False Alerts: this covers both false positives and false negatives; while we suppress low-level alerts using heuristics, our predictive model errs on the side of not missing a lurking problem
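
To illustrate the ‘unknown unknowns’ item above, here is a minimal sketch of one generic way to rank which metrics drive a service’s behavior; mutual information via scikit-learn is a stand-in here and not necessarily the technique we use, and the metric names and data are synthetic:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)

# Hypothetical service telemetry: 6 candidate metrics, one target (latency).
n = 2000
metrics = {
    "request_rate": rng.normal(100, 15, n),
    "queue_depth":  rng.normal(20, 5, n),
    "gc_pause_ms":  rng.normal(5, 1, n),
    "cache_hits":   rng.normal(0.9, 0.02, n),
    "disk_iops":    rng.normal(300, 40, n),
    "thread_count": rng.normal(50, 3, n),
}
# In this synthetic example latency is driven mainly by request rate and queue depth.
latency = 0.5 * metrics["request_rate"] + 2.0 * metrics["queue_depth"] + rng.normal(0, 2, n)

X = np.column_stack(list(metrics.values()))
scores = mutual_info_regression(X, latency, random_state=1)
for name, score in sorted(zip(metrics, scores), key=lambda kv: -kv[1]):
    print(f"{name:>13s}: {score:.3f}")
# request_rate and queue_depth should rank at the top, surfacing the metrics
# that actually drive behavior without predefining them.
```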

Results

To validate our approach, we collected empirical data from a number of deployed production environments where we had no control over how the applications were designed or run. We took this route because there is no standard benchmark for anomaly detection, even though there are open-source sample microservices applications that can serve as a sandbox for experimenting with monitoring.

While there are more examples and details in the eBook, here we present a summary from a serverless microservice application: detecting anomalies in a Kinesis-Lambda subsystem.


Figure 2: Number of anomalies detected by the Behavior Model versus Dynamic Thresholding for a Serverless pipeline (AWS Kinesis - Lambda) over 8 days

The model included 12 metrics, including Execution Time and Number of Invocations (from Lambda). For the dynamic-thresholding comparison, we chose 5 metrics and applied the well-known Tukey 1.5×IQR rule. As Figure 2 shows, the behavior model detected 89% fewer anomalies than the threshold-based approach over the 8-day period, and the number of alerts dropped significantly after the second day.
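
For reference, the Tukey 1.5×IQR rule flags any point outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch of a rolling (“dynamic”) version of that baseline might look like this; the window size and data are illustrative rather than the exact configuration used in the study:

```python
import numpy as np

def tukey_bounds(window):
    """Return the Tukey 1.5*IQR fences for a window of metric samples."""
    q1, q3 = np.percentile(window, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def dynamic_threshold_alerts(series, window_size=60):
    """Flag points that fall outside the fences of the preceding window."""
    alerts = []
    for i in range(window_size, len(series)):
        lo, hi = tukey_bounds(series[i - window_size:i])
        if not lo <= series[i] <= hi:
            alerts.append(i)
    return alerts

# Illustrative series: mostly stable invocation counts with one final spike.
series = np.r_[np.random.default_rng(2).normal(200, 10, 200), [400]]
print(dynamic_threshold_alerts(series))
# -> the spike (index 200) is flagged, possibly along with a stray noise point
```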

To test how well this generalizes, we also evaluated a number of Kubernetes application containers. Here the number of model metrics is much larger, nearly 30, while dynamic thresholds were applied to just five metrics such as Response Time (latency) and CPU utilization. The model-based approach generated 55% fewer alerts than the threshold-based approach over a 6-day period, and the number of alerts again decreased rapidly over time.

We have mentioned that avoiding false negatives is not easy, especially when previously unseen data ranges are encountered. It is in these cases that our approach is most effective. In one container, the behavior model flagged 100 data points as anomalies. On closer inspection it became clear that over a 100-sample interval, request counts and response times were not being received and were recorded as 0, even though bytes and packet-level metrics showed there was incoming traffic. Since response times of 0 cross no high-latency threshold, the threshold-based approach detected no anomalies. The behavior model detected this inconsistency between demand and response metrics and correctly marked those points as valid anomalies, avoiding 100 false negatives.
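
A simplified way to picture that kind of demand/response consistency check is sketched below; the function name, metric keys, and byte threshold are illustrative, and in practice the behavior model learns this relationship from data rather than applying an explicit rule:

```python
def demand_response_inconsistent(sample, min_bytes_in=1024):
    """Flag samples where the network shows incoming traffic but the
    application-level metrics report no requests and no response time.

    sample -- dict with keys 'bytes_in', 'request_count', 'response_time_ms'
    """
    traffic_present = sample["bytes_in"] >= min_bytes_in
    no_app_activity = sample["request_count"] == 0 and sample["response_time_ms"] == 0
    return traffic_present and no_app_activity

# A static latency threshold sees response_time_ms == 0 and stays silent,
# but the demand/response inconsistency is flagged here.
sample = {"bytes_in": 250_000, "request_count": 0, "response_time_ms": 0}
print(demand_response_inconsistent(sample))   # True -> anomaly
```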

Conclusions

Modern cloud-native applications require application-aware context: a knowledge-augmented ML approach that learns the application’s behavior profile to detect, and even predict, problems more reliably. We have applied this approach to real-time anomaly detection on microservice metrics at scale; it has been shown to significantly reduce false alerts across different applications deployed in the field, and it helps with causal isolation and time to problem resolution.

Get the full eBook Here

References

  1. Microservices: An explosion of metrics and few insights?, February 2020.
  2. Of Causality and Reasoning . . . OpsCruise’s Automated Root Cause Analysis, June 2021.
  3. Methods and systems for autonomous cloud application operations, US Patent 11,126,493, OpsCruise Inc., Issued September 21, 2021.


