AIOps: Forecasting with data drift considerations
Naga (Arun) Ayachitula
Vice President, AIOps Engineering (Data/Analytics & AI/ML) and Distinguished Engineer
Balakrishnan Saravanan Kesavan, Upendra Sharma, and Arun Ayachitula
It is crucial to monitor and forecast IT application monitoring metrics such as slowness/latency, traffic, and error rate, as well as infrastructure capacity metrics such as CPU, memory, disk, and network utilization. IT application metrics are essential for ensuring application health and performance. They help identify bottlenecks, predict future issues, and provide a seamless user experience. Proactive monitoring can prevent downtime and support decision-making for scaling and optimizing resources. High error rates or slowness can drive away users and damage an organization's reputation, while traffic metrics help with capacity planning and managing demand.

IT infrastructure capacity metrics ensure that the IT environment can handle current and future loads, maintain system performance, prevent outages, and support scalability. By anticipating growth, organizations can plan and budget for upgrades, avoid performance bottlenecks, ensure a good user experience, and keep systems reliable and efficient. Capacity monitoring also plays a critical role in cost management, especially in cloud environments where resource utilization directly impacts expenses.
Data drift in application monitoring and capacity metrics generally stems from changes in the application environment and its usage. Alterations in user behavior, such as increased engagement driven by new features or a shift in usage patterns, can cause shifts in performance metrics. Code updates, infrastructure adjustments, and external dependencies can introduce variability in how applications consume resources. Moreover, the data being processed may evolve, impacting system load and behavior. Such drifts necessitate continuous monitoring to ensure optimal performance and resource allocation.
Data drift occurs when the distribution of the input data to a machine learning model changes over time, making the model less accurate or reliable. Detecting it matters because drift can degrade the performance and validity of models, especially those deployed in dynamic and evolving domains. Therefore, monitoring for data drift and updating or retraining the models when it occurs is essential. We detect data drift (a change in the distribution of incoming data) using the Kolmogorov-Smirnov, Page-Hinkley, and ADWIN tests. Since this is time series data in which only auto-correlation is of interest and no independent variables are considered, concept drift (a change in the relationship between independent and dependent variables) is not applicable.
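As a minimal sketch of how a streaming drift check might look, the snippet below implements a simple Page-Hinkley-style detector over a univariate metric stream. The delta and threshold parameters and the synthetic stream are illustrative assumptions, not values from our production setup.

```python
# Minimal Page-Hinkley-style drift detector for a univariate metric stream.
# The delta and threshold values here are illustrative assumptions only.

class PageHinkley:
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # drift alarm threshold (lambda)
        self.mean = 0.0             # running mean of the stream
        self.n = 0                  # number of observations seen
        self.cum = 0.0              # cumulative deviation statistic
        self.min_cum = 0.0          # minimum of the cumulative statistic

    def update(self, x):
        """Feed one observation; return True if an upward drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.threshold


# Example: a synthetic CPU-utilization stream whose level shifts upward halfway through.
import random

random.seed(0)
stream = [random.gauss(40, 2) for _ in range(500)] + \
         [random.gauss(70, 2) for _ in range(500)]

detector = PageHinkley(delta=0.005, threshold=50.0)
for i, value in enumerate(stream):
    if detector.update(value):
        print(f"Drift signalled at observation {i}")
        break
```

In practice, libraries that implement Page-Hinkley and ADWIN can be used instead of hand-rolled detectors; the sketch is only meant to show the mechanics of flagging a sustained shift in the stream's level.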
Statistical tests for handling data drift
The Kolmogorov-Smirnov (KS), Page-Hinkley, and ADWIN tests compare the statistical distribution of two samples to determine whether they come from the same population. Whereas the Page-Hinkley and ADWIN tests compare only the means of the samples, the Kolmogorov-Smirnov test compares the full empirical distributions, making it sensitive to differences in both location (mean) and spread (variance). The KS statistic is used in the KS test, a nonparametric test that determines whether two datasets differ significantly or whether a dataset differs from a reference probability distribution. It compares the cumulative distributions of the datasets and calculates the maximum distance between them. The KS statistic quantifies this distance, and a p-value is derived to test the hypothesis. The difference between the datasets is considered statistically insignificant if the KS statistic is small or the p-value is high. Hence, we have used the Kolmogorov-Smirnov test, and the results are shown in the analysis below.
In the example below, the data distribution for weekly samples stays roughly the same within June but changes from June to July. The absence of change between the first two weeks is reflected in a KS test statistic of 0.0 with a p-value of 1.0, so the null hypothesis of no difference between the samples cannot be rejected. For the second pair of weekly samples, taken from the first and last week of the dataset, the KS statistic of 0.39 and p-value of 0.0 indicate that the null hypothesis must be rejected; that is, the data has drifted from June to July.
The KS test statistic can be interpreted as a distance between the two distributions. Its value ranges from 0 (for identical distributions) to 1 (for completely non-overlapping distributions).
Difference in weekly samples from the first two weeks of the dataset.
Difference in weekly samples from the first and last weeks of the dataset.
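For reference, a minimal sketch of how the two-sample KS comparison above could be reproduced with scipy is shown below. The hourly series is synthetic, and the window dates, column names, and helper function are illustrative assumptions.

```python
# Two-sample Kolmogorov-Smirnov test between weekly windows of a metric.
# The series is synthetic; any univariate metric series indexed by time works.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical hourly metric covering June and July, with a shift in July.
idx = pd.date_range("2023-06-01", "2023-07-31 23:00", freq="H")
rng = np.random.default_rng(42)
values = np.where(idx < pd.Timestamp("2023-07-01"),
                  rng.normal(50, 5, len(idx)),   # June behaviour
                  rng.normal(65, 8, len(idx)))   # shifted July behaviour
series = pd.Series(values, index=idx, name="metric")

def weekly_window(s, start):
    """Return one week of observations starting at `start`."""
    start = pd.Timestamp(start)
    return s[start:start + pd.Timedelta(days=7)]

# First week vs. second week of June: same distribution expected.
stat, p = ks_2samp(weekly_window(series, "2023-06-01"),
                   weekly_window(series, "2023-06-08"))
print(f"June week 1 vs week 2: KS={stat:.2f}, p={p:.3f}")

# First week vs. last week of the dataset: drift expected.
stat, p = ks_2samp(weekly_window(series, "2023-06-01"),
                   weekly_window(series, "2023-07-24"))
print(f"June week 1 vs July last week: KS={stat:.2f}, p={p:.3f}")
```

On such data, the first comparison should yield a small KS statistic with a large p-value, while the second should yield a larger statistic with a p-value near zero, mirroring the drifted example above.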
Comparing forecasting techniques for handling data drift
ETS, Prophet, and ARIMA models are compared below. As seen for this sample device, the MSE, MAE, and MAPE of ETS are comparable to Prophet's, but ETS has the significant advantage of running much faster than Prophet or ARIMA. This difference in runtime is critical when training and prediction are required for many devices in real time or on a regular schedule. Consider the example below of a device with hourly observations:
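As a rough sketch of how such a comparison might be run for a single device, the snippet below times statsmodels' Holt-Winters (ETS) and ARIMA implementations on a synthetic hourly series and reports MAPE on a held-out day. The model orders, seasonal settings, and series itself are illustrative assumptions; Prophet could be timed the same way from a dataframe with ds/y columns but is omitted for brevity.

```python
# Compare ETS (Holt-Winters) and ARIMA on a single device's hourly metric.
# The series is synthetic and the model orders are illustrative assumptions.
import time
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Synthetic hourly CPU-utilization-like series with a daily cycle (60 days).
idx = pd.date_range("2023-06-01", periods=24 * 60, freq="H")
rng = np.random.default_rng(7)
daily = 10 * np.sin(2 * np.pi * np.arange(len(idx)) / 24)
series = pd.Series(50 + daily + rng.normal(0, 2, len(idx)), index=idx)

train, test = series[:-24], series[-24:]   # hold out the last day

def mape(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

# ETS: triple exponential smoothing with additive trend and daily seasonality.
t0 = time.time()
ets = ExponentialSmoothing(train, trend="add", seasonal="add",
                           seasonal_periods=24).fit()
ets_fc = ets.forecast(24)
print(f"ETS   MAPE={mape(test, ets_fc):.2f}%  time={time.time() - t0:.2f}s")

# ARIMA: order chosen purely for illustration, not tuned.
t0 = time.time()
arima = ARIMA(train, order=(2, 1, 2)).fit()
arima_fc = arima.forecast(24)
print(f"ARIMA MAPE={mape(test, arima_fc):.2f}%  time={time.time() - t0:.2f}s")
```

Running the same evaluation per device and metric makes the runtime gap visible: the ARIMA fit is typically the slower of the two, which is why the speed advantage of ETS matters at fleet scale.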
Conclusion
We emphasize that when dealing with vast numbers of devices (ranging from 100K to a million), the choice of forecasting technique is a trade-off between accuracy and computational efficiency. While ARIMA may offer slightly more precise results in terms of Mean Absolute Error, Mean Squared Error, and similar measures, Exponential Triple Smoothing (ETS) is significantly faster, making it more suitable for real-time forecasting across many devices and metrics. The ability of ETS to provide accuracy comparable to ARIMA and Prophet, combined with its superior execution speed, highlights its practical utility in large-scale IT monitoring and forecasting scenarios where execution time is of the essence.
Acknowledgments
Thanks to Girish Mohite, Krishna Sumanth Gummadi, Subbareddy Paturu, Amar Mollakantalla, Murali Batthena, Vamshi Gummalla, Nitik Kumar, Kishore Maalae, Saravanan Kumarasamy, Divakar Reddy Doma, Rainy Moona, Muhammad Danish, Suryakanth Barathi, Shakuntala Prabhu, Pradeep Soundarajan, Hyder Khan, Godwin Dsouza, Prameela S, Vipin Sreedhar, Abhishek Gurav, Santosh Kumar Panigrahi, Diwakar Natarajan, Shivam Choudhary and Sander Plug for their contributions to AIOps development.