AIOps: Forecasting with data drift considerations

Balakrishnan Saravanan Kesavan, Upendra Sharma, and Arun Ayachitula

It is crucial to monitor and forecast IT application monitoring metrics such as slowness/latency, traffic, and error rate, as well as infrastructure capacity metrics such as CPU, memory, disk, and network utilization. Application metrics are essential for ensuring application health and performance: they help identify bottlenecks, predict future issues, and support a seamless user experience. Proactive monitoring can prevent downtime and inform decisions about scaling and optimizing resources. High error rates or slowness can drive away users and damage an organization's reputation, while traffic metrics support capacity planning and demand management.

IT infrastructure capacity metrics ensure that the environment can handle current and future loads, maintain system performance, prevent outages, and support scalability. By anticipating growth, organizations can plan and budget for upgrades, avoid performance bottlenecks, ensure a good user experience, and keep systems reliable and efficient. Capacity forecasting also plays a critical role in cost management, especially in cloud environments where resource utilization directly impacts expenses.

[Figure: Application Performance Monitoring (APM) metrics]
[Figure: IT infrastructure capacity metrics]

Data drift in application monitoring and capacity metrics generally stems from changes in the application environment and its usage. Shifts in user behavior, such as increased engagement after a new feature ships or a change in usage patterns, can move performance metrics. Code updates, infrastructure adjustments, and external dependencies can introduce variability in how applications consume resources. Moreover, the data being processed may itself evolve, affecting system load and behavior. Such drifts necessitate continuous monitoring to ensure optimal performance and resource allocation.

Data drift is a phenomenon in which the distribution of a machine learning model's input data changes over time, making the model less accurate or reliable. It matters because it can undermine the performance and validity of deployed models, especially in dynamic, evolving domains, and can lead to model quality degradation. Monitoring for data drift and updating or retraining the affected models is therefore essential. Data drift (a change in the distribution of incoming data) is detected using the Kolmogorov-Smirnov, Page-Hinkley, and ADWIN tests. Since this is time series data in which only auto-correlation is of interest and no independent variables are considered, concept drift (a change in the relationship between independent and dependent variables) is not applicable.
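Of these detectors, Page-Hinkley is simple enough to sketch directly. The following minimal Python implementation flags an upward shift in a metric's mean; the delta and lam thresholds and the synthetic stream are illustrative assumptions, not values from our deployment:

```python
import random

class PageHinkley:
    """Minimal Page-Hinkley detector for an upward shift in the mean.

    delta absorbs small fluctuations; lam is the detection threshold.
    """
    def __init__(self, delta=0.01, lam=25.0):
        self.delta, self.lam = delta, lam
        self.mean = 0.0              # running mean of the stream
        self.n = 0                   # observations seen so far
        self.cum = 0.0               # cumulative deviation m_t
        self.min_cum = float("inf")  # minimum of m_t seen so far

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n      # incremental mean
        self.cum += x - self.mean - self.delta     # update m_t
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.lam  # True => drift

# Synthetic stream whose mean jumps from 10 to 13 halfway through.
detector = PageHinkley()
stream = [random.gauss(10, 1) for _ in range(500)] + \
         [random.gauss(13, 1) for _ in range(500)]
for t, x in enumerate(stream):
    if detector.update(x):
        print(f"Drift flagged at t={t}")
        break
```

ADWIN follows the same streaming pattern but adapts its comparison window automatically, dropping older data once two sub-windows disagree in mean.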

Statistical tests for handling data drift

The Kolmogorov-Smirnov (KS), Page-Hinkley, and ADWIN tests all decide whether two samples (or two stretches of a stream) come from the same population. Whereas the Page-Hinkley and ADWIN tests monitor only the mean, the Kolmogorov-Smirnov test compares the full empirical distributions, making it sensitive to changes in location, spread, and shape. The KS statistic is used in the KS test, a nonparametric test that determines whether two datasets differ significantly, or whether a dataset differs from a reference probability distribution. It compares the empirical cumulative distribution functions of the two datasets and reports the maximum vertical distance between them, D = sup_x |F1(x) - F2(x)|; a p-value is then derived from D to test the null hypothesis that both samples share one distribution. If the KS statistic is small or the p-value is high, the difference between the samples is considered statistically insignificant. Hence, we have chosen the Kolmogorov-Smirnov test, and the results are shown in the analysis below.
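A minimal sketch of this comparison with SciPy, assuming an hourly metric so that one week contributes 168 observations; the synthetic series below only illustrates the mechanics, not our production data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic hourly metric: two stable June weeks, then a drifted July week.
june_week1 = rng.normal(50, 5, 168)
june_week2 = rng.normal(50, 5, 168)
july_week = rng.normal(60, 8, 168)   # mean and spread have shifted

# Two-sample KS test: D is the largest gap between the empirical CDFs.
d, p = stats.ks_2samp(june_week1, june_week2)
print(f"June vs June: KS={d:.2f}, p={p:.2f}")   # small D, high p: no drift

d, p = stats.ks_2samp(june_week1, july_week)
print(f"June vs July: KS={d:.2f}, p={p:.2g}")   # large D, tiny p: drift
```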

In the example below, the data distribution of the weekly samples stays about the same within June but changes from June to July. The absence of change between the first two weeks shows up as a KS statistic of 0.0 with a p-value of 1.0, so the null hypothesis of no difference between the samples cannot be rejected. For the second pair of weekly samples, taken from the first and last weeks of the dataset, the KS statistic of 0.39 and p-value of 0.0 mean the null hypothesis must be rejected: there is data drift from June to July.

The KS test statistic can be interpreted as a distance between the two distributions. Its value ranges from 0 (identical distributions) to 1 (completely non-overlapping distributions).

The difference in weekly samples from the first two weeks of the dataset:

  • Number of observations in the two samples: 168, 168
  • KS test statistic and p-value: [0.0, 1.0]

The difference in weekly samples from the first and last weeks of the dataset:

  • Number of observations in the two samples: 168, 168
  • KS test statistic and p-value: [0.39, 0.0]

In the second example below, comparing weekly samples from the first and last weeks of the dataset again indicates data drift from June to July, this time with an even larger KS statistic.

The difference in weekly samples from the first two weeks of the dataset:

  • Number of observations in the two samples: 168, 168
  • KS test statistic and p-value: [0.0, 1.0]

The difference in weekly samples from the first and last weeks of the dataset:

  • Number of observations in the two samples: 168, 168
  • KS test statistic and p-value: [0.91, 0.0]

In the third example below, the drift from June to July is milder: the KS statistic of 0.16 is small, but the p-value of 0.03 still falls below the conventional 0.05 significance threshold, so the null hypothesis of no difference is again rejected.

The difference in weekly samples from the first two weeks of the dataset:

  • Number of observations in the two samples: 168, 168
  • KS test statistic and p-value: [0.0, 1.0]

The difference in weekly samples from the first and last weeks of the dataset:

  • Number of observations in the two samples: 168, 168
  • KS test statistic and p-value: [0.16, 0.03]
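Putting the pieces together, drift detection is what decides when a forecasting model should be refreshed. A minimal sketch of that loop, assuming an hourly metric, an ETS model from statsmodels, and an illustrative 0.05 significance level (none of these are fixed choices from our system):

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.holtwinters import ExponentialSmoothing

WEEK = 168  # hourly observations per week

def fit_ets(history):
    """Fit an additive ETS model with a 24-hour seasonal cycle."""
    return ExponentialSmoothing(
        history, trend="add", seasonal="add", seasonal_periods=24
    ).fit()

def refresh_if_drifted(history, model, alpha=0.05):
    """Refit only when the latest week drifted from the previous week."""
    prev_week, last_week = history[-2 * WEEK:-WEEK], history[-WEEK:]
    _, p = stats.ks_2samp(prev_week, last_week)
    if p < alpha:                             # drift detected: retrain
        model = fit_ets(history[-4 * WEEK:])  # refit on the recent month
    return model
```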


Comparing forecasting techniques for handling data drift

ETS, Prophet, and ARIMA models are compared below. For this sample device, ETS's MSE, MAE, and MAPE are comparable to Prophet's, but ETS has a significant advantage: it runs much faster than either Prophet or ARIMA. This time difference is critical when training and prediction must be performed for many devices in real time or on a regular schedule. Consider the example below of a device with hourly observations:

[Table: Algorithm execution times]
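A timing comparison along these lines can be sketched with statsmodels; the 24-hour seasonal period, the ARIMA order (2, 1, 2), and the synthetic series are illustrative assumptions rather than our production configuration, and Prophet (omitted here) would be timed the same way:

```python
import time
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)

# Four weeks of a synthetic hourly metric with a daily seasonal cycle.
hours = np.arange(24 * 28)
y = 50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)
train, test = y[:-24], y[-24:]   # hold out the final day

def benchmark(name, fit_and_forecast):
    """Time one fit-and-forecast run and score it against the holdout."""
    start = time.perf_counter()
    forecast = fit_and_forecast()
    elapsed = time.perf_counter() - start
    mae = np.mean(np.abs(forecast - test))
    print(f"{name}: {elapsed:.2f}s, MAE={mae:.2f}")

# ETS: additive trend and seasonality with a 24-hour period.
benchmark("ETS", lambda: ExponentialSmoothing(
    train, trend="add", seasonal="add", seasonal_periods=24
).fit().forecast(24))

# ARIMA with an illustrative (2, 1, 2) order.
benchmark("ARIMA", lambda: ARIMA(train, order=(2, 1, 2)).fit().forecast(24))
```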

Conclusion

We emphasize that when dealing with vast numbers of devices (ranging from 100K to a million), the choice of forecasting technique is a trade-off between accuracy and computational efficiency. While ARIMA may offer slightly more precise results in terms of Mean Absolute Error, Mean Squared Error, and similar measures, Exponential Triple Smoothing (ETS) is significantly faster, making it more suitable for real-time forecasting across many devices and metrics. The ability of ETS to provide accuracy comparable to ARIMA and Prophet, combined with its superior execution speed, highlights its practical utility in large-scale IT monitoring and forecasting scenarios where execution time is of the essence.


Acknowledgments

Thanks to Girish Mohite, Krishna Sumanth Gummadi, Subbareddy Paturu, Amar Mollakantalla, Murali Batthena, Vamshi Gummalla, Nitik Kumar, Kishore Maalae, Saravanan Kumarasamy, Divakar Reddy Doma, Rainy Moona, Muhammad Danish, Suryakanth Barathi, Shakuntala Prabhu, Pradeep Soundarajan, Hyder Khan, Godwin Dsouza, Prameela S, Vipin Sreedhar, Abhishek Gurav, Santosh Kumar Panigrahi, Diwakar Natarajan, Shivam Choudhary and Sander Plug for their contributions to AIOps development.

