AIOps: Forecasting with data drift considerations
Naga (Arun) Ayachitula
Vice President, AIOps Engineering (Data/Analytics & AI/ML) and Distinguished Engineer
Balakrishnan Saravanan Kesavan, Upendra Sharma, and Arun Ayachitula
It is crucial to monitor and forecast IT application monitoring metrics such as slowness/latency, traffic, and error rate, as well as infrastructure capacity metrics such as CPU, memory, disk, and network utilization. IT application metrics are essential for ensuring application health and performance. They help identify bottlenecks, predict future issues, and provide a seamless user experience. Proactive monitoring can prevent downtime and support decision-making for scaling and optimizing resources. High error rates or slowness can drive away users and damage an organization's reputation, while traffic metrics help with capacity planning and managing demand.

IT infrastructure capacity metrics ensure that the IT environment can handle current and future loads, maintain system performance, prevent outages, and support scalability. By anticipating growth, organizations can plan and budget for upgrades, avoid performance bottlenecks, ensure a good user experience, and keep systems reliable and efficient. Capacity monitoring also plays a critical role in cost management, especially in cloud environments where resource utilization directly impacts expenses.
Data drift in application monitoring and capacity metrics generally stems from changes in the application environment and its usage. Alterations in user behavior, such as increased engagement driven by new features or a shift in usage patterns, can cause shifts in performance metrics. Code updates, infrastructure adjustments, and external dependencies can introduce variability in how applications consume resources. Moreover, the data being processed may evolve, impacting system load and behavior. Such drifts necessitate continuous monitoring to ensure optimal performance and resource allocation.
Data drift occurs when the distribution of the input data to a machine learning model changes over time, making the model less accurate or reliable. Detecting it matters because drift can degrade the performance and validity of models, especially those deployed in dynamic and evolving domains. Therefore, monitoring for data drift and updating or retraining the models when it occurs is essential. We detect data drift (a change in the distribution of incoming data) using the Kolmogorov-Smirnov, Page-Hinkley, and ADWIN tests. Since this is time series data in which only auto-correlation is of interest and no independent variables are considered, concept drift (a change in the relationship between independent and dependent variables) is not applicable.
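As a minimal sketch of how a streaming drift check might look, the snippet below implements a simple Page-Hinkley-style detector over a univariate metric stream. The delta and threshold parameters and the synthetic stream are illustrative assumptions, not values from our production setup.

```python
# Minimal Page-Hinkley-style drift detector for a univariate metric stream.
# The delta and threshold values here are illustrative assumptions only.

class PageHinkley:
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # drift alarm threshold (lambda)
        self.mean = 0.0             # running mean of the stream
        self.n = 0                  # number of observations seen
        self.cum = 0.0              # cumulative deviation statistic
        self.min_cum = 0.0          # minimum of the cumulative statistic

    def update(self, x):
        """Feed one observation; return True if an upward drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.threshold


# Example: a synthetic CPU-utilization stream whose level shifts upward halfway through.
import random

random.seed(0)
stream = [random.gauss(40, 2) for _ in range(500)] + \
         [random.gauss(70, 2) for _ in range(500)]

detector = PageHinkley(delta=0.005, threshold=50.0)
for i, value in enumerate(stream):
    if detector.update(value):
        print(f"Drift signalled at observation {i}")
        break
```

In practice, libraries that implement Page-Hinkley and ADWIN can be used instead of hand-rolled detectors; the sketch is only meant to show the mechanics of flagging a sustained shift in the stream's level.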
Statistical tests for handling data drift
The Kolmogorov-Smirnov (KS), Page-Hinkley, and ADWIN tests compare the statistical distribution of two samples to determine whether they come from the same population. Whereas the Page-Hinkley and ADWIN tests compare only the means of the samples, the Kolmogorov-Smirnov test compares the full empirical distributions, making it sensitive to differences in both location (mean) and spread (variance). The KS statistic is used in the KS test, a nonparametric test that determines whether two datasets differ significantly or whether a dataset differs from a reference probability distribution. It compares the cumulative distributions of the datasets and calculates the maximum distance between them. The KS statistic quantifies this distance, and a p-value is derived to test the hypothesis. The difference between the datasets is considered statistically insignificant if the KS statistic is small or the p-value is high. Hence, we have used the Kolmogorov-Smirnov test, and the results are shown in the analysis below.
In the example below, the data distribution for weekly samples stays roughly the same within June but changes from June to July. The absence of change between the first two weeks is reflected in a KS test statistic of 0.0 with a p-value of 1.0, so the null hypothesis of no difference between the samples cannot be rejected. For the second pair of weekly samples, taken from the first and last week of the dataset, the KS statistic of 0.39 and p-value of 0.0 indicate that the null hypothesis must be rejected; that is, the data has drifted from June to July.
The KS test statistic can be interpreted as a distance between the two distributions. Its value ranges from 0 (for identical distributions) to 1 (for completely non-overlapping distributions).
Difference in weekly samples from the first two weeks of the dataset.
Difference in weekly samples from the first and last weeks of the dataset.
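For reference, a minimal sketch of how the two-sample KS comparison above could be reproduced with scipy is shown below. The hourly series is synthetic, and the window dates, column names, and helper function are illustrative assumptions.

```python
# Two-sample Kolmogorov-Smirnov test between weekly windows of a metric.
# The series is synthetic; any univariate metric series indexed by time works.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical hourly metric covering June and July, with a shift in July.
idx = pd.date_range("2023-06-01", "2023-07-31 23:00", freq="H")
rng = np.random.default_rng(42)
values = np.where(idx < pd.Timestamp("2023-07-01"),
                  rng.normal(50, 5, len(idx)),   # June behaviour
                  rng.normal(65, 8, len(idx)))   # shifted July behaviour
series = pd.Series(values, index=idx, name="metric")

def weekly_window(s, start):
    """Return one week of observations starting at `start`."""
    start = pd.Timestamp(start)
    return s[start:start + pd.Timedelta(days=7)]

# First week vs. second week of June: same distribution expected.
stat, p = ks_2samp(weekly_window(series, "2023-06-01"),
                   weekly_window(series, "2023-06-08"))
print(f"June week 1 vs week 2: KS={stat:.2f}, p={p:.3f}")

# First week vs. last week of the dataset: drift expected.
stat, p = ks_2samp(weekly_window(series, "2023-06-01"),
                   weekly_window(series, "2023-07-24"))
print(f"June week 1 vs July last week: KS={stat:.2f}, p={p:.3f}")
```

On such data, the first comparison should yield a small KS statistic with a large p-value, while the second should yield a larger statistic with a p-value near zero, mirroring the drifted example above.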
Comparing forecasting techniques for handling data drift
ETS, Prophet, and ARIMA models are compared below. As seen for this sample device, the MSE, MAE, and MAPE of ETS are comparable to Prophet's, but ETS has the significant advantage of running much faster than Prophet or ARIMA. This difference in runtime is critical when training and prediction are required for many devices in real time or on a regular schedule. Consider the example below of a device with hourly observations:
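As a rough sketch of how such a comparison might be run for a single device, the snippet below times statsmodels' Holt-Winters (ETS) and ARIMA implementations on a synthetic hourly series and reports MAPE on a held-out day. The model orders, seasonal settings, and series itself are illustrative assumptions; Prophet could be timed the same way from a dataframe with ds/y columns but is omitted for brevity.

```python
# Compare ETS (Holt-Winters) and ARIMA on a single device's hourly metric.
# The series is synthetic and the model orders are illustrative assumptions.
import time
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Synthetic hourly CPU-utilization-like series with a daily cycle (60 days).
idx = pd.date_range("2023-06-01", periods=24 * 60, freq="H")
rng = np.random.default_rng(7)
daily = 10 * np.sin(2 * np.pi * np.arange(len(idx)) / 24)
series = pd.Series(50 + daily + rng.normal(0, 2, len(idx)), index=idx)

train, test = series[:-24], series[-24:]   # hold out the last day

def mape(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

# ETS: triple exponential smoothing with additive trend and daily seasonality.
t0 = time.time()
ets = ExponentialSmoothing(train, trend="add", seasonal="add",
                           seasonal_periods=24).fit()
ets_fc = ets.forecast(24)
print(f"ETS   MAPE={mape(test, ets_fc):.2f}%  time={time.time() - t0:.2f}s")

# ARIMA: order chosen purely for illustration, not tuned.
t0 = time.time()
arima = ARIMA(train, order=(2, 1, 2)).fit()
arima_fc = arima.forecast(24)
print(f"ARIMA MAPE={mape(test, arima_fc):.2f}%  time={time.time() - t0:.2f}s")
```

Running the same evaluation per device and metric makes the runtime gap visible: the ARIMA fit is typically the slower of the two, which is why the speed advantage of ETS matters at fleet scale.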
Conclusion
We emphasize that when dealing with vast numbers of devices (ranging from 100K to a million), the choice of forecasting technique is a trade-off between accuracy and computational efficiency. While ARIMA may offer slightly more precise results in terms of Mean Absolute Error, Mean Squared Error, and similar measures, Exponential Triple Smoothing (ETS) is significantly faster, making it more suitable for real-time forecasting across many devices and metrics. The ability of ETS to provide accuracy comparable to ARIMA and Prophet, combined with its superior execution speed, highlights its practical utility in large-scale IT monitoring and forecasting scenarios where execution time is of the essence.
Acknowledgments
Thanks to Girish Mohite, Krishna Sumanth Gummadi, Subbareddy Paturu, Amar Mollakantalla, Murali Batthena, Vamshi Gummalla, Nitik Kumar, Kishore Maalae, Saravanan Kumarasamy, Divakar Reddy Doma, Rainy Moona, Muhammad Danish, Suryakanth Barathi, Shakuntala Prabhu, Pradeep Soundarajan, Hyder Khan, Godwin Dsouza, Prameela S, Vipin Sreedhar, Abhishek Gurav, Santosh Kumar Panigrahi, Diwakar Natarajan, Shivam Choudhary and Sander Plug for their contributions to AIOps development.