Getting a Grip on Data and Model Drift with Azure Machine Learning
By Natasha Savic and Andreas Kopp
Change is the only constant in life. In machine learning, it shows up as drift in data and model predictions and as decaying performance if not managed carefully.
In this article, we discuss data and model drift and how they affect the performance of production models. You will learn methods to identify and mitigate drift, as well as MLOps best practices to transition from static models to evergreen AI services using Azure Machine Learning.
We also include a sample notebook if you want to try out the concepts in practical examples.
Understanding data and model drift
Many machine learning projects conclude, after a phase of extensive data and feature engineering, modeling, training, and evaluation, with a satisfactory model that is deployed to production. However, the longer a model is in operation, the more problems can creep in that might remain undetected for quite a long time.
Figure 1: Data drift and performance degradation due to model drift
Data drift means that the distributions of the input data change over time. Drift can lead to a gap between what the model initially learned from the training data and the observations it receives during inferencing in production. Typical examples are demographic shifts in the user base, changed sensor behavior, or seasonal changes in customer buying patterns.
Model drift is accompanied by a decrease in model performance over time (e.g., an accuracy drop in a supervised classification use case). There are two main sources of model drift: drift in the input data itself and concept drift, where the real-world relationship between features and target changes.
Model drift typically stays undetected until new ground truth labels are available. The original test data is no longer a reliable benchmark because the real-world function has changed.
The following illustration summarizes the various kinds of drift:
Figure 2: Types of drift. Adapted from Data and concept drifts in machine learning | Towards Data Science
The transition from normal behavior to drift can take vastly different forms. Demographic changes in the real world typically lead to gradual data or model drift, whereas a broken sensor might cause abrupt deviations from the normal range. Seasonal fluctuations in buying behavior (e.g., the Christmas season) manifest as recurring drift.
If we have timestamps for our observations (or the data points are at least arranged chronologically), several methods can be applied to detect, analyze, and mitigate drift.
We will describe these methods in more detail below and experiment with them using a predictive maintenance case study.
From static to evergreen models
Figure 3: From static to evergreen models
The options to analyze and mitigate data and model drift depend on the availability of current data over the machine learning model's lifecycle.
Let us assume that a bank collected historical data to train a model to support credit lending decisions. The goal is to predict whether a loan application should be approved or rejected. Labelled training data was collected in the period from January to December 2020.
The bank's data scientists have spent the first quarter of 2021 training and evaluating the model and decided to bring it to production in April 2021. Let us look at three options the team can use for collecting production data:
Figure 4: Good drift management depends on data availability
Scenario 1: Static model
Here, the team doesn't collect any production data. Perhaps they did not consider this at all since their project scope only covered delivering the initial model. Another reason could be unresolved data privacy questions (storing regulated personal data).
Obviously, there is not much that can be done to detect data or model drift beyond analyzing the historical training data. Drift may only be uncovered when model users start complaining that the predictions are becoming increasingly unsuitable for business decisions. However, since feedback is not systematically collected, gradual drift will likely remain undiscovered for a long time.
Interestingly, many production machine learning models are run in this mode today. However, machine learning lifecycle management practices like MLOps are gaining traction to address issues like these.
The static model approach might be acceptable if the model is trained on representative data and the feature/target relationship is stable over time (e.g., biological phenomena which change at an evolutionary pace).
Scenario 2: Collecting production data
The team decides to collect observed input data (features) from the production phase together with the corresponding model predictions.
This approach is straightforward to implement if there are no data protection concerns or other organizational hurdles. By comparing recent production data with the original training observations, drift in features and predicted labels can be found. Significant shifts in key features (in terms of feature importance) can be used as a trigger for further investigation.
However, essential information is missing to determine whether there is a problem with the model: we do not have new ground truth labels to evaluate the production predictions. This can lead to problematic situations, for example when data drift is visible but its actual impact on model quality remains unknown, or when concept drift degrades predictions without any visible change in the inputs.
Scenario 3: Evergreen model
In this scenario, the bank not only analyzes production inputs and predictions for potential drift but also collects new labeled data. Depending on the business context, this can be done in several ways, for example through delayed ground truth (such as observed loan outcomes) or human-in-the-loop feedback.
Incorporating human-in-the-loop feedback requires adjustments to processes and systems (e.g., business users can overwrite or flag incorrect predictions in their applications).
The main advantages are that concept drift can be identified with high reliability and the model can regularly be refreshed by retraining.
Incorporating business feedback and regular retraining are an essential part of mature MLOps practices (see our reference architecture example for Azure Machine Learning below).
Data and model drift management in practice
It is essential to have a detection mechanism that measures drift systematically. Ideally, such a mechanism is part of an integrated MLOps workflow which compares training and inference distributions on a continuous basis. We have compiled several mechanisms that support data and model drift management.
We are using a predictive maintenance use case based on a synthetic dataset in our sample notebook. The goal is to predict equipment failure based on features like speed or heat deviations, operator, assembly line, days since the last service, etc.
To identify drift, we combine statistical techniques and distribution overlaps (data drift) as well as predictive techniques (model/concept drift). For both drift types, we will briefly introduce the method used.
Drift detection starts by partitioning a dataset of chronologically sorted observations into a reference and a current window. The reference (or baseline) window represents older observations and is often identical to the initial training data. The current window typically reflects more recent data points seen in the production phase. This is not a strict 1:1 mapping, as the windows might need to be adjusted to better locate when drift occurred.
Figure 5: Partitioning the chronological dataset into reference and current windows
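As a minimal sketch of this partitioning (assuming the observations live in a pandas DataFrame with a timestamp column; the file name, column names, and cut-off date are illustrative, not taken from the notebook):

import pandas as pd

# load the chronologically ordered observations (hypothetical file and column names)
df = pd.read_csv('observations.csv', parse_dates=['timestamp'])
df = df.sort_values('timestamp').reset_index(drop=True)

# everything before the cut-off forms the reference (baseline) window,
# everything after it forms the current window
cutoff = pd.Timestamp('2021-04-01')  # e.g., the production go-live date
reference = df[df['timestamp'] < cutoff]
current = df[df['timestamp'] >= cutoff]

Shifting the cut-off (or working with several sliding windows) makes it possible to narrow down when the drift started.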
We first need to differentiate between numerical and categorical/discrete data. For the statistical tests, both types of data will undergo distinct non-parametric tests that provide a p-value. We are handling different sample sizes and do not make assumptions about the actual distribution of our data. Therefore, non-parametric approaches are a handy way to test the similarity of two samples without needing to know the actual probability distribution.
These tests allow us to reject (or fail to reject) the null hypothesis with a degree of confidence, as defined by the p-value. As such, you can control the sensitivity of the test by adjusting the p-value threshold. We recommend a more conservative threshold such as 0.01 by default, because the larger your sample gets, the more prone the test is to picking up on noise. Other commonly used methods to quantify the drift between distributions are the Wasserstein distance for continuous data and the Jensen-Shannon (JS) divergence for probability distributions.
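As an illustrative sketch (assuming scipy and the reference/current windows from above; the column names are placeholders from our predictive maintenance example, and the specific pairing of tests is our choice rather than the notebook's exact implementation):

import pandas as pd
from scipy import stats
from scipy.spatial.distance import jensenshannon

ALPHA = 0.01  # conservative significance threshold, as recommended above

# numerical feature (illustrative column name): two-sample Kolmogorov-Smirnov test
# plus the Wasserstein distance as a magnitude measure
ks_stat, ks_p = stats.ks_2samp(reference['heat_deviation'], current['heat_deviation'])
w_dist = stats.wasserstein_distance(reference['heat_deviation'], current['heat_deviation'])
print(f'KS p-value: {ks_p:.4f} (drift suspected: {ks_p < ALPHA}), Wasserstein distance: {w_dist:.3f}')

# categorical feature: chi-squared test on the reference vs. current contingency table
operators = pd.concat([reference['operator'], current['operator']], ignore_index=True)
window = pd.Series(['reference'] * len(reference) + ['current'] * len(current))
contingency = pd.crosstab(operators, window)
chi2_stat, chi2_p, _, _ = stats.chi2_contingency(contingency)
print(f'Chi-squared p-value: {chi2_p:.4f} (drift suspected: {chi2_p < ALPHA})')

# Jensen-Shannon divergence between the two categorical frequency distributions
ref_freq = reference['operator'].value_counts(normalize=True)
cur_freq = current['operator'].value_counts(normalize=True)
categories = ref_freq.index.union(cur_freq.index)
js_distance = jensenshannon(ref_freq.reindex(categories, fill_value=0),
                            cur_freq.reindex(categories, fill_value=0))
print(f'JS divergence: {js_distance ** 2:.4f}')  # jensenshannon returns the distance (square root of the divergence)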
Best practices to limit the number of false alarms in drift detection include using a conservative p-value threshold and focusing on the features that matter most to the model.
While statistical tests are useful to identify drift, it is hard to interpret the magnitude of the drift as well as the direction in which it occurs. Given a variable like age, did the sample get older or younger, and how is the age spread? To answer those questions, it is useful to visualize the distributions. For this, we add another non-parametric method: Kernel Density Estimation (KDE). Since we have two different data types, we perform a pre-processing step on the categorical data to convert it into a pseudo-numerical notation by encoding the variables. The same ordinal encoder object is used for both the reference and the current distributions to ensure consistency:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer, make_column_selector as selector

# identify numerical and categorical columns in the reference window
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(reference[columns])
categorical_columns = categorical_columns_selector(reference[columns])

# ordinal (label) encoding of categorical columns for the KDE plots;
# categories unseen during fitting are mapped to -1
categorical_transformer = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# fit the encoder on the reference window and reuse it for the current window
reference_le = pd.DataFrame(categorical_transformer.fit_transform(reference[categorical_columns]),
                            columns=categorical_columns)

current_le = pd.DataFrame(categorical_transformer.transform(current[categorical_columns]),
                          columns=categorical_columns)
Now that we have encoded our data, we can visually inspect either the entire dataset or selected variables of interest. We use the Kernel Density Estimation functions of the reference and current samples to compute their intersection in percent. Our approach was adapted from this sample.
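As a rough illustration of the idea (a minimal sketch using scipy's gaussian_kde rather than the referenced sample's exact code):

import numpy as np
from scipy.stats import gaussian_kde

def kde_intersection(ref_values, cur_values, n_points=200):
    """Approximate the overlap (0..1) of two kernel density estimates."""
    ref_kde = gaussian_kde(ref_values)
    cur_kde = gaussian_kde(cur_values)
    # evaluate both densities on a common grid spanning both samples
    grid = np.linspace(min(ref_values.min(), cur_values.min()),
                       max(ref_values.max(), cur_values.max()), n_points)
    # the intersection is the area under the pointwise minimum of the two densities
    dx = grid[1] - grid[0]
    return np.sum(np.minimum(ref_kde(grid), cur_kde(grid))) * dx

# e.g., overlap of the encoded 'operator' feature between the reference and current windows
overlap = kde_intersection(reference_le['operator'].to_numpy(), current_le['operator'].to_numpy())
print(f'KDE intersection: {overlap:.1%}')

An intersection close to 100% indicates stable behavior, while low values point to drift worth investigating.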
The steps can be understood in more depth by checking out the code samples. The result of the KDE intersections looks as follows:
Figure 6: Comparing KDE intersections to identify data drift
A brief inspection of the plots tells us:
If we look at the variable "operator", we can see that there was a substantial change in which employees were operating the machines between the time the model was fitted and today. For example, an operator may have retired and, consequently, no longer operates any machines (operator 2). Conversely, we see that new operators have joined who were not present before (e.g., operator 6).
Now that we have seen how to uncover drift in features and labels, let us find out if the model is affected by data or concept drift.
As before, we stack the historical (training) observations and recent labeled data points into a chronological dataset. Then, we compare model performance based on the most recent observations, as the following overview illustrates.
Figure 7: Predictive model drift detection
At the core, we want to answer the question: does a newer model perform better at predicting the most recent data than a model trained on older observations?
We probably have no idea in advance if and where drift has crept in.
Therefore, in a first attempt we might use the original training data as the reference window and the inference observations as the current window. If we find drift this way, we will likely try out different reference and current windows to pinpoint where exactly it crept in. A couple of options exist for predictive model drift detection:
Each option has specific advantages. With the first alternative, you already have a trained candidate ready for deployment if model 2 outperforms model 1. The second option will likely reduce false positives, at the expense of being less sensitive to drift. Alternative 3 reduces unnecessary training cycles in cases where no drift is identified.
Let's check out how to use option 1 to find drift in our predictive maintenance use case. The aggregated dataset consists of 45,000 timestamped observations which we split into 20,000 reference, 20,000 current, and 5,000 most recent observations for test.
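A minimal sketch of that chronological split (assuming the aggregated observations sit in a dataframe called data with a timestamp column, and that the label column is named failure; these names, like columns, are assumptions for illustration):

# oldest 20,000 observations as reference, next 20,000 as current, newest 5,000 as test
data = data.sort_values('timestamp').reset_index(drop=True)

reference_df = data.iloc[:20000]
current_df = data.iloc[20000:40000]
test_df = data.iloc[40000:]

X_reference, y_reference = reference_df[columns], reference_df['failure']
X_current, y_current = current_df[columns], current_df['failure']
X_test, y_test = test_df[columns], test_df['failure']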
We define a scikit-learn pipeline to preprocess numerical and categorical features and train two LightGBM classifiers for comparison:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# preprocessing for numerical features: impute missing values, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('standard_scaler', StandardScaler())])

# preprocessing for categorical features: impute missing values, then one-hot encode
# (note: newer scikit-learn versions use sparse_output=False instead of sparse=False)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))])

# apply the appropriate transformer to each column group
transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)])
The training and evaluation of the reference classifier are then performed as follows:

from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report

# train on the reference (older) window, evaluate on the held-out most recent test set
reference_clf = Pipeline(steps=[('preprocessor', transformations),
                                ('clf', LGBMClassifier())])

reference_clf.fit(X_reference, y_reference)
reference_pred = reference_clf.predict(X_test)
reference_probas = reference_clf.predict_proba(X_test)

print('Reference set classification report:')
print(classification_report(y_test, reference_pred, digits=4))
After repeating the last step for the current classifier, we compare the performance metrics to find out if there is a noticeable gap between the models:
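That repetition and comparison might look roughly like this (a sketch mirroring the reference training above; using f1_score as the headline metric is our choice, not necessarily the notebook's):

from sklearn.metrics import f1_score

# train the second classifier on the more recent (current) window, evaluate on the same test set
current_clf = Pipeline(steps=[('preprocessor', transformations),
                              ('clf', LGBMClassifier())])

current_clf.fit(X_current, y_current)
current_pred = current_clf.predict(X_test)
current_probas = current_clf.predict_proba(X_test)

print('Current set classification report:')
print(classification_report(y_test, current_pred, digits=4))

# compare a headline metric of both models on identical test data
print('Reference model F1:', f1_score(y_test, reference_pred))
print('Current model F1:  ', f1_score(y_test, current_pred))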
The current model outperforms the reference model by a large margin. Therefore, we can conclude that we indeed have identified model drift and that the current model is a promising candidate for replacing the production model. A visual way of inspection is to compare the distributions of confidence scores of both classifiers:
Figure 8: Model drift impact on predicted class probabilities (left) with KDE intersection (right)
The histograms on the left show a clear difference between the predicted class probabilities of the reference and current models and therefore also confirm the existence of model drift.
Finally, we reuse our KDE plots and statistical tests from the data drift section to measure the extent of the drift. The intersection between the KDE plots for both classifiers amounts to only 85%. Furthermore, the results of the KS test suggest that the distributions are not identical.
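A sketch of that comparison, reusing the kde_intersection helper sketched in the data drift section together with scipy's two-sample KS test (comparing the class-1 probabilities is our assumption about how the scores are contrasted):

from scipy.stats import ks_2samp

# compare the predicted probabilities of the positive class from both models on the same test set
ref_scores = reference_probas[:, 1]
cur_scores = current_probas[:, 1]

print(f'KDE intersection of confidence scores: {kde_intersection(ref_scores, cur_scores):.1%}')
ks_stat, ks_p = ks_2samp(ref_scores, cur_scores)
print(f'KS test p-value: {ks_p:.4g}')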
In this example, the results were to be expected because we intentionally built drift into our synthetic predictive maintenance dataset. With a real-world dataset, results won't always be as obvious. Also, it might be necessary to try different reference and current window splits to reliably find model drift.
MLOps reference architecture for evergreen models
We will now focus on embedding drift detection and mitigation in an MLOps architecture with Azure Machine Learning. The following section leverages concepts such as Azure ML Datasets, Models, and Pipelines. The demo repository provides an example of an Azure ML pipeline for generating the data drift detection plots. To automate the re-training of models, we recommend using the latest Azure MLOps code samples and documentation. The following illustration provides a sample architecture including everything we learned about data and model drift so far.
Figure 9: MLOps architecture for evergreen models
By considering drift mitigation as part of an automated Azure MLOps workflow, we can maintain evergreen ML services with manageable effort. To do so, we collect production data and fresh ground truth labels, run drift detection on a continuous basis, and trigger retraining and redeployment whenever significant drift is found.
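For illustration, a recurring drift check could be scheduled with the Azure ML v1 Python SDK roughly as follows (a sketch only: the script, compute target, and names are placeholders, and the newer Azure ML CLI/SDK v2 uses a different, YAML-based pipeline syntax):

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, Schedule, ScheduleRecurrence
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# single step that runs the drift tests/plots shown earlier (placeholder script and compute)
drift_step = PythonScriptStep(name='drift-detection',
                              script_name='detect_drift.py',
                              source_directory='./drift',
                              compute_target='cpu-cluster',
                              allow_reuse=False)

published = Pipeline(workspace=ws, steps=[drift_step]).publish(name='drift-detection-pipeline')

# run the drift check once a week against the latest inference data
Schedule.create(ws, name='weekly-drift-check',
                pipeline_id=published.id,
                experiment_name='drift-monitoring',
                recurrence=ScheduleRecurrence(frequency='Week', interval=1))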
Next steps
In this article we have looked at practical concepts to find and mitigate drift of data and machine learning models for tabular use cases. We have seen these concepts in action using our demo notebook with the predictive maintenance example. We encourage you to adapt these methods to your own use cases and appreciate any feedback.
Being able to systematically identify and manage drift of production models is a big step towards mature MLOps practices. Therefore, we recommend integrating these concepts into an end-to-end production solution like the MLOps reference architecture introduced above. The last part of our notebook includes examples of how to implement drift detection using automated Azure Machine Learning Pipelines.
Data and model drift management is only one building block of a holistic MLOps vision and architecture. Feel free to check out the documentation for general information about MLOps and how to implement it using Azure Machine Learning.
In our examples, we have looked at tabular machine learning use cases. Drift mitigation approaches for unstructured data like images and natural language are also emerging. The Drift Detection in Medical Imaging AI repository provides a promising method of analyzing medical images in conjunction with metadata to detect model drift.