Data Leakage in Machine Learning – Avoiding the Trap
Sanchit Tiwari
Associate Partner at McKinsey & Company | Senior Principal at QuantumBlack, AI by McKinsey
Data leakage is one of the most frequent mistakes made while building machine learning models, and it can happen anywhere in the model-building life cycle, from data collection all the way to deployment. The good news is that you don't need to know everything about machine learning or computer science to understand the basics of data leakage. In this post we will cover what data leakage is and its types, and we will learn how to keep it out of the ML model development life cycle.

Data leakage, often just called leakage, describes the situation where the data we use to train a machine learning algorithm includes unexpected extra information about the target variable we are trying to predict. Put another way, leakage happens when the model is trained on information about the target that will not actually be available at prediction time once the model is deployed in real life.

For example, suppose we build a model to predict which customers will churn, and it performs far better on the test set than we expected, almost too good to be true. No model should perform that well, yet our evaluation metrics suggest we have the best possible model. Then we deploy it and something weird happens: performance collapses in production, the model barely makes a single accurate prediction, and it does no better than random, i.e. flipping a coin. The same model we trained and tuned for weeks is suddenly useless. What went wrong? The answer is leakage, a mistake we made at the very start by allowing leaked information into the training data.
Now that we know when and how data leakage occurs, consider its consequences. Leakage typically produces overly optimistic results during model development that turn out to be completely different, and disappointing, once the model is deployed and evaluated on new data. It can cause the algorithm to learn a suboptimal model that does much worse in actual deployment than a model developed in a leak-free setting, and this has real-world implications ranging from extra cost to lost revenue. For example, marketing budget gets spent on the wrong customers because a model trained on leaked data recommended them, which eventually hurts customers' perception of the campaign's quality and damages the company's brand. For these reasons, data leakage is one of the most serious mistakes in data science, and something that, as data scientists, we must always guard against.
With that understanding in place, let us walk through a couple of examples of how leakage creeps in. In the first example, suppose we are trying to predict whether a visitor to a bank's website is likely to open an account. If the user's record contains an account number field, it will normally be empty while the user is still exploring the site, and it only gets filled in once the user actually opens an account. Clearly the account number is not a legitimate feature here: it is not available while the user is still browsing, so using it to predict the likelihood of opening an account amounts to predicting the answer from the answer.

The second example shows how future information can leak into the past. Say we are developing a diagnostic model to predict a particular medical condition. The existing patient data set might contain a binary variable marking whether or not the patient had surgery for that condition. Of course that variable would be highly predictive of the condition, but it is not legitimately available at the time of prediction in real life. As these examples show, there are many different ways leakage can enter a training set, and in fact it is often the case that more than one leakage problem is present at once.
We can divide leakage into two main types: leakage in training data, where test data or future data gets mixed into the training data, and leakage in features, where something highly informative about the true label gets included as a feature.

Leakage in training data often happens when data pre-processing is done on the entire data set, letting patterns from the test portion influence training: for example, computing parameters for normalizing or rescaling, finding minimum/maximum feature values to detect and remove outliers, or using the distribution of a variable across the entire data set to estimate missing values in the training set. It also occurs with time series data, when records of future events are used to compute features for a specific prediction.

Leakage in features includes the case where we remove an illegitimate variable like an account ID but neglect to also remove its proxy variables, other fields that carry the same or similar information as the account ID. Similarly, sensitive fields containing a customer's personal details are sometimes left un-anonymized; they may carry information that improves prediction of the target but will not legitimately be available in real life.
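To make the pre-processing case concrete, here is a minimal sketch using scikit-learn (the data and variable names are purely illustrative). Fitting a scaler on the full data set lets test-set statistics leak into training; fitting it on the training split alone keeps the evaluation honest.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # toy feature matrix
y = rng.integers(0, 2, size=1000)       # toy binary target

# Leaky: the scaler's mean/std are computed from ALL rows,
# so statistics from the future test set influence training.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)

# Leak-free: split first, fit the scaler on the training split only,
# then reuse those same parameters on the held-out test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)     # statistics from training rows only
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)           # same parameters applied to test
```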
So now we understand what data leakage is, its types, and when it can happen, but how do we detect and avoid it in our own projects? Let us start with the steps before building the model. Exploratory data analysis (EDA) can surface hidden signals in the data: for example, we can check which features are suspiciously highly correlated with the target variable. From our earlier example, the binary feature indicating that a patient had a particular surgical procedure for the condition would correlate almost perfectly with the target.

After building the model, look at feature behavior in the fitted model, for example unusually high feature weights or very high information gains associated with a single variable. Most importantly, watch for surprising overall model performance: if your evaluation results are substantially higher than those reported for the same or similar problems on similar data sets, look closely at the instances or features that have the most influence on the model. If possible, one best practice is a limited real-life deployment of the trained model, comparing its performance in the field against the development results. For example, if a model is meant to be deployed across all 100 stores in the country, we can first deploy it in just one store and watch for a surprising gap between development and live performance, which may be due to data leakage during training.
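As a quick illustration of that EDA check, the sketch below ranks features by their absolute correlation with the target; the file and column names here are hypothetical. A near-perfect correlation, like the surgery flag above, deserves scrutiny.

```python
import pandas as pd

df = pd.read_csv("patients.csv")        # hypothetical data set and columns

# Correlate every numeric feature with the (hypothetical) target column
# and list the strongest ones first; suspiciously high values may be leaks.
corr = df.select_dtypes("number").corr()["has_condition"].drop("has_condition")
print(corr.abs().sort_values(ascending=False).head(10))
```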
Another best practice is to do data preparation within each cross-validation fold. If you are scaling the data, the scaling parameters should be computed from the training portion of the cross-validation split, not the entire data set, and the same parameters should then be applied to the corresponding held-out test fold. When working with time series data, keep track of the timestamp associated with each data instance; that ensures you are not including information from the future in your current feature calculations or training data. Finally, set aside a completely separate data set at the start to simulate real-world deployment. It will tell you whether the model generalizes well to new data, and if there is a significant drop in performance, leakage may be one contributing factor alongside the usual suspects such as overfitting.
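Both precautions are easy to follow with a modeling pipeline. A small sketch, again with toy data: scikit-learn's Pipeline refits the scaler inside every cross-validation training fold, and TimeSeriesSplit keeps every training fold strictly earlier in time than the fold it is tested on.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)

# The pipeline refits the scaler on each training fold, so scaling
# parameters never see the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
print(cross_val_score(model, X, y, cv=5))

# For time-ordered data: each training fold precedes its test fold,
# so no future information leaks into feature scaling or fitting.
print(cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)))
```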
It seems easy to avoid data leakage, right? But in today's world, data for a data science project is collected from many sources, internal and external, and involves many people who have no awareness of data leakage. As data scientists, we should always build a detailed understanding of how the data was collected and generated, because that makes leakage far easier to detect and avoid. To sum things up, in this post we covered what data leakage is and its types, how leakage is typically introduced into training data and features (including the time series case where future information leaks into feature computation), and how to guard against leakage in each phase of model building: data collection > data preparation > feature engineering > data split > training > tuning > evaluation > deployment.