Data Leakage for Time-dependent Data and Features in Machine Learning
What’s data leakage and why should I care?
Wikipedia defines data leakage as:
"In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment." — Wikipedia
Today we will focus on one kind of data leakage: time-dependent data leakage. In machine learning, the label event (or observation) happens at a certain timestamp in the past. Time-dependent data leakage occurs when model training uses data from after that timestamp, i.e., future data.
Time-dependent data leakage will cause serious problems for your project. You spend time developing and ramping a model; its training performance is superb, but its actual performance in production is bad. Unlike other mistakes that can be caught during model training, it's usually not until production inference time that you figure out the model doesn't work. Bringing a model to production is time-consuming, and you need to wait quite a bit longer to collect A/B testing metrics before the problem surfaces. The time wasted from development to ramp is usually months. Machine learning teams new to this problem need to spend extra weeks or even months studying the root cause. The opportunity cost is high indeed.
Symptoms and Data Leakage Examples
Time-dependent data leakage is extremely hard to detect. The typical symptom is that your training result looks good (usually extremely good) but inference accuracy turns out bad (and usually really bad).
I have collected some time-dependent data leakage examples. Some of them are so sneaky that I could barely detect them at first sight. (The examples are simplified for readability.)
Example 1: Online User Activity Data
E-commerce websites collect user activity data, like pages you have viewed, products you have rated, etc. This activity data is timestamped. Consider an e-commerce website that has users' product ratings, users' profile data, and their purchase quantity history for each product.
If we want to train a machine learning model to predict the product rating, we can use age, balance, and purchase quantity as features. The purchase quantity is a time-dependent feature. If we simply use the latest purchase quantity, whose purchase timestamp may be later than the label's event timestamp, we are leaking future information. Look at user_id 1: the label timestamp is 2022-03-01. Two purchase events are related to this user id, one on 2022-01-02 with a quantity of 1 and one on 2022-04-01 with a quantity of 200. If we use both, or just the latest, then we are leaking future data (2022-04-01) into our training.
At best, it's a useless feature. Worse, it may produce good model results during training while production inference is bad. For instance, a high product rating may indicate a later re-purchase with a higher quantity, and machine learning models are smart enough to pick up such intricate relationships.
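To see the leak concretely, here is a minimal pandas sketch (hypothetical column names, not any website's actual schema): naively taking each user's latest purchase silently pulls in the 2022-04-01 row.

```python
import pandas as pd

# Hypothetical purchase history for user_id 1 from Example 1.
purchases = pd.DataFrame({
    "user_id": [1, 1],
    "purchase_ts": pd.to_datetime(["2022-01-02", "2022-04-01"]),
    "quantity": [1, 200],
})

# The rating label for user_id 1 was observed on 2022-03-01.
label_ts = pd.Timestamp("2022-03-01")

# LEAKY: "latest purchase quantity" ignores the label timestamp,
# so it picks up the 2022-04-01 purchase (quantity=200) -- an
# event from the label's future.
leaky = (purchases.sort_values("purchase_ts")
                  .groupby("user_id")["quantity"]
                  .last())
print(leaky)  # user_id 1 -> 200 (future leak)
```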
Example 2: Bank Loan with FICO Score
We want to build a machine learning model to predict whether we should issue a loan to a client. We collected some data, as well as the latest FICO score, as our features. The training accuracy is fantastic, but the inference accuracy on real data sinks. It turns out the latest FICO score is newer than the issuance of the loan, so it is actually from the future. The latest FICO score is impacted by the issuance of the loan: if a loan is issued to a client, the client's FICO score usually goes down shortly afterward. At inference time, however, this information doesn't exist at all; we only have FICO scores from before the loan issuance. The right approach is to use the FICO score right before the issuance of the loan instead of the LATEST one during model training.
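One way to implement "the FICO score right before issuance" is an as-of join. A minimal sketch using pandas.merge_asof, with hypothetical column names (client_id, issue_ts, fico_ts, fico):

```python
import pandas as pd

# Hypothetical loan applications (one row per issued loan).
loans = pd.DataFrame({
    "client_id": [42],
    "issue_ts": pd.to_datetime(["2022-06-15"]),
}).sort_values("issue_ts")

# Hypothetical FICO history; the score drops after the loan is issued.
fico_history = pd.DataFrame({
    "client_id": [42, 42],
    "fico_ts": pd.to_datetime(["2022-05-01", "2022-07-01"]),
    "fico": [710, 650],
}).sort_values("fico_ts")

# direction="backward" attaches, for each loan, the most recent FICO
# score at or before issue_ts -- never a future one. Pass
# allow_exact_matches=False to require a strictly earlier score.
train = pd.merge_asof(
    loans,
    fico_history,
    left_on="issue_ts",
    right_on="fico_ts",
    by="client_id",
    direction="backward",
)
print(train[["client_id", "issue_ts", "fico"]])  # fico == 710, not 650
```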
Example 3: Imputing Missing Values for Time-dependent Data
Imagine we have a feature preprocessing pipeline that imputes missing time-dependent FICO scores. We simply take an average of all FICO scores and assign the preprocessing time as the timestamp for the newly imputed data. Here we have used future data to compute the average, and the imputed rows carry a timestamp long after the original observations. This is also a kind of data leakage.
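A leakage-safe alternative is to impute each missing score from past observations only and keep the original observation timestamps. A minimal sketch, assuming a single client's FICO time series (hypothetical data):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01"]),
    "fico": [700.0, np.nan, 680.0, np.nan],
}).sort_values("ts")

# LEAKY: a global mean averages over future scores as well.
leaky = scores["fico"].fillna(scores["fico"].mean())

# SAFE: an expanding mean only sees values up to each row's
# timestamp, so every imputed value uses past information only.
past_mean = scores["fico"].expanding().mean()
safe = scores["fico"].fillna(past_mean)
print(safe)  # 2022-02-01 -> 700.0, 2022-04-01 -> 690.0
```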
What Causes Time-dependent Data Leakage
A typical machine learning training dataset consists of a label and its corresponding features. There are two types of features: non-time-dependent features and time-dependent features. A time-dependent feature should only contain information preceding the timestamp when the label happened. For example, if the label event happens at t1, then the features should only contain information from before t1. If the feature mixes in information later than t1, then it's a future leak. When you join the training dataset this way, we call it a point-in-time correct join.
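As a cheap guardrail, you can also assert point-in-time correctness on an assembled training set. A minimal sketch, assuming each training row carries both a feature timestamp and a label timestamp (hypothetical column names):

```python
import pandas as pd

def assert_point_in_time_correct(df: pd.DataFrame,
                                 feature_ts_col: str,
                                 label_ts_col: str) -> None:
    """Fail fast if any feature row is newer than its label event (t1)."""
    leaked = df[df[feature_ts_col] > df[label_ts_col]]
    if not leaked.empty:
        raise ValueError(
            f"Future leak: {len(leaked)} row(s) have {feature_ts_col} "
            f"later than {label_ts_col}"
        )
```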
Here is a simple illustration of a point-in-time join (courtesy of https://github.com/linkedin/feathr/blob/main/docs/concepts/point-in-time-join.md). You can see it always joins feature data that is older than the label data during model training.
Let's apply a point-in-time correct join to our first example. For user_id 1, we only join feature data that is older than 2022-03-01 and discard the rest (here we discard the 2022-04-01 purchase with a quantity of 200). The same goes for other user ids.
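To make the process concrete, here is a minimal pandas sketch of a point-in-time join using the hypothetical Example 1 data: join on the entity key, drop feature rows that are newer than the label, then keep the most recent remaining row per label.

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1],
    "label_ts": pd.to_datetime(["2022-03-01"]),
    "rating": [5],
})

purchases = pd.DataFrame({
    "user_id": [1, 1],
    "purchase_ts": pd.to_datetime(["2022-01-02", "2022-04-01"]),
    "quantity": [1, 200],
})

# 1. Join labels to all purchases of the same user.
joined = labels.merge(purchases, on="user_id")

# 2. Point-in-time filter: keep only purchases older than the label.
past = joined[joined["purchase_ts"] < joined["label_ts"]]

# 3. Keep the most recent surviving purchase per label row.
idx = past.groupby(["user_id", "label_ts"])["purchase_ts"].idxmax()
train = past.loc[idx]
print(train)  # quantity == 1; the 2022-04-01 row was discarded
```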
How to Prevent Time-dependent Data Leakage
To prevent time-dependent data leakage, always follow best practices while preparing your features and creating a training dataset. If a feature or raw data is time-sensitive, always carry the timestamp information forward to the next steps. When you apply operations across multiple data points with different timestamps, check whether the operation is safe to apply. It's like a pearl of financial wisdom: a dollar today is not the same as a dollar tomorrow, because the value of money decays or grows over time. Lastly, when you create a training dataset, make sure you join it using a point-in-time correct join.
Have you ever experienced data leakage? Are you interested in learning about other types of data leakage? Share your thoughts in the comments.
Be sure to use the proper point-in-time syntax when creating a training dataset.