Data Leakage for Time-dependent Data and Features in Machine Learning
What’s data leakage and why should I care?
Wikipedia defines data leakage as:
"In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment." — Wikipedia
Today we will focus on one kind of data leakage: time-dependent data leakage. In machine learning, the label event (or observation) happens at a certain timestamp in the past. Time-dependent data leakage occurs when model training uses data from after that timestamp, i.e., future data.
Time-dependent data leakage will cause serious problems for your project. You spend time developing and ramping a model; its training performance is superb, but its actual performance in production is bad. Unlike other mistakes that can be caught during model training, it's usually not until production inference time that you figure out the model doesn't work. Bringing a model to production is time-consuming, and you need to wait quite a bit longer to collect A/B testing metrics before the problem surfaces. The time wasted from development to ramp is usually months. Machine learning teams new to this problem need to spend extra weeks or even months studying the root cause. The opportunity cost is high indeed.
Symptoms and Data Leakage Examples
Time-dependent data leakage is extremely hard to detect. The typical symptom is that your training result looks good (usually extremely good) but inference accuracy turns out bad (and usually really bad).
I have collected some time-dependent data leakage examples. Some of them are so sneaky that I could barely detect them at first sight. (The examples are simplified for readability.)
Example 1: Online User Activity Data
E-commerce websites collect user activity data, like pages you have viewed, products you have rated, etc. This activity data is timestamped. Consider an e-commerce website that has users' product ratings, users' profile data, and their purchase quantity history for each product.
If we want to train a machine learning model to predict the product rating, we can use age, balance, and purchase quantity as features. The purchase quantity is a time-dependent feature. If we simply use the latest purchase quantity, whose purchase timestamp may be later than the label's event timestamp, we are leaking future information. Look at user_id 1: the label timestamp is 2022-03-01. Two purchase events are related to this user id, one on 2022-01-02 with a quantity of 1 and one on 2022-04-01 with a quantity of 200. If we use both, or just the latest, then we are leaking future data (2022-04-01) into our training.
At best, it's a useless feature. Worse, it may produce good model results during training while production inference is bad. For instance, a high product rating may indicate a later re-purchase with a higher quantity, and machine learning models are smart enough to pick up such intricate relationships.
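To see the leak concretely, here is a minimal pandas sketch (hypothetical column names, not any website's actual schema): naively taking each user's latest purchase silently pulls in the 2022-04-01 row.

```python
import pandas as pd

# Hypothetical purchase history for user_id 1 from Example 1.
purchases = pd.DataFrame({
    "user_id": [1, 1],
    "purchase_ts": pd.to_datetime(["2022-01-02", "2022-04-01"]),
    "quantity": [1, 200],
})

# The rating label for user_id 1 was observed on 2022-03-01.
label_ts = pd.Timestamp("2022-03-01")

# LEAKY: "latest purchase quantity" ignores the label timestamp,
# so it picks up the 2022-04-01 purchase (quantity=200) -- an
# event from the label's future.
leaky = (purchases.sort_values("purchase_ts")
                  .groupby("user_id")["quantity"]
                  .last())
print(leaky)  # user_id 1 -> 200 (future leak)
```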
Example 2: Bank Loan with FICO Score
We want to build a machine learning model to predict whether we should issue a loan to a client. We collected some data, as well as the latest FICO score, as our features. The training accuracy is fantastic, but the inference accuracy on real data sinks. It turns out the latest FICO score is newer than the issuance of the loan, so it is actually from the future. The latest FICO score is impacted by the issuance of the loan: if a loan is issued to a client, the client's FICO score usually goes down shortly afterward. At inference time, however, this information doesn't exist at all; we only have FICO scores from before the loan issuance. The right approach is to use the FICO score right before the issuance of the loan instead of the LATEST one during model training.
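One way to implement "the FICO score right before issuance" is an as-of join. A minimal sketch using pandas.merge_asof, with hypothetical column names (client_id, issue_ts, fico_ts, fico):

```python
import pandas as pd

# Hypothetical loan applications (one row per issued loan).
loans = pd.DataFrame({
    "client_id": [42],
    "issue_ts": pd.to_datetime(["2022-06-15"]),
}).sort_values("issue_ts")

# Hypothetical FICO history; the score drops after the loan is issued.
fico_history = pd.DataFrame({
    "client_id": [42, 42],
    "fico_ts": pd.to_datetime(["2022-05-01", "2022-07-01"]),
    "fico": [710, 650],
}).sort_values("fico_ts")

# direction="backward" attaches, for each loan, the most recent FICO
# score at or before issue_ts -- never a future one. Pass
# allow_exact_matches=False to require a strictly earlier score.
train = pd.merge_asof(
    loans,
    fico_history,
    left_on="issue_ts",
    right_on="fico_ts",
    by="client_id",
    direction="backward",
)
print(train[["client_id", "issue_ts", "fico"]])  # fico == 710, not 650
```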
Example 3: Imputing Missing Values for Time-dependent Data
Imagine we have a feature preprocessing pipeline that imputes missing time-dependent FICO scores. We simply take an average of all FICO scores and assign the preprocessing time as the timestamp for the newly imputed data. Here we have used future data to compute the average, and the imputed rows carry a timestamp long after the original observations. This is also a kind of data leakage.
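A leakage-safe alternative is to impute each missing score from past observations only and keep the original observation timestamps. A minimal sketch, assuming a single client's FICO time series (hypothetical data):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01"]),
    "fico": [700.0, np.nan, 680.0, np.nan],
}).sort_values("ts")

# LEAKY: a global mean averages over future scores as well.
leaky = scores["fico"].fillna(scores["fico"].mean())

# SAFE: an expanding mean only sees values up to each row's
# timestamp, so every imputed value uses past information only.
past_mean = scores["fico"].expanding().mean()
safe = scores["fico"].fillna(past_mean)
print(safe)  # 2022-02-01 -> 700.0, 2022-04-01 -> 690.0
```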
What Causes Time-dependent Data Leakage
A typical machine learning training dataset consists of a label and its corresponding features. There are two types of features: non-time-dependent features and time-dependent features. A time-dependent feature should only contain information preceding the timestamp when the label happened. For example, if the label event happens at t1, then the features should only contain information from before t1. If the feature mixes in information later than t1, then it's a future leak. When you join the training dataset this way, we call it a point-in-time correct join.
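As a cheap guardrail, you can also assert point-in-time correctness on an assembled training set. A minimal sketch, assuming each training row carries both a feature timestamp and a label timestamp (hypothetical column names):

```python
import pandas as pd

def assert_point_in_time_correct(df: pd.DataFrame,
                                 feature_ts_col: str,
                                 label_ts_col: str) -> None:
    """Fail fast if any feature row is newer than its label event (t1)."""
    leaked = df[df[feature_ts_col] > df[label_ts_col]]
    if not leaked.empty:
        raise ValueError(
            f"Future leak: {len(leaked)} row(s) have {feature_ts_col} "
            f"later than {label_ts_col}"
        )
```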
Here is a simple illustration of a point-in-time join (courtesy of https://github.com/linkedin/feathr/blob/main/docs/concepts/point-in-time-join.md). You can see it always joins feature data that is older than the label data during model training.
Let's apply a point-in-time correct join to our first example. For user_id 1, we only join feature data that is older than 2022-03-01 and discard the rest (here we discard the 2022-04-01 purchase with a quantity of 200). The same goes for other user ids.
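To make the process concrete, here is a minimal pandas sketch of a point-in-time join using the hypothetical Example 1 data: join on the entity key, drop feature rows that are newer than the label, then keep the most recent remaining row per label.

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1],
    "label_ts": pd.to_datetime(["2022-03-01"]),
    "rating": [5],
})

purchases = pd.DataFrame({
    "user_id": [1, 1],
    "purchase_ts": pd.to_datetime(["2022-01-02", "2022-04-01"]),
    "quantity": [1, 200],
})

# 1. Join labels to all purchases of the same user.
joined = labels.merge(purchases, on="user_id")

# 2. Point-in-time filter: keep only purchases older than the label.
past = joined[joined["purchase_ts"] < joined["label_ts"]]

# 3. Keep the most recent surviving purchase per label row.
idx = past.groupby(["user_id", "label_ts"])["purchase_ts"].idxmax()
train = past.loc[idx]
print(train)  # quantity == 1; the 2022-04-01 row was discarded
```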
How to Prevent Time-dependent Data Leakage
To prevent time-dependent data leakage, always follow best practices while preparing your features and creating a training dataset. If a feature or raw data is time-sensitive, always carry the timestamp information forward to the next steps. When you apply operations across multiple data points with different timestamps, check whether the operation is safe to apply. It's like a pearl of financial wisdom: a dollar today is not the same as a dollar tomorrow, because the value of money decays or grows over time. Lastly, when you create a training dataset, make sure you join it using a point-in-time correct join.
Have you ever experienced data leakage? Are you interested in learning about other types of data leakage? Share your thoughts in the comments.
Be sure to use the proper point-in-time syntax when creating a training dataset.