Introduction to Feature Engineering for Time Series Forecasting

Time series forecasting is one of the most common and fundamental tasks solved with machine learning techniques, even in the era of Generative Artificial Intelligence (GenAI) revolutionizing people's workflows. The most important and highest-impact part of time series modeling is feature engineering: it takes up the most time and is the most effective way to maximize forecasting accuracy.

Below is what I have learned and used so far for time series forecasting in my data journey.

I. Definitions

1. Time series forecasting

Time series forecasting is a technique that uses historical, current, and even known future data (for example, scheduled holidays or planned promotions) to predict values over a period of time or at a specific point in the future. With machine learning, time series forecasting is typically framed as a supervised learning problem.

Time series forecasting example

2. Observation

In machine learning, an observation refers to a single instance of data in a dataset. Each observation consists of several features and, in supervised learning, a target label.

In supervised learning, models use observations to learn the relationships between features and the target variable. The quality and quantity of observations significantly impact the model's performance.

Observation example in table form

3. Feature

In the context of machine learning, a feature (also known as a variable or attribute) is an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm. Data types of features can be numerical, categorical, or text-based, and they represent different aspects of the data that are relevant to the problem.

Feature example: Number Peaks, Median, Mean, Min

4. Feature Engineering

Feature Engineering is the process of creating new features or transforming existing ones to improve the performance of a machine learning model. It involves selecting relevant information from raw data and transforming it into a format a model can easily consume. Feature data often arrives as multiple rows, or spread across several tables, for a single observation. Flattening the data so that one observation equals one row with multiple feature columns enables the model to ingest and train on it. A minimal sketch of this flattening appears after the figure below.

Feature engineering process combines multiple columns & tables into a single, flat table
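
As a quick illustration, here is a minimal flattening sketch in polars (the library used throughout this article); the transactions table, its columns, and the aggregations are hypothetical examples:

# Flattening sketch: many rows per observation -> 1 row with feature columns

import polars as pl

# hypothetical raw table: several transaction rows per customer (1 observation = 1 customer)
transactions = pl.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [10.0, 25.0, 5.0, 100.0, 40.0],
})

# flatten: one row per observation, with aggregated feature columns
flat = transactions.group_by("customer_id").agg(
    total_amount=pl.col("amount").sum(),
    mean_amount=pl.col("amount").mean(),
    n_transactions=pl.col("amount").count(),
)

flat.head(2)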

The quality of machine learning models heavily depends on the quality of the features used to train them. Combining or transforming the existing data to create new features helps to highlight the most important patterns and relationships in the data. As a result, machine learning models will learn from the data more effectively.

Feature engineering steps example

II. Time Series Feature Engineering

1. Time-based features:

Features created from time-related data:

  • Holidays, number of days until the next holiday or since the last one…
  • Weekdays, weekends
  • Seasons of year
  • Sales days (Black Friday, 11/11…)
  • Day, month, quarter, year of date
  • Hour, minute, second of time
  • Part of the day: morning, noon, afternoon, evening, night…

These features are almost always available at any point in time and do not change over time. The exception is sales days: apart from the big ones, not all sales days are known beforehand; sometimes they are scheduled only depending on the recent sales situation.

Time-based features are quite simple to generate. Here is a code example; a short sketch after it also covers seasons of the year and parts of the day:

Data Sample
# Simple Time-based Features

from pathlib import Path

import numpy as np
import polars as pl
import holidays

# parent_dir points to the project root; adjust to your environment
parent_dir = Path.cwd()

df = pl.read_excel(parent_dir / 'data/test_data_2020_202208.xlsx')

# years in data
df_prep = df.with_columns(year=pl.col('date').dt.year())

years = list(df_prep['year'].unique())

# ====================================
# holiday in selected years
vn_holidays = holidays.VN(years=years)

df_prep = df_prep.with_columns(is_holiday=pl.col('date').is_in(list(vn_holidays.keys())).cast(pl.Int8))


# ====================================
# weekdays, weekends (polars dt.weekday(): Monday=1 … Sunday=7)
df_prep = df_prep.with_columns(weekday=pl.col('date').dt.weekday().cast(pl.Int8))
df_prep = df_prep.with_columns(is_weekend=(pl.col('weekday') >= 6).cast(pl.Int8))


# ====================================
# Day, month, quarter, year of date
df_prep = df_prep.with_columns(day=pl.col('date').dt.day().cast(pl.Int8))
df_prep = df_prep.with_columns(month=pl.col('date').dt.month().cast(pl.Int8))
df_prep = df_prep.with_columns(quarter=pl.col('date').dt.quarter().cast(pl.Int8))


# ====================================
# Sales days (Black Friday, 11/11…)
# you can go crazy with the logic for sales days: either keep a fixed list of dates, or write logic such as "last Friday of the month"...
df_prep = df_prep.with_columns(special_day=(pl.col('day') == pl.col('month')).cast(pl.Int8)) # flags dates like 1/1, 11/11 where day == month

df_prep.head(14)        
Simple Time-based Features
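
The block above covers the calendar items; the list also mentions seasons and parts of the day. Here is a minimal sketch for those two, assuming a datetime column named 'ts' (the column name and bucket boundaries are illustrative assumptions):

# Season and Part-of-day Features (sketch)

from datetime import datetime
import polars as pl

df_time = pl.DataFrame({
    "ts": pl.datetime_range(datetime(2022, 1, 1), datetime(2022, 1, 2), interval="6h", eager=True)
})

df_time = df_time.with_columns(
    # meteorological season from the month: 1 = winter (Dec-Feb), 2 = spring, 3 = summer, 4 = autumn
    season=(pl.col("ts").dt.month() % 12 // 3 + 1).cast(pl.Int8),
    # coarse part-of-day buckets from the hour
    part_of_day=pl.when(pl.col("ts").dt.hour() < 6).then(pl.lit("night"))
    .when(pl.col("ts").dt.hour() < 12).then(pl.lit("morning"))
    .when(pl.col("ts").dt.hour() < 18).then(pl.lit("afternoon"))
    .otherwise(pl.lit("evening")),
)

df_time.head(5)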
# More Complex Time-based Features

import numpy as np
import polars as pl
import holidays

# ====================================
# days to next holiday
# days since last holiday

# add the following year, so dates near the end of the data still have a future holiday
years_adj = years.copy()

years_adj.append(years_adj[-1] + 1)

# add the previous year, so dates near the start of the data still have a past holiday
years_adj.append(years_adj[0] - 1)

years_adj.sort()

print(years_adj)

vn_holidays = holidays.VN(years=years_adj)
holidays_array = np.array(list(vn_holidays.keys()))

# Convert date to numerical format for vectorized operations
def date_to_numeric(date):
    days = np.datetime64(date) - np.datetime64('1970-01-01')  # days since epoch
    return days.astype('timedelta64[D]') / np.timedelta64(1, 'D')

# Function to compute nearest holidays
def compute_nearest_holidays(df):    
    # Convert DataFrame dates to numpy array
    dates = df['date'].to_numpy()
    
    # Convert dates to numeric format (days since epoch)
    dates_numeric = np.array([date_to_numeric(date) for date in dates])
    holidays_numeric = np.array([date_to_numeric(date) for date in holidays_array])
    
    # Compute differences
    diff_matrix = dates_numeric[:, None] - holidays_numeric

    # Mask for past and future holidays
    past_mask = diff_matrix >= 0
    future_mask = diff_matrix < 0

    # Handle past holidays
    past_diff_matrix = np.where(past_mask, diff_matrix, np.inf)
    nearest_past_days = np.min(past_diff_matrix, axis=1)

    # Handle future holidays
    future_diff_matrix = np.where(future_mask, -diff_matrix, np.inf)
    nearest_future_days = np.min(future_diff_matrix, axis=1)

    # Create result DataFrame
    result_df = df.with_columns([
        pl.Series(name="days_to_nearest_past", values=nearest_past_days).cast(pl.Int16),
        pl.Series(name="days_to_nearest_future", values=nearest_future_days).cast(pl.Int16)
    ])
    
    return result_df

# Apply the function
df_prep = compute_nearest_holidays(df_prep)

df_prep.head(10)        
Days since last / days to next holiday features
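
As an alternative to the numpy distance matrix above, polars' built-in join_asof can sketch the same two features, assuming 'date' is a pl.Date column and vn_holidays comes from the block above (.dt.total_days() requires a recent polars version):

# Alternative sketch: nearest past/future holiday via join_asof

import polars as pl
import polars.selectors as cs

hol = pl.DataFrame({"hol_date": pl.Series(list(vn_holidays.keys()), dtype=pl.Date)}).sort("hol_date")

df_prep = (
    df_prep.sort("date")
    # last holiday on or before each date
    .join_asof(hol.with_columns(past_holiday=pl.col("hol_date")),
               left_on="date", right_on="hol_date", strategy="backward")
    # first holiday on or after each date
    .join_asof(hol.with_columns(future_holiday=pl.col("hol_date")),
               left_on="date", right_on="hol_date", strategy="forward")
    .with_columns(
        days_since_holiday=(pl.col("date") - pl.col("past_holiday")).dt.total_days().cast(pl.Int16),
        days_to_holiday=(pl.col("future_holiday") - pl.col("date")).dt.total_days().cast(pl.Int16),
    )
    .drop("past_holiday", "future_holiday")
    .drop(cs.starts_with("hol_date"))  # drop leftover join keys, if your polars version keeps them
)

df_prep.head(10)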

2. Prediction results of other related factors:

Weather forecasts, forecast revenue of related products, and forecast production volume are some examples of prediction results for factors related to the observations. Another type of prediction result that can be used as a feature is the predicted value of the time series itself. Besides using the forecast results directly, aggregating them is also a good option to give the main time series model more aspects of the data pattern to learn.

Predicted values as features

The time series model may come to depend heavily on these features because of their similar characteristics. That dependence can be very risky: forecast results always contain errors, and bad forecasts will drag down the accuracy of the main model.

Skforecast is a useful Python library that automates feeding lagged values or predicted results into the model as features. Here is a visualization of how skforecast organizes the data: Introduction to forecasting - Skforecast Docs
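
Since this section has no example above, here is a minimal sketch of the idea; the weather table, column names, and values are hypothetical. Another model's forecast is joined onto the observations by date, and also aggregated over the next 7 days as a smoother signal:

# Prediction results of other factors as features (sketch; names and values are hypothetical)

from datetime import date
import polars as pl

# hypothetical external forecast, e.g. from a weather model
weather = pl.DataFrame({
    "date": pl.date_range(date(2022, 9, 1), date(2022, 9, 14), eager=True),
    "temp_forecast": [28.0 + 0.3 * i for i in range(14)],
})

obs = pl.DataFrame({
    "date": pl.date_range(date(2022, 9, 1), date(2022, 9, 7), eager=True),
    "label": [100, 102, 98, 105, 110, 95, 99],
})

# 1. raw forecast as a feature: simple left join on date
obs = obs.join(weather, on="date", how="left")

# 2. aggregated forecast: mean forecast temperature over the next 7 days (t .. t+6)
weather_agg = weather.sort("date").with_columns(
    temp_forecast_next_7d_mean=pl.col("temp_forecast").rolling_mean(window_size=7).shift(-6)
)
obs = obs.join(weather_agg.select("date", "temp_forecast_next_7d_mean"), on="date", how="left")

obs.head(7)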

3. Lagged values:

Shifting values straight from the past into the current observation as features is an effective way to help a machine learning model capture past trends and seasonality when predicting future data. The further in the past a lagged value lies, the older the trend the model refers to. Old trends might not suit current or future situations, so finding the proper range of lagged values takes several rounds of experimentation.

Examples:

  • Values of last 2, 7, 14, 28, 30... days.
  • Values of day t-1, t-4, t-7, t-10…
  • Values of the same date in the previous 1, 2, 3… months or quarters or years.
  • Values of the same hour in the previous 1, 2, 3… days.

Values of last 4 days as features

Code example:

import polars as pl

df_prep = df.sort(by='date', descending=False)

# ====================================
# Values of last 2, 7, 14, 28, 30... days.
lags = 30 # do not use data from the first 30 days for training; there will be a bunch of null values in the lag columns.
lag_step = 3 # values of day t-1, t-4, t-7, t-10…

lag_columns = []
value_columns = ['x_1', 'x_2', 'label']

for c in value_columns:
    for i in range(1, lags+1, lag_step):
        lag_columns.append(pl.col(c).shift(i).alias(f"{c}_lag_{i}"))

df_prep = df_prep.with_columns(lag_columns)

df_prep.tail(5)        
Values of day t-1, t-4, t-7, t-10…
import polars as pl
import polars.selectors as cs

# ====================================
# Values of the same date in the previous 1, 2, 3… months or quarters or years.
# Example: date = 2022-09-01 -> same date in the previous 1 month = 2022-08-01
# Month-end dates do not always have an exact counterpart in the previous month. Fill in with the previous or next date: date = 2022-07-31 -> same date in the previous 1 month = 2022-06-30 (previous) or 2022-07-01 (next).
# The choice depends on the characteristics of the data and the purpose of forecasting.
prev_month = 3 # do not use data from the first 3 months for training; there will be a bunch of null values in the prev-month columns.
prev_step = 1

###############
# Version: Using previous date to fill in

# get previous month date for joining
prev_columns = []

for i in range(1, prev_month+1, prev_step):
    # by default: get previous date to fill in: 2022-06-30 for 2022-07-31
    prev_columns.append(pl.col("date").dt.offset_by(f"-{i}mo").alias(f"prev_{i}_month"))
    
    # https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.offset_by.html
    # # same date last week
    # prev_columns.append(pl.col("date").dt.offset_by(f"-{i}w").alias(f"prev_{i}_week"))
    
    # # same date last quarter
    # prev_columns.append(pl.col("date").dt.offset_by(f"-{i}q").alias(f"prev_{i}_quarter"))
    
    # # same date last year
    # prev_columns.append(pl.col("date").dt.offset_by(f"-{i}y").alias(f"prev_{i}_year"))
    
df_prep = df.with_columns(prev_columns)

# get previous month data by self-joining with date = prev_{i}_month
value_columns = ['x_1', 'x_2', 'label']

for i in range(1, prev_month+1, prev_step):
    
    value_columns_rename = {'date': 'date_join'}
    
    for c in value_columns:
        value_columns_rename[c] = f"{c}_prev_{i}_month"
    
    df_prep = df_prep.join(df_prep.select(['date'] + value_columns).rename(value_columns_rename),
                           left_on=f"prev_{i}_month",
                           right_on='date_join', # by default polars does not add the right_on column to the result
                           how="left")

# # delete columns with name starts with "prev_"
# df_prep = df_prep.drop(cs.starts_with("prev_"))

df_prep.slice(45, 10)        
Values of the same date in the previous 1, 2, 3 months
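
The examples above assume a single series. With multiple series (several stores, products…), compute lags within each series so values never leak across them; here is a minimal sketch with a hypothetical store_id column and toy data:

# Lags per series with .over() (sketch; 'store_id' and the toy data are hypothetical)

from datetime import date
import polars as pl

dates = pl.date_range(date(2022, 1, 1), date(2022, 1, 4), eager=True)

df_multi = pl.DataFrame({
    "store_id": ["A"] * 4 + ["B"] * 4,
    "date": pl.concat([dates, dates]),
    "label": [10, 12, 11, 13, 100, 98, 103, 99],
})

# shift within each store, so store B never sees store A's values
df_multi = df_multi.sort("store_id", "date").with_columns(
    pl.col("label").shift(i).over("store_id").alias(f"label_lag_{i}") for i in (1, 2)
)

df_multi.head(8)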

4. Aggregated values:

Sometimes using the values as they are is not enough; transforming them into something else can improve the model further. Aggregation can be applied to lagged values and to the prediction results of other related factors:

  • Average growth of the last 2, 7, 14, 28, 30... days.
  • Rolling Mean, max, min, median, standard deviation… of days t-1, t-4, t-7, t-10…

Rolling Mean, max of lagged days values

  • Rolling mean, max, min, median, standard deviation… of the same date in the previous 1, 2, 3… months or quarters or years....

Median of same date in previous months

Code example:

import polars as pl
import polars.selectors as cs

# ====================================
# Average growth of the last 2, 7, 14, 28, 30... days.
growth_period = 28 # growth over the last 28 days. Must match the number of days in the lagged values above.

value_columns = ['x_1', 'x_2', 'label']

for c in value_columns:
    df_prep = df_prep.with_columns(((pl.col(f"{c}_lag_1") - pl.col(f"{c}_lag_{growth_period}")) / pl.col(f"{c}_lag_{growth_period}")).alias(f"{c}_growth_rate_{growth_period}_days"))


# ====================================
# Rolling mean, max, min, median, standard deviation… of days t-1, t-4, t-7, t-10…, t-25, t-28
for c in value_columns:
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.mean().alias(f"{c}_rolling_mean"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.min().alias(f"{c}_rolling_min"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.max().alias(f"{c}_rolling_max"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.median().alias(f"{c}_rolling_median"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.std().alias(f"{c}_rolling_std"))

    # for easy showing result
    df_prep = df_prep.drop(cs.starts_with(f"{c}_lag"))

df_prep.tail(10)        
Average growth of the last 28 days + Rolling Mean, max, min, median, standard deviation… of days t-1, t-4, t-7, t-10…
import polars as pl
import polars.selectors as cs

# ====================================
# Rolling mean, max, min, median, standard deviation… of the same date in the previous 1, 2, 3… months or quarters or years....
for c in value_columns:
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.mean().alias(f"{c}_same_date_rolling_mean"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.min().alias(f"{c}_same_date_rolling_min"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.max().alias(f"{c}_same_date_rolling_max"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.median().alias(f"{c}_same_date_rolling_median"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.std().alias(f"{c}_same_date_rolling_std"))

    # for easy showing result
    df_prep = df_prep.drop(cs.starts_with(f"{c}_prev_"))
        
# delete columns with name starts with "prev_"
df_prep = df_prep.drop(cs.starts_with("prev_"))

df_prep.head(10)        
Rolling mean, max, min, median, standard deviation… of the same date in the previous 1, 2, 3 months
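
A design note: instead of materializing lag columns and aggregating across them, polars can compute rolling statistics over past values directly. Here is a minimal sketch, assuming one series sorted by date as in the examples above; note the shift(1) first, so each window sees only rows strictly before the current one and the label never leaks into its own feature. Unlike the strided lags above (every 3rd day), this version uses every day in the window:

# Alternative sketch: rolling statistics straight from the value columns

import polars as pl

value_columns = ['x_1', 'x_2', 'label']

df_prep = df_prep.sort('date').with_columns(
    [pl.col(c).shift(1).rolling_mean(window_size=28).alias(f"{c}_rolling_mean_28d") for c in value_columns]
    + [pl.col(c).shift(1).rolling_std(window_size=28).alias(f"{c}_rolling_std_28d") for c in value_columns]
)

df_prep.tail(5)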

5. Automated features:

There are various Python libraries that can generate features automatically, quickly, and efficiently. However, not all libraries are equal: some generate distinctive features but run slowly or are difficult to learn. Here are some libraries with brief descriptions; a minimal tsfresh sketch follows the list:

  • Tsflex: Written in Python, tsflex (flexible time series) is an open-source library for feature extraction on time series. In published benchmarks it is the second-fastest tool after getML, and unlike getML it is written 100% in Python. tsflex is flexible in that it makes only a few assumptions about sequence data. Link: https://github.com/predict-idlab/tsflex
  • Tsfresh: The name is an acronym for "Time Series Feature Extraction based on Scalable Hypothesis Tests." It is a Python package that automatically calculates and extracts a large number of time series features for classification and regression tasks. It is mainly used for feature engineering in time series problems, with packages like scikit-learn handling the modeling downstream. It has been found to be a fairly memory-hungry tool that also lags in time efficiency. Link: https://github.com/blue-yonder/tsfresh
  • Kats: Kats (Kits to Analyze Time Series) is an open-source Python library developed by researchers at Facebook (now Meta). It is easy to use and helpful for time series problems thanks to its lightweight, generic time series analysis toolkit, which lets you set up models quickly without spending much time on processing time series and on calculations across different models. Link: https://github.com/facebookresearch/Kats
  • featuretools: featuretools is an open-source Python framework for automated feature engineering. It creates features from temporal and relational datasets for machine learning. Link: https://github.com/alteryx/featuretools
  • TSFEL: Time Series Feature Extraction Library (TSFEL for short) is a Python package for feature extraction on time series data. TSFEL automatically extracts over 60 different features across the statistical, temporal, and spectral domains. It has shown average performance for runtime per feature and memory usage. Link: https://github.com/fraunhoferportugal/tsfel
  • seglearn: seglearn is an extension of the scikit-learn Python library for multivariate, sequential time series data. Though seglearn starts with relatively low memory consumption, it soon ends up occupying a large amount of RAM. Link: https://github.com/dmbee/seglearn
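
As a small taste of these tools, here is a minimal tsfresh sketch; the toy data are hypothetical, and tsfresh expects a long-format pandas DataFrame with one id per series or observation window:

# Minimal tsfresh sketch (pip install tsfresh)

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# long format: one id per series / observation window
long_df = pd.DataFrame({
    "id": [1] * 5 + [2] * 5,
    "time": list(range(5)) * 2,
    "value": [1.0, 2.0, 3.0, 2.0, 1.0, 5.0, 6.0, 5.0, 7.0, 6.0],
})

# one row of features per id; MinimalFCParameters keeps the run fast
features = extract_features(
    long_df,
    column_id="id",
    column_sort="time",
    default_fc_parameters=MinimalFCParameters(),
)

features.head()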

III. Conclusion:

This article covered brief definitions of time series forecasting, observations, and features.

It also covered several ways of creating time series features:

  1. Time-based features
  2. Prediction results of other related factors
  3. Lagged values
  4. Aggregated values
  5. Automated features

There is no limit to creativity in feature engineering. It will consume a lot of time in a trial-and-error loop: create new features, put them through training, validate the model results, and repeat.

Full code example in the article: duongtruongtrong/time_series_feature_engineering: Code cheatsheets for Time series feature engineering

Reference

  1. What Is Time-Series Forecasting?
  2. Interpretable Deep Learning for Time Series Forecasting
  3. What is Feature Engineering? - GeeksforGeeks
  4. ML Glossary: Observation
  5. Skforecast: Time series forecasting with python and scikit learn
  6. Feature Engineering for Time Series
