Introduction to Feature Engineering for Time Series Forecasting
Time series forecasting is one of the most common tasks solved with machine learning techniques, even in the era of Generative Artificial Intelligence (GenAI) revolutionizing people's workflows. Feature engineering is the highest-impact part of time series modeling: it takes up the most time, and it is the most effective way to maximize forecasting accuracy.
Below is what I have learnt and used so far for time series forecasting in my data journey.
I. Definitions
1. Time series forecasting
Time series forecasting is a technique that utilizes historical data, current data, and even data already known about the future (such as scheduled holidays) to predict future values over a period of time or at a specific point in the future. With machine learning, time series forecasting is framed as a supervised learning problem.
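To make the supervised framing concrete, a univariate series can be reframed into (features, target) pairs with a sliding window. This is a toy sketch; the values and the window size are invented for illustration:

```python
# Sketch: framing a univariate series as a supervised learning problem.
# Each row's features are the previous `window` values; the target is the next value.
series = [10, 12, 13, 15, 18, 21, 25]
window = 3

X, y = [], []
for i in range(window, len(series)):
    X.append(series[i - window:i])  # features: the 3 preceding values
    y.append(series[i])             # target: the current value

print(X[0], y[0])  # [10, 12, 13] 15
```

Each (X, y) pair is one observation, which is exactly the shape a supervised learner expects.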
2. Observation
In machine learning, an observation refers to a single instance of data in a dataset. Each observation consists of several features and, in supervised learning, a target label.
In supervised learning, models use observations to learn the relationships between features and the target variable. The quality and quantity of observations significantly impact the model's performance.
3. Feature
In the context of machine learning, a feature (also known as a variable or attribute) is an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm. Data types of features can be numerical, categorical, or text-based, and they represent different aspects of the data that are relevant to the problem.
4. Feature Engineering
Feature Engineering is the process of creating new features or transforming existing features to improve the performance of a machine learning model. It involves selecting relevant information from raw data and transforming it into a format that a model can easily understand. Feature data often arrives as multiple rows, or spread across different tables, for one observation. Flattening the data so that 1 observation = 1 row with multiple feature columns enables the model to ingest and train on it.
The quality of machine learning models heavily depends on the quality of the features used to train them. Combining or transforming the existing data to create new features helps to highlight the most important patterns and relationships in the data. As a result, machine learning models will learn from the data more effectively.
II. Time Series Feature Engineering
1. Time-based features:
Features created from time-related data:
These features are almost always available at any point in time and do not change retroactively. Sales days are the exception: apart from the big, fixed ones, not all sales days are known beforehand; sometimes a sales day is scheduled depending on the recent sales situation.
It is quite simple to generate time-based features. Here is a code example:
# Simple Time-based Features
import numpy as np
import polars as pl
import holidays
df = pl.read_excel(parent_dir / 'data/test_data_2020_202208.xlsx')
# years in data
df_prep = df.with_columns(year=pl.col('date').dt.year())
years = list(df_prep['year'].unique())
# ====================================
# holiday in selected years
vn_holidays = holidays.VN(years=years)
df_prep = df_prep.with_columns(is_holiday=pl.col('date').is_in(list(vn_holidays.keys())).cast(pl.Int8))
# ====================================
# weekdays, weekends
df_prep = df_prep.with_columns(weekday=pl.col('date').dt.weekday().cast(pl.Int8))  # Monday = 1 … Sunday = 7
# ====================================
# Day, month, quarter, year of date
df_prep = df_prep.with_columns(day=pl.col('date').dt.day().cast(pl.Int8))
df_prep = df_prep.with_columns(month=pl.col('date').dt.month().cast(pl.Int8))
df_prep = df_prep.with_columns(quarter=pl.col('date').dt.quarter().cast(pl.Int8))
# ====================================
# Sales days (Black Friday, 11/11…)
# You can go crazy with the logic of sales days: either keep a fixed list of dates,
# or write a rule such as "the last Friday of the month".
df_prep = df_prep.with_columns(special_day=(pl.col('day') == pl.col('month')).cast(pl.Int8))  # flags dates like 1/1, 2/2, … 11/11, 12/12
df_prep.head(14)
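The "last Friday of the month" rule mentioned in the comment above can be sketched with the standard library alone. This is a hedged sketch: `is_last_friday` is a hypothetical helper, not part of the article's code, but the rule it implements is simple and version-proof:

```python
from datetime import date, timedelta

def is_last_friday(d: date) -> bool:
    """True if d is the last Friday of its month.

    A Friday is the last one exactly when the Friday 7 days later
    falls in a different month.
    """
    return d.weekday() == 4 and (d + timedelta(days=7)).month != d.month

print(is_last_friday(date(2022, 8, 26)))  # True: the last Friday of August 2022
print(is_last_friday(date(2022, 8, 19)))  # False: another Friday follows in August
```

The same trick works for "last Monday", "last working day", and similar calendar rules by changing the weekday check.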
# More Complex Time-based Features
import numpy as np
import polars as pl
import holidays
# ====================================
# Features: days since the nearest past holiday and days until the nearest future holiday.
# Extend the year range on both sides so dates at the very start or end of the
# data can still see the neighbouring years' holidays.
years_adj = years.copy()
years_adj.append(years_adj[-1] + 1)  # the next year, for dates at the end of the data
years_adj.append(years_adj[0] - 1)  # the previous year, for dates at the start of the data
years_adj.sort()
print(years_adj)
print(years_adj)
vn_holidays = holidays.VN(years=years_adj)
holidays_array = np.array(list(vn_holidays.keys()))
# Convert date to numerical format for vectorized operations
def date_to_numeric(date):
    days = np.datetime64(date) - np.datetime64('1970-01-01')  # days since epoch
    return days.astype('timedelta64[D]') / np.timedelta64(1, 'D')
# Function to compute nearest holidays
def compute_nearest_holidays(df):
    # Convert DataFrame dates to a numpy array
    dates = df['date'].to_numpy()
    # Convert dates to numeric format (days since epoch)
    dates_numeric = np.array([date_to_numeric(date) for date in dates])
    holidays_numeric = np.array([date_to_numeric(date) for date in holidays_array])
    # Pairwise differences: rows = observation dates, columns = holidays
    diff_matrix = dates_numeric[:, None] - holidays_numeric
    # Masks for past and future holidays
    past_mask = diff_matrix >= 0
    future_mask = diff_matrix < 0
    # Nearest past holiday: smallest non-negative difference
    past_diff_matrix = np.where(past_mask, diff_matrix, np.inf)
    nearest_past_days = np.min(past_diff_matrix, axis=1)
    # Nearest future holiday: smallest distance ahead
    future_diff_matrix = np.where(future_mask, -diff_matrix, np.inf)
    nearest_future_days = np.min(future_diff_matrix, axis=1)
    # Attach the results as new columns
    result_df = df.with_columns([
        pl.Series(name="days_to_nearest_past", values=nearest_past_days).cast(pl.Int16),
        pl.Series(name="days_to_nearest_future", values=nearest_future_days).cast(pl.Int16)
    ])
    return result_df
# Apply the function
df_prep = compute_nearest_holidays(df_prep)
df_prep.head(10)
2. Prediction results of other related factors:
Weather forecasts, forecast revenue of related products, forecast production volume, and so on are examples of prediction results of factors related to the observations. The predicted values of the time series itself can also be used as features. Besides using the forecast results directly, aggregating them is also a good option, giving the main time series model more aspects of the data pattern to learn from.
There is a possibility that the time series model will depend heavily on these features due to their similar characteristics. That can be very risky, because errors always occur in forecast results: bad forecasts of the input factors will lower the accuracy of the main model.
Skforecast is a useful Python library that automates the task of putting lagged values or predicted results into the model as features. Here is a visualization of how skforecast organizes the data: Introduction to forecasting - Skforecast Docs
3. Lagged values:
Shifting values straight from the past into the current observation as features is an effective way for a machine learning model to capture past trends and seasonality when predicting future data. The further back a lagged value reaches, the older the trend the model refers to. Old trends might not suit current or future situations, so finding the proper range of lagged values requires several test runs.
Examples:
import polars as pl
df_prep = df.sort(by='date', descending=False)
# ====================================
# Values of last 2, 7, 14, 28, 30... days.
lags = 30  # do not use data from the first 30 days for training: the lag columns are null there
lag_step = 3  # values of day t-1, t-4, t-7, t-10…
lag_columns = []
value_columns = ['x_1', 'x_2', 'label']
for c in value_columns:
    for i in range(1, lags + 1, lag_step):
        lag_columns.append(pl.col(c).shift(i).alias(f"{c}_lag_{i}"))
df_prep = df_prep.with_columns(lag_columns)
df_prep.tail(5)
import polars as pl
import polars.selectors as cs
# ====================================
# Values of the same date in the previous 1, 2, 3… months or quarters or years.
# Example: date = 2022-09-01 -> same date in the previous 1 month = 2022-08-01
# The end date of a month differs from month to month. You can use the next or the previous date to fill in: date = 2022-07-31 -> same date in the previous 1 month = 2022-07-01 or 2022-06-30.
# The choice depends on the characteristics of the data and the purpose of forecasting.
prev_month = 3 # do not use data from the first 3 months for training. There will be a bunch of null values in prev_3_month_value.
prev_step = 1
###############
# Version: Using previous date to fill in
# get previous month date for joining
prev_columns = []
for i in range(1, prev_month + 1, prev_step):
    # by default: use the previous date to fill in, e.g. 2022-06-30 for 2022-07-31
    prev_columns.append(pl.col("date").dt.offset_by(f"-{i}mo").alias(f"prev_{i}_month"))
# https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.offset_by.html
# # same date last week
# prev_columns.append(pl.col("date").dt.offset_by(f"-{i}w").alias(f"prev_{i}_week"))
# # same date last quarter
# prev_columns.append(pl.col("date").dt.offset_by(f"-{i}q").alias(f"prev_{i}_quarter"))
# # same date last year
# prev_columns.append(pl.col("date").dt.offset_by(f"-{i}y").alias(f"prev_{i}_year"))
df_prep = df_prep.with_columns(prev_columns)  # build on df_prep so the lag features created above are kept
# get previous month data by self-joining with date = prev_{i}_month
value_columns = ['x_1', 'x_2', 'label']
for i in range(1, prev_month + 1, prev_step):
    value_columns_rename = {'date': 'date_join'}
    for c in value_columns:
        value_columns_rename[c] = f"{c}_prev_{i}_month"
    df_prep = df_prep.join(df_prep.select(['date'] + value_columns).rename(value_columns_rename),
                           left_on=f"prev_{i}_month",
                           right_on='date_join',  # by default polars does not add the right_on column to the result
                           how="left")
# # delete columns with name starts with "prev_"
# df_prep = df_prep.drop(cs.starts_with("prev_"))
df_prep.slice(45, 10)
4. Aggregated values:
Sometimes using the values as they are is not enough; transforming them into something else can improve the model further. Aggregation can be applied to lagged values and to the prediction results of other related factors:
import polars as pl
import polars.selectors as cs
# ====================================
# Average growth of the last 2, 7, 14, 28, 30... days.
growth_period = 28  # growth over the last 28 days. Must match the lag range created above.
value_columns = ['x_1', 'x_2', 'label']
for c in value_columns:
    df_prep = df_prep.with_columns(((pl.col(f"{c}_lag_1") - pl.col(f"{c}_lag_{growth_period}")) / pl.col(f"{c}_lag_{growth_period}")).alias(f"{c}_growth_rate_{growth_period}_days"))
# ====================================
# Rolling Mean, max, min, median, standard deviation… of days t-1, t-4, t-7, t-10…, t-25, t-28
for c in value_columns:
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.mean().alias(f"{c}_rolling_mean"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.min().alias(f"{c}_rolling_min"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.max().alias(f"{c}_rolling_max"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_lag")).list.std().alias(f"{c}_rolling_std"))
    # drop the raw lag columns so the result is easier to read
    df_prep = df_prep.drop(cs.starts_with(f"{c}_lag"))
df_prep.tail(10)
import polars as pl
import polars.selectors as cs
# ====================================
# Rolling mean, max, min, median, standard deviation… of the same date in the previous 1, 2, 3… months or quarters or years....
for c in value_columns:
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.mean().alias(f"{c}_same_date_rolling_mean"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.min().alias(f"{c}_same_date_rolling_min"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.max().alias(f"{c}_same_date_rolling_max"))
    df_prep = df_prep.with_columns(pl.concat_list(cs.starts_with(f"{c}_prev_")).list.std().alias(f"{c}_same_date_rolling_std"))
    # drop the raw same-date columns so the result is easier to read
    df_prep = df_prep.drop(cs.starts_with(f"{c}_prev_"))
# delete columns with name starts with "prev_"
df_prep = df_prep.drop(cs.starts_with("prev_"))
df_prep.head(10)
5. Automated features:
There are various Python libraries that can automatically generate features quickly and efficiently. However, not all libraries are equal: some can generate distinctive features but run slowly or are difficult to learn. Well-known examples include tsfresh and Featuretools.
III. Conclusion:
The article covered brief definitions of time series forecasting, observations, and features, along with several ways of creating time series features: time-based features, prediction results of other related factors, lagged values, aggregated values, and automated features.
There is no limit to creativity in feature engineering. It consumes a lot of time through a trial-and-error loop: create new features, put them through training, validate the model results, and repeat.
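The validation step of that loop deserves care with time series: splits must respect time order. This is a minimal sketch of expanding-window (walk-forward) validation using a naive "last value" baseline; the series, the minimum training size, and the baseline model are all illustrative, and any real model with any feature set can be swapped in:

```python
# Expanding-window (walk-forward) validation with a naive baseline.
series = [10, 12, 13, 15, 18, 21, 25, 24, 27, 30]
min_train = 5  # smallest training window before we start scoring

abs_errors = []
for t in range(min_train, len(series)):
    train = series[:t]          # everything strictly before day t
    prediction = train[-1]      # naive model: predict the last observed value
    abs_errors.append(abs(series[t] - prediction))

mae = sum(abs_errors) / len(abs_errors)
print(round(mae, 2))  # 2.8
```

Each new feature set is judged by whether it beats the previous iteration (and the naive baseline) under this same time-ordered split, never under a random shuffle.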
Full code example in the article: duongtruongtrong/time_series_feature_engineering: Code cheatsheets for Time series feature engineering