TimesFM: A Foundation Model Revolutionizing Time-Series Forecasting

Time-series data, like stock prices or weather patterns, is everywhere. Predicting the future of this data – forecasting – is crucial for many applications, from optimizing supply chains to predicting energy demands. Traditionally, specialized models are built for each forecasting task, demanding significant time and resources. But what if we could have a single, powerful model capable of accurate forecasting across diverse datasets, straight out-of-the-box?

Inspired by the success of large language models (LLMs) in natural language processing, researchers have introduced TimesFM: a decoder-only foundation model for time-series forecasting. This innovative model is trained on a massive corpus of both real-world and synthetic time-series data, enabling it to achieve remarkable zero-shot accuracy on unseen datasets, rivalling the performance of supervised models meticulously trained for specific tasks.

This blog post delves into the intricacies of TimesFM, simplifying its concepts and architecture for beginners while providing a comprehensive overview of its capabilities.

The Power of Pretraining: A Parallel with LLMs

Imagine you're learning a new language. If you've already mastered a similar language, you'll pick up the new one much faster. This is analogous to how foundation models like LLMs work. By training on vast amounts of text data, they learn general language patterns, making them adaptable to various tasks.

TimesFM adopts a similar strategy. By pretraining on a massive time-series dataset, it learns fundamental temporal patterns, allowing it to generalize well to unseen time-series data, even across different domains, granularities (like hourly or daily data), and forecast horizons.

Navigating the Architecture:

The paper illustrates the TimesFM architecture during training: an input time-series of a given length is broken down into input patches. Each patch is processed by a residual block (as defined in the model) into a vector of the transformer layers' model dimension. That vector is added to positional encodings and fed into nl stacked transformer layers, where SA denotes (multi-head causal) self-attention and FFN is the fully connected layer in the transformer. The output tokens are then mapped through another residual block to an output of size output_patch_len, which is the forecast for the time window following the last input patch seen by the model so far.

Let's dissect TimesFM's architecture, uncovering the key elements that contribute to its exceptional performance:

1. Patching: Breaking Down the Data

Much like words form sentences, time-series data can be segmented into meaningful chunks. TimesFM employs a patching technique, dividing the time-series into smaller, more manageable "patches". These patches, akin to tokens in LLMs, provide a structured way for the model to process the temporal information. This approach improves computational efficiency and allows the model to handle varying context lengths during training and inference.
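
As a rough, hypothetical illustration (the actual patching happens inside the model, and the patch length of 32 simply mirrors the input_patch_len used in the usage example later in this post), a context window can be reshaped into fixed-length patches like this:

import numpy as np

# Toy sketch: split a context of 96 timepoints into 3 patches of length 32.
# TimesFM does this internally; the reshape here only illustrates the idea.
series = np.sin(np.linspace(0, 20, 96))
patch_len = 32
patches = series.reshape(-1, patch_len)
print(patches.shape)  # (3, 32): three "tokens", each covering 32 timepoints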

2. Decoder-Only Model: Predicting the Future from the Past

TimesFM employs a decoder-only architecture, similar to LLMs like GPT. This means it learns to predict the next patch based on all preceding patches. This causal nature enables efficient parallel processing and empowers the model to predict the future based on varying past information (context).
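
Here is a minimal sketch of what "causal" means in this setting, using a toy lower-triangular attention mask (not the model's actual implementation): patch i may attend only to patches 0 through i, never to future patches.

import numpy as np

# Toy causal mask over 4 patches: True means "allowed to attend".
# Row i (the query patch) can only look at columns 0..i (past and current patches).
num_patches = 4
causal_mask = np.tril(np.ones((num_patches, num_patches), dtype=bool))
print(causal_mask)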

3. Longer Output Patches: Increasing Efficiency

Unlike LLMs that generate one token at a time, TimesFM can predict larger chunks of the future using longer output patches. This significantly reduces the number of autoregressive steps required, especially for long-horizon forecasting, enhancing prediction efficiency.
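
A quick back-of-the-envelope sketch of why this helps (the 512-step horizon is arbitrary; 32 and 128 mirror the input and output patch lengths used in the usage example below):

import math

# Number of autoregressive decoding steps needed to cover a 512-step horizon
# for different output patch lengths: longer output patches mean fewer steps.
horizon = 512
for output_patch_len in (1, 32, 128):
    steps = math.ceil(horizon / output_patch_len)
    print(f"output_patch_len={output_patch_len:>3} -> {steps} steps")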

4. Patch Masking: Handling Diverse Context Lengths

To prevent the model from becoming overly reliant on specific context lengths (multiples of the input patch length), TimesFM utilizes a clever patch masking strategy. During training, random portions of patches, or even entire patches, are masked. This forces the model to learn from diverse context lengths, making it adaptable to various forecasting scenarios.
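
A hedged sketch of the idea (the real training code differs in its details): by hiding a random-length prefix of the context, the effective context length the model sees is no longer forced to be a multiple of the patch length.

import numpy as np

# Toy sketch of patch masking: hide a random prefix of the context so the
# effective context length is not always a multiple of the patch length.
rng = np.random.default_rng(0)
patch_len = 32
context = np.sin(np.linspace(0, 20, 4 * patch_len))  # 4 full patches
r = rng.integers(0, patch_len)                       # random prefix length to mask
mask = np.zeros(context.size, dtype=bool)
mask[:r] = True                                      # True = masked (ignored by the model)
print(f"masked prefix: {r} timepoints, effective context: {context.size - r}")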

5. Input and Output Layers: Transforming the Data

TimesFM uses residual blocks – essentially multi-layer perceptrons with skip connections – to process the input patches into vectors compatible with the transformer layers. Similarly, another residual block maps the output tokens from the transformer to the final forecasts.
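
As a minimal sketch (the layer sizes and activation here are assumptions, not the released model's exact definition), a residual block maps a raw input patch to a vector of the transformer's model dimension while adding a linear skip connection:

import numpy as np

# Toy residual block: an MLP with a skip connection that maps an input patch
# (length 32) to the transformer's model dimension (1280 in the 200M model).
def residual_block(x, w_hidden, w_out, w_skip):
    hidden = np.maximum(0.0, x @ w_hidden)   # hidden layer with ReLU
    return hidden @ w_out + x @ w_skip       # MLP output plus linear skip path

rng = np.random.default_rng(0)
patch_len, hidden_dim, model_dim = 32, 1280, 1280
x = rng.normal(size=(1, patch_len))          # one input patch
y = residual_block(
    x,
    0.02 * rng.normal(size=(patch_len, hidden_dim)),
    0.02 * rng.normal(size=(hidden_dim, model_dim)),
    0.02 * rng.normal(size=(patch_len, model_dim)),
)
print(y.shape)  # (1, 1280)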

Detailed Processing and Prediction in TimesFM:

The input time-series is first divided into patches. Each patch is processed by a residual block (a small multi-layer perceptron with a skip connection that eases the training of deeper networks) and combined with positional encoding to retain the sequence order. The resulting tokens are passed through stacked transformer layers, each consisting of multi-head causal self-attention (SA) and a feed-forward network (FFN), to capture complex patterns and dependencies in the data. The processed tokens are then mapped to forecasted values by another residual block, allowing the model to predict larger chunks of future values at once. The model is trained using the Mean Squared Error (MSE) loss function, which measures the average squared difference between the predicted and actual values, mathematically defined as:

MSE = (1/N) Σ (y_i − ŷ_i)², where y_i is the actual value, ŷ_i the prediction, and N the number of predicted timepoints.
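
As a toy numerical check of this loss (the numbers are made up; in training the loss is averaged over the predicted output patches in a batch):

import numpy as np

# Toy MSE computation: average of squared differences between actual and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # ≈ 0.025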

Building the Foundation (A Rich and Diverse Pretraining Dataset):

The success of TimesFM hinges on its pretraining dataset. A vast and diverse collection of time-series data is crucial for the model to learn a wide range of temporal patterns. The researchers curated a massive dataset comprising over 100 billion timepoints from various sources:

- Real-world data: This includes data from Google Trends, capturing search interest over time for millions of queries, and Wiki Pageview statistics, providing insights into the hourly views of Wikimedia pages.

- Synthetic data: To ensure representation of various temporal dynamics, they incorporated synthetic time-series generated from processes such as ARMA models, mimicking common seasonal patterns, trends, and step functions (see the sketch after this list).

- Publicly available datasets: Additional time-series data from sources like the M4 competition, electricity and traffic datasets, and weather data further enriched the pretraining corpus, enhancing the model's ability to generalize to different domains and granularities.
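
As promised above, here is a hedged sketch of how a synthetic series in this spirit could be generated; the ARMA coefficients, seasonality, and trend below are arbitrary choices for illustration, not the paper's actual data-generation recipe:

import numpy as np

# Toy synthetic series: an ARMA(2, 1)-style recursion plus a fixed seasonal
# pattern and a slow linear trend (all coefficients chosen arbitrarily).
rng = np.random.default_rng(0)
n = 500
noise = rng.normal(scale=0.5, size=n)
series = np.zeros(n)
for t in range(2, n):
    series[t] = 0.6 * series[t - 1] - 0.2 * series[t - 2] + noise[t] + 0.3 * noise[t - 1]
series += 2.0 * np.sin(2 * np.pi * np.arange(n) / 24)  # period-24 seasonality (e.g. hourly data with a daily cycle)
series += 0.01 * np.arange(n)                          # gentle upward trend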

Zero-Shot Performance (Putting TimesFM to the Test):

The true power of TimesFM lies in its zero-shot forecasting capabilities. The researchers evaluated its performance on three popular benchmark collections, all deliberately excluded from the pretraining data:

Figure: Average performance on three groups of datasets (lower is better; error bars show one standard error). Among the baselines, only TimesFM and llmtime are zero-shot. (a) On the Monash datasets, which have different scales, the MAE of each dataset is scaled by the MAE of a naive baseline and aggregated with a geometric mean (GM); TimesFM is the top model. (b) On the Darts benchmarks, using the same scaled MAE, TimesFM is within statistical significance of the best-performing methods, ARIMA and llmtime; these datasets contain one time-series each, so statistical methods are competitive with deep learning ones. (c) On 4 ETT datasets with 96- and 192-step prediction horizons (8 tasks in total), TimesFM and PatchTST are the best-performing models.

- Monash archive: This diverse collection of 30 datasets, covering various domains and granularities, served as a challenging testbed. TimesFM achieved remarkable results, surpassing the performance of even specialized supervised models trained on these datasets.

- Darts: This collection of 8 univariate datasets, known for their complex seasonal patterns and trends, further showcased TimesFM's ability to generalize well to intricate temporal dynamics.

- Informer (ETT) datasets: Designed for long-horizon forecasting, these datasets provided a rigorous evaluation of TimesFM's capabilities. In zero-shot mode, the model matched the best supervised long-horizon baselines, such as PatchTST.

The Advantages of TimesFM (A Paradigm Shift in Forecasting):

TimesFM marks a significant paradigm shift in time-series forecasting, offering numerous advantages:

- Zero-shot accuracy: Its pretrained nature allows it to provide accurate forecasts on diverse datasets without requiring any additional training, saving significant time and resources.

- Generality: Unlike specialized models, TimesFM can generalize well to different domains, forecast horizons, and temporal granularities.

- Efficiency: The model's architecture is optimized for computational efficiency, particularly for long-horizon forecasting, thanks to longer output patches and parallel processing.

- Accessibility: The release of pretrained TimesFM models will democratize access to advanced forecasting capabilities, empowering users across various domains.

Usage:

Initialize and load the model

import numpy as np
import pandas as pd
import timesfm

tfm = timesfm.TimesFm(
    context_len=<context>,    # context length; for this checkpoint, up to 512 and a multiple of input_patch_len (32)
    horizon_len=<horizon>,    # forecast horizon; can be set to any length you need
    input_patch_len=32,       # fixed for the 200M checkpoint
    output_patch_len=128,     # fixed for the 200M checkpoint
    num_layers=20,            # fixed for the 200M checkpoint
    model_dims=1280,          # fixed for the 200M checkpoint
    backend=<backend>,        # "cpu", "gpu" or "tpu"
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

Example with array inputs:

forecast_input = [
    np.sin(np.linspace(0, 20, 100)),
    np.sin(np.linspace(0, 20, 200)),
    np.sin(np.linspace(0, 20, 400)),
]
# Frequency category per input series: 0 = high frequency (up to daily),
# 1 = medium frequency (e.g. weekly, monthly), 2 = low frequency (e.g. quarterly, yearly).
frequency_input = [0, 1, 2]

# Performing Inference
point_forecast, experimental_quantile_forecast = tfm.forecast(
    forecast_input,
    freq=frequency_input,
)

Example with pandas dataframe:

input_df = pd.read_csv("your_input.csv")  # long-format data with an id column, a datetime column and the value column

# Performing Inference
forecast_df = tfm.forecast_on_df(
    inputs=input_df,
    freq="M",        # frequency of the series; "M" = monthly
    value_name="y",  # name of the value column in input_df
    num_jobs=-1,     # parallel preprocessing jobs; -1 uses all available cores
)
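
For reference, here is a minimal, hypothetical sketch of building a compatible dataframe, assuming the long format used in the repo's examples: an id column named unique_id, a datetime column named ds, and the value column named above (rename if your data uses different columns):

# Hypothetical toy input: one row per timestamp, 24 months of a single series.
toy_df = pd.DataFrame(
    {
        "unique_id": ["series_1"] * 24,
        "ds": pd.date_range("2022-01-31", periods=24, freq="M"),
        "y": np.sin(np.linspace(0, 6, 24)),
    }
)
toy_forecast_df = tfm.forecast_on_df(inputs=toy_df, freq="M", value_name="y", num_jobs=-1)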

Conclusion

TimesFM represents a significant advancement in time-series forecasting, leveraging the principles of large language models to achieve impressive zero-shot performance. Its innovative use of patching, decoder-only architecture, and efficient training techniques make it a powerful tool for various forecasting tasks.


Paper: A decoder-only foundation model for time-series forecasting

Google Research blog

Hugging Face checkpoint repo

By Kirouane Ayoub

"Nice explanation and thanks for the write-up. Does it work on any time series data, even for domains that it has not been trained on?"

Anas BAHOU, AI Engineer & Data Scientist: "Great work Ayoub! This is something I have been looking for to try. Does it work on multivariate time series?"
