From Research To Reality: Deep Learning Methods on Time Series Forecasting on Financial Data
Applying various deep learning architectures published in research papers to financial data


Deep Learning methods have made significant strides in the field of time-series forecasting. Compared to traditional statistical methods like ARIMA, they offer superior scalability, flexibility, and potential for higher accuracy, especially when the data has complex patterns, high dimensionality, and nonlinear relationships. One of the key advantages of deep learning is its ability to automatically learn complex data dependencies, bypassing the need for extensive feature engineering often required by ML methods like LightGBM or XGBoost.

I've explored several research papers on deep learning in time-series forecasting and applied these models to financial data using PyTorch. You can check out the results and code on GitHub here.

In this post, I've also summarized and explained key insights from a financial-data perspective. Feel free to dive in and share your thoughts!

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Model-1: Transformer Model

The Transformer model is based on the paper “Attention Is All You Need.” The paper fundamentally changed the landscape of large language models (LLMs) by introducing the Transformer architecture, which relies solely on the attention mechanism—specifically, self-attention—and completely replaces traditional sequence models like RNNs, LSTMs, and GRUs for sequence modeling tasks.

Because time-series data are also sequential, the Transformer can be applied to this kind of data as well. Since the paper is written from the language perspective for NLP tasks, I’ll explain the key concepts of the paper in time-series terms (in this case, stock price prediction).

The attention mechanism in stock price prediction would allow the model to focus on specific historical time steps (e.g., past days, weeks, or months) that are most relevant for predicting future stock prices. This means that, rather than treating all past data points equally, the model identifies which points in time (e.g., particular days or events) are the most important for making an accurate prediction. For example, it may focus on the days when quarterly earnings were released (high relevance), days when significant market events or announcements affected stock prices, or holiday periods.

In summary, the attention mechanism dynamically highlights which time steps (historical stock prices) matter the most for predicting today’s stock price.

Self-attention takes the attention mechanism a step further. It relates every historical stock price to every other historical price within the window, allowing the model to capture complex interdependencies across time. It enables the model to learn how past stock prices influence each other and how they collectively contribute to the prediction of future prices.

For example, I’m using 60 days of historical data to predict the today’s price. With self-attention, the model doesn't just look at each day in isolation but computes how each day relates to every other day in the sequence. Like:

  • The stock price 40 days ago might have some correlation with the price 10 days ago, and the model captures this relationship.

  • The price 20 days ago might have been affected by some global market event, and the self-attention mechanism can compare that day with more recent days, understanding how that event influenced the stock price over time.

So, what do the attention components Query, Key, and Value mean here?

Query: The price that we want to predict. This represents the stock price for which we are gathering information from the past.

Key: This represents the data points from the chosen historical time window. These are the points in time the model compares against the current day’s price to calculate attention scores.

Value: The actual stock prices corresponding to those historical days. These values are weighted by the attention scores and aggregated to make the final prediction.

And because of multi-head attention, one head can focus on short-term trends while another head focuses on long-term trends simultaneously.
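
To make this concrete, here is a minimal PyTorch sketch of a Transformer encoder that maps a 60-day window of prices to a next-day prediction. The hyperparameters (embedding size, number of heads and layers) are illustrative, not necessarily the configuration used in my repository.

```python
# Minimal sketch: a Transformer encoder over a 60-day price window (illustrative sizes).
import torch
import torch.nn as nn

class TransformerForecaster(nn.Module):
    def __init__(self, window: int = 60, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)                               # embed each daily price
        self.pos_emb = nn.Parameter(torch.randn(1, window, d_model) * 0.02)   # learned positions
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                                     # predict the next-day price

    def forward(self, x):                                 # x: (batch, 60, 1) past prices
        h = self.input_proj(x) + self.pos_emb             # add positional information
        h = self.encoder(h)                               # self-attention relates every day to every other day
        return self.head(h[:, -1, :])                     # read out from the last time step

# Usage: a batch of 8 windows of 60 daily prices -> 8 next-day predictions
model = TransformerForecaster()
preds = model(torch.randn(8, 60, 1))                      # shape: (8, 1)
```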


Model-2: N-BEATS Model

The N-BEATS model is a deep learning architecture specifically designed for time-series forecasting. Introduced in the paper "N-BEATS: Neural Basis Expansion Analysis For Interpretable Time Series Forecasting," it emphasizes both performance and interpretability. A good thing about N-BEATS is that it doesn't require any feature engineering or input scaling—it works directly with raw time-series data, making it especially useful for applications like stock price prediction.

How N-BEATS Works:

The model is built from stacked blocks, and each block does two main things:

  1. Backcast Output: This part tries to reconstruct the past values of the time series. In the context of stock prices, it attempts to recreate historical prices, helping the model understand patterns and trends that might need adjustment in later layers.
  2. Forecast Output: This is where the model predicts future values over a certain horizon, like forecasting stock prices for the next few days or weeks.

The process is iterative:

  • First Block: Processes the raw input data (e.g., the past 60 days of stock prices).
  • Subsequent Blocks: Each block takes the residuals—the differences between the backcast and the actual input—from the previous block. This means each block focuses on refining what the previous ones didn't capture well, improving the overall forecast step by step.

Inside Each Block:

Each block consists of two key components:

  • Expansion Coefficients (θ): Learned through fully connected layers, these coefficients determine how much each basis function contributes to the outputs.
  • Basis Functions (Basis Vectors): These can be predefined functions like polynomials or sine and cosine waves, or they can be learned by the model. They capture different patterns in the data, such as trends (overall direction) and seasonality (repeating patterns).

By combining the expansion coefficients with the basis functions, the model constructs the backcast and forecast outputs.
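
Here is a hedged, minimal PyTorch sketch of this idea: a generic block whose "basis" is learned by linear layers, plus the doubly residual stacking loop. The real model also offers interpretable trend and seasonality bases; the sizes below (60-day backcast, 7-day forecast, three blocks) are just examples.

```python
# Minimal N-BEATS-style sketch: generic block + doubly residual stacking (illustrative sizes).
import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    def __init__(self, backcast_len=60, forecast_len=7, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(backcast_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # The heads play the role of expansion coefficients + basis; here the basis is
        # learned directly by linear layers (the paper's "generic" block).
        self.backcast_head = nn.Linear(hidden, backcast_len)
        self.forecast_head = nn.Linear(hidden, forecast_len)

    def forward(self, x):                                  # x: (batch, backcast_len)
        h = self.fc(x)
        return self.backcast_head(h), self.forecast_head(h)

class NBeats(nn.Module):
    def __init__(self, n_blocks=3, backcast_len=60, forecast_len=7):
        super().__init__()
        self.blocks = nn.ModuleList(NBeatsBlock(backcast_len, forecast_len) for _ in range(n_blocks))

    def forward(self, x):                                  # doubly residual stacking
        residual, forecast = x, 0.0
        for block in self.blocks:
            backcast, block_forecast = block(residual)
            residual = residual - backcast                 # pass on what this block failed to explain
            forecast = forecast + block_forecast           # partial forecasts are summed
        return forecast

# Usage: 60 past days in, 7-day forecast out
preds = NBeats()(torch.randn(8, 60))                       # shape: (8, 7)
```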

Why It's Interpretable:

Using basis functions allows N-BEATS to break down the time series into understandable components:

  • Trend Components: Showing the general direction of stock price movements over time.
  • Seasonal Components: Highlighting repeating patterns or cycles in stock prices, like quarterly earnings impacts or annual trends.

This decomposition helps us not only see the predicted stock prices but also understand the underlying factors influencing those predictions.

Applying N-BEATS to Stock Price Prediction:

  • Backcast Example: Again, take the past 60 days of a stock's price data as the input. Here the backcast output tries to reconstruct these historical prices. Any discrepancies between the backcast and the actual prices are residuals that the next block aims to minimize.
  • Forecast Example: Using insights from the backcast, the model predicts future stock prices. For instance, it might forecast prices for the next 7 days, taking into account the trends and seasonal patterns it has identified.
  • Refining Predictions: If unexpected events (like sudden news releases or economic reports) caused anomalies in stock prices that weren't fully captured by the first block, the residuals passed to the next block help the model adjust and refine its predictions accordingly.

So as explained above, the combination of accuracy and interpretability makes N-BEATS a robust tool for investors, analysts, and financial institutions aiming to understand and forecast stock market trends.

Model-3: Temporal Convolutional Networks

The Temporal Convolutional Network (TCN) is from the paper "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". The model focuses on how convolutional networks can be used on sequential data. It mainly centers around two points:

1) The convolution in the architecture is causal, meaning that there is no information “leakage” from future to past. Basically, it means the model ensures that predictions at time ‘t’ do not have access to data from future time steps (i.e., t+1, t+2, ...), preventing data leakage and preserving the time order.

2) The architecture can take a sequence of any length and map it to an output sequence of the same length, just as with an RNN.

The above two points can be combined into a single equation:

TCN = 1D FCN + Causal Convolutions

Apart from this, the emphasis was also on how to build very long effective history sizes (i.e., the ability of the network to look very far into the past to make a prediction) using a combination of very deep networks (augmented with residual layers) and dilated convolutions.

The overall idea is to combine simplicity, autoregressive prediction, and very long memory.

Explanation of some key terms from the model:

A simple causal convolution can only look back at a limited history of the input sequence. This limitation occurs because the size of the receptive field (the portion of the input sequence the model can "see" at any given time step) grows linearly with the depth of the network.

This linear growth makes it difficult to apply causal convolutions to sequence tasks that require modeling long-term dependencies, as it would require a deep network to capture a large enough receptive field, which is computationally expensive and inefficient.

So, to address this limitation dilated convolutions are introduced.

Basically, the dilation factor d controls the distance between elements that the filter looks at in the input sequence. When d = 1, this reduces to a normal convolution. When d > 1, the convolution becomes dilated and can skip over certain time steps. For example, if d = 2, the convolution skips every other element, effectively doubling the receptive field. So, if the value of d increases exponentially with depth, the model can "see" further into the past without needing deeper layers or wider filters, because with each layer the number of time steps the network can "reach" grows exponentially.
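
As a quick illustration, the sketch below shows one way to implement a causal, dilated 1-D convolution in PyTorch by left-padding with (kernel_size − 1) × dilation zeros, so the output at time t never sees inputs after t. The layer sizes are arbitrary examples.

```python
# Causal, dilated 1-D convolution: left-pad so the output at time t only sees t, t-d, t-2d, ...
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # pad on the left only (no look-ahead)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))          # output keeps the same length

# Stacking layers with dilations 1, 2, 4, 8 grows the receptive field exponentially:
# with kernel_size=3, four such layers already cover 1 + 2*(1+2+4+8) = 31 past time steps.
x = torch.randn(8, 1, 60)                                  # 60 days of prices, 1 channel
y = CausalConv1d(1, 16, dilation=4)(x)                     # shape: (8, 16, 60)
```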

There is also a residual block in the network. Residual blocks in TCNs allow deep networks to learn small modifications to the identity mapping, improving gradient flow and stabilizing training. To handle input-output dimension mismatches, 1x1 convolutions are applied before the residual addition, ensuring compatible shapes for deeper layers.
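
Putting the two ideas together, here is a hedged sketch of a TCN-style residual block: two dilated causal convolutions with a skip connection, and a 1x1 convolution when input and output channel counts differ. The paper also applies weight normalization, which is omitted here for brevity.

```python
# TCN-style residual block sketch: dilated causal convolutions + skip connection (illustrative sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.dropout = nn.Dropout(dropout)
        # 1x1 convolution matches channel dimensions for the residual addition
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                                  # x: (batch, channels, time)
        h = F.relu(self.conv1(F.pad(x, (self.pad, 0))))    # causal conv 1
        h = self.dropout(h)
        h = F.relu(self.conv2(F.pad(h, (self.pad, 0))))    # causal conv 2
        h = self.dropout(h)
        return F.relu(h + self.downsample(x))              # residual connection

y = TemporalBlock(1, 16, dilation=2)(torch.randn(8, 1, 60))   # shape: (8, 16, 60)
```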


Detailed explanation on applying TCN to stock price data:

So, here is how we can explain the model's workings on stock data in three parts:

1) When predicting the future, we don't allow the model to peek at how the future data will look. The idea is simple: we don't know what the stock prices of the future will be. If the model used future stock prices to predict today's price, it would be unrealistic and could lead to overestimating how well the model performs.

TCN model handles this by using causal convolutions. This ensures that the model's predictions are based on the same information that would have been available in a real-world scenario, making the predictions more reliable and applicable.

2) Stock prices often show unpredictable behaviour because they are influenced by a mix of short-term events and long-term trends. Events from months or even years ago can affect current prices—like a company's reputation, long-term investments, or regulatory changes.

The TCN uses dilated convolutions to address this. In simple words, it looks at every nth point instead of every single one. By adjusting the dilation factor, the model can effectively "reach back" further in time without needing more layers or becoming too large. This allows the TCN to consider patterns and information from the distant past efficiently.

3) Deep neural networks become difficult to train as they get deeper. They can suffer from problems like vanishing gradients, where the early layers learn very slowly because the learning signal weakens as it moves backward through the layers. This can make the model less effective at capturing important patterns in complex data like stock prices.

To make training more efficient, the TCN incorporates residual connections. They allow the model to pass information from one layer to a later one without it having to flow through all the intermediate layers, giving the network a direct shortcut path alongside the main one. This helps maintain strong learning signals throughout the network, allowing deeper models to be trained without the usual problems. For stock price prediction, this means the model can be both deep enough to capture complex patterns and efficient enough to train effectively, leading to better performance.

Model-4: Temporal Fusion Transformer

Temporal Fusion Transformer (TFT) is a neural network architecture specifically designed for time-series forecasting, introduced in the paper “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”. TFT extends the traditional Transformer architecture by integrating it with recurrent components and novel attention mechanisms tailored for temporal data. This model combines the strengths of Transformers and LSTMs (Long Short-Term Memory networks) to effectively handle sequential data like stock prices while providing interpretability in its predictions.

Time-series data, such as stock price movements, are inherently sequential. The TFT is adept at modeling this kind of data. Let's explore the key concepts of the Temporal Fusion Transformer using stock price prediction as an example.

Attention Mechanism in TFT:

In stock price prediction, the attention mechanisms within TFT allow the model to focus on specific historical time steps (e.g., past days, weeks, or months) and variables that are most relevant for forecasting future prices. Instead of treating all past data points and features equally, TFT identifies which time steps and features (such as economic indicators, company announcements, or global events) are most influential. For instance, it might emphasize:

  • Quarterly earnings releases: Recognizing their high relevance due to potential impact on stock prices.
  • Significant market events: Capturing days when major announcements affected the market.
  • Seasonal periods like holidays: Understanding patterns that recur annually.

In summary, the attention mechanism in TFT dynamically highlights which time steps and features matter most for predicting future stock prices.

Temporal Self-Attention Mechanism:

TFT employs a Temporal Self-Attention mechanism, which relates every historical stock price and feature to every other within the input window. This captures complex interdependencies across time and features, enabling the model to learn how past stock prices and covariates influence each other and collectively contribute to future price predictions.

For example, using 60 days of historical data to predict future stock prices:

  • Inter-day Relationships: The model might find that the stock price 40 days ago is correlated with the price 10 days ago due to underlying market cycles.
  • Event Impact Over Time: A significant event affecting the price 20 days ago may have a lingering influence, which the model captures by relating that day to subsequent days.

This capability allows TFT to capture both short-term and long-term dependencies in the data.

Components of Attention: Query, Key, and Value:

In TFT's attention mechanism:

  • Query: Represents the embedding of the current time step or the feature for which we want to gather relevant information.
  • Key: Represents embeddings of historical time steps and features compared against the Query to compute attention scores.
  • Value: Embeddings of historical time steps and features, weighted by the attention scores and aggregated to inform the prediction.

So, the Query might be the embedding of the target time step, while the Keys and Values are embeddings of past time steps and features. This setup allows the model to attend over relevant time steps and features for accurate forecasting.

Multi-Head Attention:

TFT utilizes multi-head attention, where multiple attention mechanisms (heads) operate in parallel. Each head can focus on different aspects of temporal dynamics:

  • Short-Term Trends: One head might capture recent price movements.
  • Long-Term Dependencies: Another head might focus on longer-term patterns or cycles.
  • Seasonal Patterns: Additional heads might detect recurring patterns related to seasons or fiscal quarters.

This parallel processing enables the model to capture a wide range of temporal relationships simultaneously.

Variable Selection Networks and Gating Mechanisms:

A significant feature of TFT is its interpretability through Variable Selection Networks and Gating Mechanisms:

  • Variable Selection Networks: These networks learn to weigh the importance of different input variables (features) at each time step. The model determines which features are most relevant for prediction at different times.
  • Gating Mechanisms: They control the flow of information, allowing the model to focus on the most pertinent information while suppressing less relevant data. This enhances both performance and interpretability.

For instance, the model might prioritize macroeconomic indicators during economic downturns or company-specific news when relevant, adjusting its focus dynamically.
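
As a rough sketch of how such gating can look in code, below is a simplified gated residual network (GRN) in the spirit of TFT: an ELU feed-forward path, a GLU gate that can suppress the block's contribution, and a residual connection with layer normalization. Dimensions are illustrative, and details differ from the full paper (which also supports external context inputs).

```python
# Simplified gated residual network (GRN) sketch, in the spirit of TFT's gating mechanism.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualNetwork(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)
        self.fc2 = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 2 * d_model)   # produces value and gate halves for GLU
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a):                             # a: (batch, time, d_model)
        h = F.elu(self.fc1(a))                        # nonlinear feed-forward path
        h = self.fc2(h)
        h = F.glu(self.gate(h), dim=-1)               # sigmoid gate can suppress this path entirely
        return self.norm(a + h)                       # residual connection + layer norm

out = GatedResidualNetwork()(torch.randn(8, 60, 64))  # shape: (8, 60, 64)
```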

Also, by using quantile regression at its output layer, the Temporal Fusion Transformer predicts different quantiles (e.g., 10th, 50th, 90th percentiles) of future stock prices, providing probabilistic forecasts based on the stock data.
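
The training objective behind this is the quantile (pinball) loss; a minimal sketch is shown below, with the 10th/50th/90th percentiles as example quantile levels.

```python
# Minimal quantile (pinball) loss sketch for probabilistic forecasts (example quantile levels).
import torch

def quantile_loss(pred, target, quantiles=(0.1, 0.5, 0.9)):
    # pred: (batch, horizon, n_quantiles), target: (batch, horizon)
    losses = []
    for i, q in enumerate(quantiles):
        err = target - pred[..., i]
        losses.append(torch.max(q * err, (q - 1) * err))   # pinball loss for quantile q
    return torch.mean(torch.stack(losses))

loss = quantile_loss(torch.randn(8, 7, 3), torch.randn(8, 7))
```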


Model-5: Informer

The Informer model is described in the research paper "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting". It is specifically designed to tackle the challenges of long sequence time-series forecasting (LSTF). Though Transformers have shown potential in forecasting tasks, they struggle with LSTF due to issues like quadratic time complexity, high memory usage, and limitations of the encoder-decoder architecture. Informer addresses these problems with three key innovations, which I'm going to explain from a stock price prediction point of view:

1. ProbSparse Self-Attention Mechanism:

This means that instead of computing full self-attention across all tokens—which is computationally heavy—Informer introduces ProbSparse self-attention. This method focuses only on the most informative queries, significantly reducing complexity. By concentrating on queries with the largest attention scores (based on measures like Kullback-Leibler Divergence), the model can ignore those with low information contribution. This approach reduces the computation from quadratic to nearly linear time.

The above point is so true for stock data; not every piece of historical data is equally important. Some days have events that significantly impact prices—like earnings releases, economic announcements, or sudden market shifts—while other days are just business as usual. The ProbSparse self-attention mechanism focuses on these critical moments. Instead of looking at every single past time point (which can be computationally heavy), it zeroes in on the most informative ones. This means the model pays more attention to the days that actually matter for predicting future prices, making the forecasting process more efficient and accurate.
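
A simplified illustration of the query-selection idea is sketched below: score each query by how far its maximum attention logit sits above its mean logit, then keep only the top-u "active" queries. The actual paper additionally estimates this score on a random subset of keys for efficiency, which is omitted here.

```python
# Simplified sketch of ProbSparse-style query selection (the key-sampling trick is omitted).
import math
import torch

def select_active_queries(Q, K, u):
    # Q: (batch, L_q, d), K: (batch, L_k, d)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))    # attention logits
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)  # max-minus-mean score per query
    return sparsity.topk(u, dim=-1).indices                     # indices of the most informative queries

# Usage: pick roughly u = c * ln(L) "active" queries out of a 96-step window
idx = select_active_queries(torch.randn(8, 96, 64), torch.randn(8, 96, 64), u=int(5 * math.log(96)))
```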

2. Self-Attention Distillation

This is a multi-scale technique that compresses long sequences by progressively halving the sequence length at each layer. It does this by aggregating neighboring time steps, effectively summarizing the data while retaining key information.

This context is very useful for stock market data. Stock market data can be noisy and redundant. There might be periods where prices don't change much or where patterns repeat. The self-attention distillation component helps simplify this data. It compresses long sequences by grouping similar time steps together, effectively summarizing the important trends and patterns. In simple words, it's like taking a long, complicated story and distilling it down to its key points. This makes it easier for the model to handle long histories without getting overwhelmed, ensuring that essential information isn't lost in the noise.
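
A hedged sketch of the distilling step between encoder layers is shown below: a 1-D convolution over the time axis followed by max-pooling with stride 2, which roughly halves the sequence length. The dimensions are illustrative.

```python
# Distilling-layer sketch: summarize neighboring time steps and halve the sequence length.
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                        # x: (batch, time, d_model)
        h = x.transpose(1, 2)                    # Conv1d expects (batch, channels, time)
        h = self.pool(self.act(self.conv(h)))    # aggregate neighbors, halve the length
        return h.transpose(1, 2)

out = DistillingLayer()(torch.randn(8, 96, 64))  # shape: (8, 48, 64)
```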

3. Generative Style Decoder

Traditional Transformer models decode step by step with full attention, which is costly and lets errors accumulate over long horizons. Informer's decoder applies the same sparse attention mechanism as the encoder, making it more efficient, and it operates in a generative style: it takes a "start token" of recent observed values together with placeholder positions for the targets and predicts all future time steps in a single forward pass, rather than feeding each output back in to produce the next. This approach allows for efficient forecasting over long horizons without the heavy computational demands of full attention or step-by-step decoding.

By focusing on the most informative time steps and reducing unnecessary computations, Informer is well-suited for time-series data where capturing long-range dependencies is essential for accurate predictions.

This is very relevant for stock data, because predicting stock prices isn't just about the next day—it often involves forecasting weeks or months into the future. The Informer's generative-style decoder tackles this by predicting multiple future time steps efficiently. Instead of generating tomorrow's price, feeding it back in, and repeating day by day—where early errors compound—it produces the whole forecast horizon in one pass. This streamlines the process, allowing for faster and more efficient long-term forecasts without sacrificing accuracy.
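
For illustration, here is a minimal sketch of how such a decoder input can be assembled: a "start token" of recent observed values concatenated with zero placeholders for the positions to be predicted, so all future steps come out of one forward pass. The window lengths (label_len, pred_len) are example values.

```python
# Sketch of building a generative-style decoder input: known recent values + zero placeholders.
import torch

def build_decoder_input(history, label_len=48, pred_len=24):
    # history: (batch, time, features) - the encoder's input window
    start_token = history[:, -label_len:, :]                          # recent observed values
    placeholders = torch.zeros(history.size(0), pred_len, history.size(2))  # slots to be predicted
    return torch.cat([start_token, placeholders], dim=1)              # (batch, label_len + pred_len, features)

dec_in = build_decoder_input(torch.randn(8, 96, 1))                   # shape: (8, 72, 1)
```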

By integrating these components, the Informer model becomes particularly adept at stock price prediction. It handles long-term dependencies by focusing on the most relevant historical data, reduces computational load by simplifying input sequences, and efficiently generates future forecasts. This makes it a powerful tool for analysts and traders who need to make informed decisions based on extensive historical stock data.


Future Prospects:

In this research setup, I tried some popular deep learning algorithms for time series forecasting. However, there are other algorithms that could be explored in the future, such as Amazon's DeepAR and Chronos. Similarly, zero-shot algorithms like TimeGPT could be used to assess performance when the model has not seen any specific data. If the performance is promising—especially since TimeGPT is also trained on financial data—it might be worth considering. Additionally, other new breakthroughs in AI, such as diffusion models (though primarily used for generating images), could be interesting to try.

Furthermore, these models could be tested on datasets like M4, which encompass various types of seasonality, including daily, hourly, monthly, quarterly, and yearly patterns.

Note: This research was conducted for experimental purposes only, not for any investment purposes.

#TimeSeriesForecasting #DeepLearning #AI #DataScience #FinancialData #MachineLearning #FinancialForecasting
