Forecasting Stock Prices and Realized Volatility: A Hybrid Approach Using LSTM, SARIMAX, and Topological Data Analysis

Author: Larry Liang

Date: 2024/10/29

Abstract


This paper presents a hybrid approach to stock price and volatility forecasting, integrating machine learning models, traditional statistical techniques, and Topological Data Analysis (TDA). Specifically, we use a Long Short-Term Memory (LSTM) network to predict stock closing prices, a SARIMAX model to forecast realized volatility, and Wasserstein Distance (WD) from TDA to capture topological patterns in price changes. In addition, SHAP (SHapley Additive exPlanations) interpretation is used to enhance the transparency of the LSTM predictions. Our findings highlight the effectiveness of this hybrid framework for both price prediction and volatility forecasting, providing valuable insights for traders and portfolio managers.


Introduction

Financial markets are inherently complex, requiring sophisticated models to forecast stock prices and volatility. Traditional models such as ARIMA and SARIMA rely on time-series trends, while machine learning approaches can capture non-linear dependencies. This study aims to combine the strengths of both paradigms.

Additionally, we introduce Topological Data Analysis (TDA) to quantify persistent homology in price changes, adding Wasserstein Distance (WD) values to the feature set. WD serves two purposes. First, it captures hidden market patterns: structural dependencies in market returns that traditional metrics miss. Second, it is intended to improve prediction during volatile periods, when market behavior is strongly nonlinear. This hybrid framework offers a robust toolset for price forecasting, risk management, and decision-making.


Methodology

Data Collection and Preprocessing

We obtained historical data for Pinduoduo Inc. (PDD) over a period of 5 years from Yahoo Finance. The primary features used include:

  1. Closing Price (scaled): The target variable for price prediction.
  2. Percentage Change in Close Price: To model realized volatility.
  3. Wasserstein Distance (WD): Derived from TDA to capture topological patterns.

The MinMaxScaler was applied to normalize the Close prices, ensuring they fit the input requirements of the LSTM model.
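As a rough illustration of what min-max scaling does to the Close prices, and of the inverse step needed to bring predictions back to the original scale, here is a minimal pure-Python sketch. The helper names and the price values are hypothetical and only stand in for the library scaler used in the study.

```python
from typing import List, Tuple


def minmax_fit(values: List[float]) -> Tuple[float, float]:
    """Return the (min, max) pair that defines the scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return lo, hi


def minmax_transform(values: List[float], lo: float, hi: float) -> List[float]:
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


def minmax_inverse(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Undo the scaling, returning values on the original price scale."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical closing prices, for illustration only
closes = [100.0, 110.0, 105.0, 120.0]
lo, hi = minmax_fit(closes)
scaled = minmax_transform(closes, lo, hi)
restored = minmax_inverse(scaled, lo, hi)
print(scaled)
print(restored)
```

In a real pipeline the minimum and maximum would normally be fitted on the training portion only, so that no information from the test period leaks into the scaled inputs.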


Wasserstein Distance (WD) for Topological Insights

We computed persistent homology with the ripser library and derived the Wasserstein Distance (WD) between the resulting persistence diagrams for consecutive days' price changes. WD summarizes how much the topological structure of the data shifts from one day to the next, and these values are used as features in the LSTM model.


LSTM for Price Prediction

The LSTM neural network was trained using 3-day sliding windows of scaled prices and WD values. LSTM was chosen for its ability to capture long-term dependencies in sequential data. The predictions were inverse-transformed back to the original scale to ensure interpretability.
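The 3-day sliding windows described above can be sketched as follows. This is a minimal pure-Python illustration: the function name, the toy values, and the choice of the next day's scaled close as the target are assumptions, not details confirmed by the study.

```python
from typing import List, Sequence, Tuple


def make_windows(
    prices: Sequence[float], wd: Sequence[float], window: int = 3
) -> Tuple[List[List[List[float]]], List[float]]:
    """Build (samples, timesteps, features) windows and next-day targets.

    Each timestep holds two features: [scaled price, WD value].
    The target is assumed to be the scaled price on the day after the window.
    """
    if len(prices) != len(wd):
        raise ValueError("prices and wd must have the same length")
    X: List[List[List[float]]] = []
    y: List[float] = []
    for start in range(len(prices) - window):
        X.append([[prices[t], wd[t]] for t in range(start, start + window)])
        y.append(prices[start + window])
    return X, y


# Hypothetical scaled prices and WD values, for illustration only
scaled_close = [0.10, 0.20, 0.30, 0.40, 0.50]
wd_values = [0.01, 0.02, 0.03, 0.04, 0.05]
X, y = make_windows(scaled_close, wd_values)
print(len(X), len(X[0]), len(X[0][0]))  # samples, timesteps, features
```

Flattened, each window yields 3 x 2 = 6 values in the order price, WD, price, WD, price, WD, which is consistent with the six features examined in the SHAP analysis.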

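# LSTM Model Definition
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# X_train has shape (samples, timesteps, features): 3-day windows of scaled price and WD
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))  # regularization against overfitting
model.add(Dense(1))      # single output: the predicted scaled closing price
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, batch_size=32)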


Auto-Tuned SARIMAX for Realized Volatility Forecasting

We used auto_arima from the pmdarima library to find the optimal parameters for the SARIMAX model. The best parameters were selected based on AIC values, ensuring the best fit for volatility forecasting.

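from pmdarima import auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX

# data['pct_change'] must contain no NaNs (e.g. drop the first row produced by pct_change())
# Stepwise search over (p, q) and seasonal (P, Q) orders with seasonal period m=6
auto_model = auto_arima(
    data['pct_change'], start_p=1, max_p=3, start_q=1, max_q=3,
    seasonal=True, m=6, start_P=0, max_P=2, start_Q=0, max_Q=2, D=1,
    trace=True, stepwise=True
)
best_order = auto_model.order
best_seasonal_order = auto_model.seasonal_order

# Refit the selected specification with statsmodels
sarimax_model = SARIMAX(data['pct_change'], order=best_order, seasonal_order=best_seasonal_order)
sarimax_results = sarimax_model.fit()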
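To make the AIC-based selection concrete, here is a small sketch of how candidate specifications are compared. The candidate labels, log-likelihoods, and parameter counts are hypothetical placeholders, not results from the study; only the formula AIC = 2k - 2 ln(L) is standard.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: label -> (log-likelihood, number of estimated parameters)
candidates: Dict[str, Tuple[float, int]] = {
    "order A": (-500.0, 3),
    "order B": (-498.0, 6),
    "order C": (-505.0, 2),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

Here "order B" has the highest likelihood, yet its extra parameters are penalized enough that "order A" is preferred. This trade-off between fit and complexity is what the automated search applies to each specification it tries.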


SHAP Interpretation for Model Transparency

To enhance the interpretability of the LSTM model, we used SHAP values. SHAP identifies the contribution of each input feature to the final prediction, enabling transparent decision-making.

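import numpy as np
import shap

# model_predict, X_train_reshaped, X_test_reshaped, and feature_names are defined elsewhere.
# KernelExplainer works on 2-D arrays; 100 sampled training rows serve as the background set.
explainer = shap.KernelExplainer(model_predict, shap.sample(X_train_reshaped, 100))
shap_values = explainer.shap_values(X_test_reshaped)
shap.summary_plot(np.array(shap_values).squeeze(-1), X_test_reshaped, feature_names=feature_names)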


Results

Predicted Close Prices (Original Scale) for the Next 6 Days


Prediction made by LSTM


The LSTM predictions reveal minor fluctuations in the stock price over the forecast period, with a slight downward trend toward Day 6. These predictions provide actionable insights for short-term investors looking to plan their entry and exit points.


Forecasted Realized Volatility Using SARIMAX


Realized Volatility predicted by SARIMAX

The SARIMAX model captures volatility swings, with a significant dip on Day 4 and a recovery by Day 6. This forecast supports risk management by helping investors prepare for potential market instability.


SHAP explains 6 features

The SHAP summary plot visualizes the impact of the 6 features on the LSTM model's predictions. Each dot on the plot represents the SHAP value for a given feature and sample, showing how much each feature contributes to increasing or decreasing the predicted value.

Features and Their Interpretations:


Key Insights from the SHAP Plot:

  1. Closing Prices Drive the Predictions:
  2. Minimal Impact of WD (Wasserstein Distance):
  3. Color Gradient and Impact:



Overall, SHAP has gained widespread recognition in the machine learning literature for its ability to provide transparent and comprehensible explanations of model predictions, thereby bridging the gap between advanced machine learning techniques and practical decision-making. [5]

  • Most impactful features: Scaled closing prices from days 1, 2, and 3 (Features 1, 3, and 5).
  • Least impactful features: Wasserstein Distance values for the same days (Features 2, 4, and 6).
  • Interpretation: The model primarily relies on historical price trends to make predictions, while the topological insights from WD contribute minimally. This insight can help refine the model by either exploring better topological metrics or focusing on other financial indicators.

This SHAP analysis provides a transparent view of the LSTM model's predictions, helping us understand which features matter most and why.


Discussion and Insights

The combination of LSTM for price prediction, SARIMAX for volatility forecasting, and Wasserstein Distance from TDA offers a comprehensive framework for financial forecasting.

  • LSTM captures non-linear patterns in stock prices, while SARIMAX effectively models volatility trends.
  • SHAP values provide transparency, making the LSTM model more interpretable for traders.
  • The inclusion of Wasserstein Distance introduces topological features, although the SHAP analysis indicates that their contribution to the LSTM's predictions is minimal in the current setup.

These results demonstrate the power of hybrid models in capturing both price trends and volatility dynamics.


Conclusion

This study presents a hybrid approach to forecasting stock prices and realized volatility using LSTM, SARIMAX, and Topological Data Analysis. The results show that this framework provides accurate predictions and transparent interpretations, making it a valuable tool for traders and portfolio managers.

Future work could explore additional technical indicators (e.g., RSI, MACD) and incorporate external factors (e.g., macroeconomic variables) to further enhance prediction accuracy.

A deeper dive may enhance the HAR (Heterogeneous Autoregressive) model by incorporating Wasserstein Distance (WD) and control variables such as VIX and DXY. Such a HAR-WD model would capture the complex, multi-dimensional influences on stock volatility, offering a robust tool for forecasting and risk management. See "A generalization of the Topological Tail Dependence theory: From indices to individual stocks" [5].
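For orientation, one possible form of the HAR extension mentioned above is sketched here. The daily, weekly, and monthly terms follow the standard HAR-RV specification; the way WD and the control variables enter (the coefficients $\gamma$, $\delta_1$, $\delta_2$) is an assumption for illustration, not a specification taken from this study.

```latex
RV_{t+1} = \beta_0 + \beta_d RV_t + \beta_w RV_t^{(w)} + \beta_m RV_t^{(m)}
         + \gamma \, WD_t + \delta_1 VIX_t + \delta_2 DXY_t + \varepsilon_{t+1},
\qquad
RV_t^{(w)} = \frac{1}{5} \sum_{j=0}^{4} RV_{t-j},
\quad
RV_t^{(m)} = \frac{1}{22} \sum_{j=0}^{21} RV_{t-j}
```

Because the model is linear in its regressors, it can be estimated by ordinary least squares, and the significance of $\gamma$ gives a direct test of whether WD adds information beyond past volatility.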



References

  1. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory.
  2. Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice.
  3. Ripser: Efficient Persistent Homology. Available at: Ripser.
  4. Shapley, L.S. (1953). A Value for n-Person Games.
  5. Souto, H.G., & Moradi, A. (2024). A generalization of the Topological Tail Dependence theory: From indices to individual stocks.


What is Realized Volatility?

Realized volatility is a statistical measure of the actual or historical variability in the returns of a financial asset, such as a stock, over a specific period. It is computed using observed returns, typically from intra-day or daily prices, and provides insight into the extent to which the asset's price fluctuates in reality.

Realized volatility differs from implied volatility, which reflects the market’s expectations of future volatility.


How is Realized Volatility Calculated?

A common way to calculate realized volatility is to take the square root of the sum of squared returns over a given period. With daily returns, the formula is:

$$RV_t = \sqrt{\sum_{i=1}^{n} r_i^2}$$

Where:

  • $RV_t$: Realized volatility for time $t$.
  • $r_i$: Log return (percentage change) of the asset on day $i$.
  • $n$: Number of observations in the period (e.g., 30 days for monthly RV).

In practice, higher-frequency data (such as 5-minute or 15-minute prices) are sometimes used to compute more accurate measures of realized volatility.
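The calculation above can be illustrated with a short pure-Python sketch. The price series is hypothetical, and the function names are chosen for this example only.

```python
import math
from typing import List, Sequence


def log_returns(prices: Sequence[float]) -> List[float]:
    """Log return between each pair of consecutive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]


def realized_volatility(returns: Sequence[float]) -> float:
    """Square root of the sum of squared returns over the period."""
    return math.sqrt(sum(r * r for r in returns))


# Hypothetical daily closing prices, for illustration only
prices = [100.0, 102.0, 101.0, 103.0, 104.0]
r = log_returns(prices)
rv = realized_volatility(r)
print(r)
print(rv)
```

The same two functions apply unchanged to intraday data: passing 5-minute prices for a single day gives that day's realized volatility.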
