Forecasting Stock Prices and Realized Volatility: A Hybrid Approach Using LSTM, SARIMAX, and Topological Data Analysis
Larry Liang
Business Intelligence Engineer @ Costco Wholesale | Data Visualization Expert
Author: Larry Liang
Date: 2024/10/29
Abstract
This paper presents a hybrid approach to stock price and volatility forecasting, integrating machine learning models, traditional statistical techniques, and Topological Data Analysis (TDA). Specifically, we use a Long Short-Term Memory (LSTM) network to predict stock closing prices, a SARIMAX model to forecast realized volatility, and Wasserstein Distance (WD) from TDA to capture topological patterns in price changes. In addition, SHAP (SHapley Additive exPlanations) interpretation is used to enhance the transparency of the LSTM predictions. Our findings highlight the effectiveness of this hybrid framework for both price prediction and volatility forecasting, providing valuable insights for traders and portfolio managers.
Introduction
Financial markets are inherently complex, requiring sophisticated models to forecast stock prices and volatility. Traditional models such as ARIMA and SARIMA rely on time-series trends, while machine learning approaches can capture non-linear dependencies. This study aims to combine the strengths of both paradigms.
Additionally, we introduce Topological Data Analysis (TDA) to quantify persistent homology in price changes, enriching the feature set with Wasserstein Distance (WD) values. WD serves two purposes. First, it captures hidden market patterns: it reveals structural dependencies in market returns that traditional metrics miss. Second, it enhances prediction: during volatile periods, WD improves the model's ability to handle nonlinear market behavior. This hybrid framework offers a robust toolset for price forecasting, risk management, and decision-making.
Methodology
Data Collection and Preprocessing
We obtained historical data for Pinduoduo Inc. (PDD) over a period of 5 years from Yahoo Finance. The primary features used include:
The MinMaxScaler was applied to normalize the Close prices, ensuring they fit the input requirements of the LSTM model.
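As a rough illustration of what this normalization does, the sketch below applies the same min-max formula in plain Python, along with the inverse mapping later used to bring predictions back to the original price scale. The function names and the sample prices are hypothetical, and the code assumes the default [0, 1] target range; it is not the author's preprocessing code.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values linearly so that lo -> 0.0 and hi -> 1.0."""
    span = hi - lo  # assumes hi > lo, i.e. the prices are not all identical
    return [(v - lo) / span for v in values]


def inverse_scale(scaled, lo, hi):
    """Undo scale(): map values in [0, 1] back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical closing prices, for illustration only.
closes = [100.0, 120.0, 110.0, 140.0]
lo, hi = fit_min_max(closes)
scaled = scale(closes, lo, hi)
restored = inverse_scale(scaled, lo, hi)
print(scaled)
print(restored)
```

In a real pipeline the minimum and maximum would normally be fitted on the training split only, so that no information from the test period leaks into the scaling.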
Wasserstein Distance (WD) for Topological Insights
We applied persistent homology using the ripser library to derive Wasserstein Distances (WD) between consecutive days' price changes. WD captures hidden topological structures in the data, which are used as features in the LSTM model.
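For intuition about what a Wasserstein distance measures, the sketch below computes the 1-Wasserstein distance between two equal-size one-dimensional samples, where the optimal matching simply pairs sorted values. This is a deliberate simplification: it is not the persistence-diagram distance produced by the ripser-based pipeline described above, and the function name and sample values are hypothetical.

```python
def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For one-dimensional empirical distributions with the same number of
    points, the optimal transport plan matches the sorted values in order,
    so the distance is the mean absolute difference of the sorted samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    if not a:
        raise ValueError("samples must be non-empty")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical windows of daily price changes, for illustration only.
previous_window = [0.0, 1.0, 2.0]
current_window = [3.0, 1.0, 2.0]
print(wasserstein_1d(previous_window, current_window))
```

The distance is zero when the two samples have the same values regardless of order, and it grows as mass has to be moved further to turn one sample into the other.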
LSTM for Price Prediction
The LSTM neural network was trained using 3-day sliding windows of scaled prices and WD values. LSTM was chosen for its ability to capture long-term dependencies in sequential data. The predictions were inverse-transformed back to the original scale to ensure interpretability.
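# LSTM Model Definition
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()
# input_shape is (timesteps, features): 3-day windows of scaled prices and WD values
model.add(LSTM(50, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))  # drop 20% of units during training to limit overfitting
model.add(Dense(1))  # single output: the next scaled closing price
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, batch_size=32)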
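The 3-day sliding windows that feed the LSTM can be sketched as follows. This is a minimal plain-Python version under stated assumptions: each row holds a scaled close and a WD value, the target is the scaled close on the day after the window, and the name make_windows and the sample numbers are hypothetical rather than taken from the author's code.

```python
def make_windows(rows, targets, window=3):
    """Build (X, y) pairs from a time-ordered sequence.

    rows    : list of per-day feature lists, e.g. [scaled_close, wd]
    targets : list of per-day target values, aligned with rows
    Each sample in X holds `window` consecutive days; the matching entry
    in y is the target on the day immediately after that window.
    """
    X, y = [], []
    for end in range(window, len(rows)):
        X.append(rows[end - window:end])
        y.append(targets[end])
    return X, y


# Hypothetical scaled closes and WD values, for illustration only.
scaled_close = [0.10, 0.20, 0.15, 0.30, 0.25]
wd = [0.01, 0.03, 0.02, 0.05, 0.04]
rows = [[c, w] for c, w in zip(scaled_close, wd)]

X, y = make_windows(rows, scaled_close, window=3)
print(len(X), len(X[0]), len(X[0][0]))
```

Converting X with numpy.array would give the (samples, timesteps, features) array that the input_shape argument of the LSTM layer expects.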
Auto-Tuned SARIMAX for Realized Volatility Forecasting
We used auto_arima from the pmdarima library to find the optimal parameters for the SARIMAX model. The best parameters were selected based on AIC values, ensuring the best fit for volatility forecasting.
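from pmdarima import auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Stepwise search over (p, q) and seasonal (P, Q) orders with period m=6
auto_model = auto_arima(
    data['pct_change'], start_p=1, max_p=3, start_q=1, max_q=3,
    seasonal=True, m=6, start_P=0, max_P=2, start_Q=0, max_Q=2, D=1,
    trace=True, stepwise=True
)
best_order = auto_model.order
best_seasonal_order = auto_model.seasonal_order
# Refit a SARIMAX model in statsmodels using the selected orders
sarimax_model = SARIMAX(data['pct_change'], order=best_order, seasonal_order=best_seasonal_order)
sarimax_results = sarimax_model.fit()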
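The AIC-based selection mentioned above can be illustrated with a small sketch. AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood, and the candidate with the lowest value is preferred. The candidate orders, log-likelihoods, and parameter counts below are hypothetical placeholders, not results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_best(candidates):
    """Return the candidate key with the lowest AIC."""
    return min(
        candidates,
        key=lambda name: aic(candidates[name]["loglik"], candidates[name]["k"]),
    )


# Hypothetical fits keyed by (order, seasonal_order); values are made up.
candidates = {
    ((1, 0, 1), (0, 1, 1, 6)): {"loglik": -512.4, "k": 4},
    ((2, 0, 1), (1, 1, 1, 6)): {"loglik": -511.9, "k": 6},
    ((1, 0, 0), (0, 1, 0, 6)): {"loglik": -520.0, "k": 2},
}

best = select_best(candidates)
print(best, aic(candidates[best]["loglik"], candidates[best]["k"]))
```

The second candidate fits slightly better in likelihood terms, but its two extra parameters cost more than the improvement is worth, which is the trade-off AIC is designed to capture.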
SHAP Interpretation for Model Transparency
To enhance the interpretability of the LSTM model, we used SHAP values. SHAP identifies the contribution of each input feature to the final prediction, enabling transparent decision-making.
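import numpy as np
import shap
# model_predict is assumed to wrap model.predict, restoring the flattened 2-D rows to (samples, timesteps, features)
explainer = shap.KernelExplainer(model_predict, shap.sample(X_train_reshaped, 100))
shap_values = explainer.shap_values(X_test_reshaped)
shap.summary_plot(np.array(shap_values).squeeze(-1), X_test_reshaped, feature_names=feature_names)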
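To make the idea of a feature's "contribution" concrete, the sketch below computes exact Shapley values for a tiny toy model by enumerating every feature subset. It is an illustration under stated assumptions: absent features are replaced by a single baseline value, the toy model and its inputs are hypothetical, and the enumeration is only feasible for a handful of features. KernelExplainer approximates the same quantity by sampling against a background dataset.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    A feature that is 'absent' from a coalition takes its baseline value.
    The cost grows as 2^n, so this is only practical for small n.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(z):
    """Hypothetical model with one interaction term between features 0 and 1."""
    return 2 * z[0] + 3 * z[1] - z[2] + z[0] * z[1]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The values sum to the difference between the prediction at x and the prediction at the baseline, and the interaction term is split evenly between the two features that share it.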
Results
Predicted Close Prices (Original Scale) for the Next 6 Days
The LSTM predictions reveal minor fluctuations in the stock price over the forecast period, with a slight downward trend toward Day 6. These predictions provide actionable insights for short-term investors looking to plan their entry and exit points.
Forecasted Realized Volatility Using SARIMAX
The SARIMAX model captures volatility swings, with significant dips on Day 4 and a recovery by Day 6. This forecast is essential for risk management, helping investors prepare for potential market instability.
The SHAP summary plot visualizes the impact of the 6 features on the LSTM model's predictions. Each dot on the plot represents the SHAP value for a given feature and sample, showing how much each feature contributes to increasing or decreasing the predicted value.
Features and Their Interpretations:
Key Insights from the SHAP Plot:
Altogether, this interpretability tool has gained widespread recognition in the machine learning literature for its ability to provide transparent and comprehensible explanations of model predictions, thereby bridging the gap between advanced machine learning techniques and practical decision-making. [5]
This SHAP analysis provides a transparent view of the LSTM model's predictions, helping us understand which features matter most and why.
Discussion and Insights
The combination of LSTM for price prediction, SARIMAX for volatility forecasting, and Wasserstein Distance from TDA offers a comprehensive framework for financial forecasting.
These results demonstrate the power of hybrid models in capturing both price trends and volatility dynamics.
Conclusion
This study presents a hybrid approach to forecasting stock prices and realized volatility using LSTM, SARIMAX, and Topological Data Analysis. The results show that this framework provides accurate predictions and transparent interpretations, making it a valuable tool for traders and portfolio managers.
Future work could explore additional technical indicators (e.g., RSI, MACD) and incorporate external factors (e.g., macroeconomic variables) to further enhance prediction accuracy.
A deeper dive could extend the Heterogeneous Autoregressive (HAR) model by incorporating Wasserstein Distance (WD) and control variables such as VIX and DXY. Such a HAR-WD model would aim to capture the complex, multi-dimensional influences on stock volatility, offering a robust tool for forecasting and risk management.
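One way the regressors of such a HAR-WD model might be assembled is sketched below. It assumes the standard HAR lags (the latest daily value plus 5-day and 22-day averages of realized volatility) with that day's WD value appended as an extra column. The function name, the lag alignment, and the input series are hypothetical; control variables such as VIX and DXY would be added as further columns, and the regression fit itself is not shown.

```python
def har_wd_rows(rv, wd, week=5, month=22):
    """Build (features, target) pairs for a HAR regression extended with WD.

    rv : list of daily realized volatility values, oldest first
    wd : list of daily Wasserstein Distance values, aligned with rv
    Features on day t are [RV_t, mean of last 5 RV, mean of last 22 RV, WD_t];
    the target is RV on day t + 1.
    """
    rows = []
    for t in range(month - 1, len(rv) - 1):
        daily = rv[t]
        weekly = sum(rv[t - week + 1:t + 1]) / week
        monthly = sum(rv[t - month + 1:t + 1]) / month
        rows.append(([daily, weekly, monthly, wd[t]], rv[t + 1]))
    return rows


# Hypothetical series (1.0, 2.0, ..., 24.0), for illustration only.
rv = [float(i) for i in range(1, 25)]
wd = [0.1] * 24
rows = har_wd_rows(rv, wd)
features, target = rows[0]
print(features, target)
```

Each feature row only uses information available up to day t, so the resulting design matrix can be passed to an ordinary least squares fit without look-ahead.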
This article summarizes our LSTM, SARIMAX, SHAP, and TDA-based financial forecasting exercise, showcasing the hybrid framework's strength in capturing both price movements and volatility dynamics.
References
What is Realized Volatility?
Realized volatility is a statistical measure of the actual or historical variability in the returns of a financial asset, such as a stock, over a specific period. It is computed using observed returns, typically from intra-day or daily prices, and provides insight into the extent to which the asset's price fluctuates in reality.
Realized volatility differs from implied volatility, which reflects the market’s expectations of future volatility.
How is Realized Volatility Calculated?
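A common way to calculate realized volatility is to take the square root of the sum of squared returns over a period. When the mean return is close to zero, this equals the standard deviation of returns scaled by the square root of the number of observations. If we assume daily returns are available, the formula is:
\[ RV_t = \sqrt{\sum_{i=1}^{n} r_i^2} \]
Where r_i is the i-th return observed within period t and n is the number of returns in that period.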
In practice, higher-frequency data (such as 5-minute or 15-minute prices) are sometimes used to compute more accurate measures of realized volatility.
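The calculation described above can be sketched in a few lines of Python. The example uses simple percentage returns, in line with the pct_change series used earlier; log returns are a common alternative. The function names and the three sample prices are hypothetical.

```python
import math


def simple_returns(prices):
    """Period-over-period simple returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]


def realized_volatility(returns):
    """Square root of the sum of squared returns over the period."""
    return math.sqrt(sum(r * r for r in returns))


# Hypothetical prices, for illustration only: +10% then -10%.
prices = [100.0, 110.0, 99.0]
returns = simple_returns(prices)
rv = realized_volatility(returns)
print(returns, rv)
```

With intraday data the same two functions apply unchanged: the prices would simply be sampled every 5 or 15 minutes within the day instead of once per day.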