Time Series Analysis with SARIMAX, LSTM, and FB Prophet in Python: Commodity Price Forecasting 2023-2024 for Crude Oil and Gold Price
Muhammad Aftab Ahmed
Sr. Data Engineer, Applied AI @ Afiniti | Sr. Business Analyst @ xByco | MBA | PGD - Big Data Analytics | BE
Commodity price forecasting is a challenging yet captivating task that has grabbed the attention of many professionals. However, with the advancement of artificial intelligence, Python's advanced machine learning models can now provide more accurate and insightful predictions. In this article, we focus on forecasting gold prices based on the FED rate and WTI crude oil prices based on the Dollar index for the years 2023-2024 using SARIMAX, Facebook Prophet, and LSTM deep learning time series analysis. This article serves as a comprehensive guide for investors, data scientists, and market analysts who are keen to leverage data-driven insights in the commodity markets.
Topics Covered:
Prerequisites for Time Series Forecasting
Importance of Stationarity in Time Series Forecasting
In time series analysis, stationarity is a fundamental concept, primarily because the underlying assumption of many statistical models is that the statistical properties of the data do not change over time. This implies that parameters such as mean, variance, and autocorrelation remain constant regardless of the time frame.
For non-stationary data, the behavior of the data changes over time. This time-dependent structure introduces complexities that make it challenging for a model to learn the underlying patterns accurately. The accuracy of the model can be severely impacted if these underlying assumptions of stationarity are violated, leading to unreliable and unstable forecasts.
That's why the first step in time series analysis is often to make the series stationary. Once the series is transformed into a stationary series, we can then use statistical methods to analyze it and create models to predict future values.
Decomposing Time Series Data into Trend, Seasonality, and Residuals
Decomposition in time series analysis is the process of breaking down a time series into three constituent parts: the trend, the seasonality, and the residuals (the random variation left once the trend and seasonal pattern are removed).
Additive and Multiplicative Seasonality
Seasonality is an important characteristic of time series data, where patterns repeat over regular intervals. There are two main types: additive and multiplicative. It's crucial to understand these types and their distinctions to choose the right model for analyzing time series data.
In an additive time series, the components of the time series (level, trend, seasonality, and noise) add together. The mathematical model for an additive time series is:
y(t) = Level + Trend + Seasonality + Noise
Here:
Level refers to the average value in the series.
Trend represents any measurable pattern of the data over time.
Seasonality corresponds to the repeating short-term cycle in the series.
Noise is the random variation in the series.
A key attribute of additive seasonality is that the magnitude of the seasonal fluctuations, i.e., the difference between the peaks and the troughs, does not vary with the time series' level. In other words, the variation around the trend does not change over different time periods. This feature makes additive seasonality particularly suited to time series data where the magnitude of the seasonal pattern does not depend on the level of the time series.
In the above example, the trend is upward while the magnitude of the seasonal fluctuations remains constant; therefore, this is additive seasonality.
Contrastingly, in a multiplicative time series, the components of the time series multiply together. The mathematical model for a multiplicative time series is:
y(t) = Level * Trend * Seasonality * Noise
Multiplicative seasonality is characterized by an increasing or decreasing seasonal amplitude over time, dependent on the time series' level. It is especially useful when the variation in the seasonal pattern grows or shrinks as the data values increase or decrease. A critical point to note is that in multiplicative seasonality, the seasonal swings appear to "widen" as time progresses.
In the above example, the trend is upward and the magnitude of the seasonal fluctuations increases with it; therefore, this is multiplicative seasonality.
A notable attribute of multiplicative time series data is that they can often be converted into additive time series through transformations such as taking logarithms. This transformation can be beneficial in certain scenarios.
If a seasonal time series shows multiplicative seasonality, it may often be beneficial to transform it into an additive series by taking the log. By taking the logarithm of the data, the multiplicative components convert to additive components, and the variance stabilizes.
This log transformation is particularly useful when the amplitude of the seasonal component is proportional to the level of the series, which leads to an increasing or decreasing seasonal amplitude. The transformation thus helps to stabilize the increasing or decreasing seasonal amplitude, allowing for more straightforward forecasting.
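Below is a minimal sketch of this idea on a synthetic series (the dates and values are illustrative assumptions, not taken from the article's dataset): a series whose seasonal swings grow with its level becomes roughly constant-amplitude after taking the log.
import numpy as np
import pandas as pd

# A toy series whose seasonal swings grow with its level (multiplicative seasonality)
idx = pd.date_range('2000-01-01', periods=120, freq='MS')
level = pd.Series(np.linspace(100, 400, 120), index=idx)
seasonal = 1 + 0.2 * np.sin(2 * np.pi * np.asarray(idx.month) / 12)
y = level * seasonal

# log(y) = log(level) + log(seasonal): the multiplicative structure becomes additive
y_log = np.log(y)

# The yearly spread keeps growing on the original scale but is roughly stable after the log
print(y.groupby(y.index.year).std().head())
print(y_log.groupby(y_log.index.year).std().head())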
Decomposition of Gold Price Data from 1970 onwards
In Python, time series decomposition can be done using the seasonal_decompose function provided by the statsmodels library as shown below
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Decompose 'Gold Price (oz/$)'
decomp_results_gold = seasonal_decompose(df['Gold Price (oz/$)'], period=12)

# Plot decomposed data
plt.rcParams["figure.figsize"] = (10, 10)
fig = decomp_results_gold.plot()
fig.suptitle('Decomposition of Gold Price', y=1.02)
plt.show()
The above decomposition of gold prices from 1970 onwards can be elaborated as follows:
Thus, gold prices are non-stationary due to the presence of an upward trend and additive seasonality.
Decomposition of WTI Crude Oil Prices from 1983 onwards
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Decompose 'Crude_Oil_WTI_($/bbl)'
decomp_results_oil = seasonal_decompose(df['Crude_Oil_WTI_($/bbl)'], period=12)

# Plot decomposed data
plt.rcParams["figure.figsize"] = (10, 10)
fig = decomp_results_oil.plot()
plt.subplots_adjust(hspace=0.5)  # Adjust the spacing between subplots
plt.suptitle('Decomposition of Crude_Oil_WTI', y=1.02)  # Adjust the main title position
plt.show()
The above decomposition of WTI crude oil prices from 1983 onwards can be elaborated as follows:
Thus, WTI crude oil prices are non-stationary due to the presence of a weaker upward trend and additive seasonality.
Checking Stationarity with KPSS and Augmented Dickey-Fuller Tests
While decomposing data into trend, seasonality, and residuals is one method of checking stationarity, it is not the only approach. Evaluating stationarity requires experience that develops over time.
Therefore, another way to assess stationarity is through statistical tests. These tests are designed to analyze the statistical properties of the data and determine if they remain constant over time. Utilizing statistical tests provides a rigorous and objective means of determining whether the data exhibits stationarity.
Augmented Dickey-Fuller (ADF) Test
The ADF test is a type of unit root test. The "unit root" refers to a scenario where a time series is non-stationary and behaves in a way that's statistically similar to a random walk. This is problematic because a series that resembles a random walk is unpredictable, with sharp shifts and changes that don't adhere to a particular pattern.
In the context of the ADF test, the null hypothesis assumes the presence of a unit root, i.e., the series is non-stationary. If the p-value obtained from the test is less than the significance level (commonly 0.05), we reject the null hypothesis, inferring that the time series is stationary.
Implementing the ADF test in Python is straightforward, using the adfuller function from the statsmodels library
from statsmodels.tsa.stattools import adfuller

print("adfuller test results shown below")
result = adfuller(df['Gold Price (oz/$)'])
print(result)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
if result[1] < 0.05:
    print("Null hypothesis that data is non-stationary is rejected. Thus, data is stationary.")
else:
    print("Null hypothesis that data is non-stationary cannot be rejected. Thus, data is non-stationary.")
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
As the p-value is 0.96 > 0.05, the gold price data is non-stationary as per the ADF test.
from statsmodels.tsa.stattools import adfuller

print("adfuller test results shown below")
result = adfuller(df['Crude_Oil_WTI_($/bbl)'])
print(result)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
if result[1] < 0.05:
    print("Null hypothesis that data is non-stationary is rejected. Thus, data is stationary.")
else:
    print("Null hypothesis that data is non-stationary cannot be rejected. Thus, data is non-stationary.")
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
As the p-value is 0.18 > 0.05, the WTI crude oil price data is non-stationary as per the ADF test.
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
The KPSS test, in contrast to the ADF test, examines the null hypothesis that a time series is stationary around a deterministic trend (trend-stationarity), as opposed to being non-stationary due to a unit root. This distinction is crucial. A process could be stationary around a deterministic trend (where fluctuations around the trend will always revert to the trend and the variations are relatively stable), which is an entirely different form of non-stationarity than having a unit root.
In the KPSS test, the null hypothesis is that the time series data is stationary (or trend-stationary). If the test statistic is greater than the critical value, we reject the null hypothesis, which means the series is not stationary.
To perform a KPSS test in Python, you can use the kpss function from the statsmodels library
from statsmodels.tsa.stattools import kpss

def kpss_test(series, **kw):
    statistic, p_value, n_lags, critical_values = kpss(series, **kw)

    # Format Output
    print(f'KPSS Statistic: {statistic}')
    print(f'p-value: {p_value}')
    print(f'num lags: {n_lags}')
    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'   {key} : {value}')

    if p_value < 0.05:
        print("Result: The series is not stationary")
        print("Null hypothesis that data is stationary is rejected. Thus, data is non-stationary.")
    else:
        print("Result: The series is stationary")
        print("Null hypothesis that data is stationary cannot be rejected. Thus, data is stationary.")

kpss_test(df['Gold Price (oz/$)'])
As the p-value is 0.01 < 0.05, the gold price data is non-stationary as per the KPSS test as well.
from statsmodels.tsa.stattools import kpss

def kpss_test(series, **kw):
    statistic, p_value, n_lags, critical_values = kpss(series, **kw)

    # Format Output
    print(f'KPSS Statistic: {statistic}')
    print(f'p-value: {p_value}')
    print(f'num lags: {n_lags}')
    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'   {key} : {value}')

    if p_value < 0.05:
        print("Result: The series is not stationary")
        print("Null hypothesis that data is stationary is rejected. Thus, data is non-stationary.")
    else:
        print("Result: The series is stationary")
        print("Null hypothesis that data is stationary cannot be rejected. Thus, data is stationary.")

kpss_test(df['Crude_Oil_WTI_($/bbl)'])
As the p-value is 0.01 < 0.05, the WTI crude oil data is non-stationary as per the KPSS test as well.
Making Data Stationary | Plotting ACF and PACF Plots
There are several ways to make your time series data stationary, including differencing, seasonal differencing, and transformation.
When it comes to mixed differencing, some good rules of thumb are that you should never use more than one order of seasonal differencing, and never more than two orders of differencing in total.
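As a quick sketch of these options in pandas (assuming df is the monthly gold price DataFrame used later in this article, with a 'Gold Price (oz/$)' column):
import pandas as pd

series = df['Gold Price (oz/$)']

first_diff = series.diff().dropna()           # first-order differencing (d = 1)
seasonal_diff = series.diff(12).dropna()      # seasonal differencing with period 12 (D = 1)
mixed_diff = series.diff(12).diff().dropna()  # one seasonal plus one regular difference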
Plotting ACF and PACF Plots | Finding the Values of p and q
ACF and PACF plots provide a visual representation of the correlation between points in a time series, based on the time separation, or "lag" between the points. They are essential tools for understanding the properties of the time series data and to identify patterns that can assist in the development of predictive models.
Specifically, these plots can help determine the parameters of an AutoRegressive Integrated Moving Average (ARIMA) model or its seasonal counterpart (SARIMA), which are commonly used models for forecasting in time series analysis
The Autocorrelation Function plot, also known as a Correlogram, gives us the correlation between the time series and its lagged values. It shows us the strength and direction of the linear relationship between observations at different time steps. The ACF plot is used to identify the 'MA(q)' component of an ARIMA model, where 'q' is the highest lag in the data where the ACF plot crosses the significance level.
In Python, the ACF plot can be generated using the plot_acf function from the statsmodels.graphics.tsaplots module. You will need a time series dataset to pass into this function, which will then generate and display the ACF plot.
In the above example, the ACF plot cuts off after lag 3; therefore q = 3, i.e., MA(3).
On the other hand, the Partial Autocorrelation Function plot provides the partial correlation of a time series with its own lags. It controls for all other lags; the PACF at lag 'k' is the correlation that results after removing the effect of any correlations due to the terms at shorter lags. The PACF plot is used to identify the 'AR(p)' component of an ARIMA model, where 'p' is the highest lag in the data where the PACF plot crosses the significance level.
In Python, the PACF plot can be generated using the plot_pacf function from the statsmodels.graphics.tsaplots module. As with the ACF plot, you will need a time series dataset to pass into this function, which will then generate and display the PACF plot.
In the above example, the PACF plot cuts off after lag 2; therefore p = 2, i.e., AR(2).
Refer to the rules of thumb below to read candidate p and q values from the ACF and PACF plots:
AR(p): the ACF tails off gradually, while the PACF cuts off after lag p.
MA(q): the ACF cuts off after lag q, while the PACF tails off gradually.
ARMA(p, q): both the ACF and the PACF tail off gradually.
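As a minimal illustration of generating both plots side by side (assuming stationary_series is a differenced, stationary series such as the first-order differenced gold price):
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(stationary_series, lags=36, ax=axes[0])   # read q from where the ACF cuts off
plot_pacf(stationary_series, lags=36, ax=axes[1])  # read p from where the PACF cuts off
plt.tight_layout()
plt.show()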
Making Gold Price Data Stationary | Plotting ACF and PACF
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf, adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from pmdarima.arima import auto_arima
def apply_differencing(df, column='value', seasonality=None):
    """
    Apply different differencing methods on the given DataFrame.

    Parameters:
        - df: Pandas DataFrame
            The input DataFrame containing the time series data.
        - column: str, optional (default='value')
            The name of the column to analyze.
        - seasonality: int, optional (default=None)
            The seasonality period for seasonal differencing.

    Returns:
        - differencing_results: dict
            A dictionary containing the differencing methods and their results.
    """
    differencing_results = {}

    # Convert the column to numeric type
    df[column] = pd.to_numeric(df[column], errors='coerce')

    # Original data
    differencing_results['Original'] = df[column]

    # First-order differencing
    differencing_results['First Order'] = df[column].diff().dropna()

    # Seasonal differencing
    if seasonality is not None:
        differencing_results['Seasonal'] = df[column].diff(periods=seasonality).dropna()

    # Logarithmic differencing
    differencing_results['Log'] = np.log(df[column]).diff().dropna()

    # Plot differencing results
    fig, axes = plt.subplots(len(differencing_results), 1, figsize=(15, 5 * len(differencing_results)))
    for i, (method, series) in enumerate(differencing_results.items()):
        axes[i].plot(series)
        axes[i].set_ylabel(column)
        axes[i].set_xlabel('Date')
        axes[i].set_title(f'Differencing Method: {method}')
    plt.tight_layout()
    plt.show()

    # Perform ADF and KPSS tests
    for method, series in differencing_results.items():
        print(f'\n##### {method} Differencing #####')
        print('\nADF Test:')
        perform_adf_test(series)
        print('\nKPSS Test:')
        perform_kpss_test(series)
        print('\nOptimal ARMA Model:')
        find_optimal_arma(series, method)

    return differencing_results

def perform_adf_test(series):
    result = adfuller(series)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    if result[1] < 0.05:
        print("ADF Test: The series appears to be stationary")
    else:
        print("ADF Test: The series appears to be non-stationary")

def perform_kpss_test(series):
    statistic, p_value, n_lags, critical_values = kpss(series)
    print(f'KPSS Statistic: {statistic}')
    print(f'p-value: {p_value}')
    print(f'num lags: {n_lags}')
    if p_value < 0.05:
        print("KPSS Test: The series appears to be non-stationary")
    else:
        print("KPSS Test: The series appears to be stationary")

def find_optimal_arma(series, method):
    stepwise_model = auto_arima(series, start_p=0, start_q=0,
                                max_p=3, max_q=3, m=1,
                                seasonal=False,
                                suppress_warnings=True,
                                trace=False,
                                error_action='ignore',
                                stepwise=True)
    best_order = stepwise_model.order
    print(f'{method} Differencing - Optimal ARMA(p,q): {best_order}')
# Load the dataset
df = pd.read_csv('Historical_Data_XAU_USD.csv', index_col='Date', parse_dates=True)
# Remove commas and convert Gold Price to float
df['Gold Price (oz/$)'] = df['Gold Price (oz/$)'].str.replace(',', '').astype(float)
# Apply differencing methods
differencing_results = apply_differencing(df, column='Gold Price (oz/$)', seasonality=12)
Making WTI Crude Oil Data Stationary | Plotting ACF and PACF
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf, adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from pmdarima.arima import auto_arima
def apply_differencing(df, column='value', seasonality=None):
    """
    Apply different differencing methods on the given DataFrame.

    Parameters:
        - df: Pandas DataFrame
            The input DataFrame containing the time series data.
        - column: str, optional (default='value')
            The name of the column to analyze.
        - seasonality: int, optional (default=None)
            The seasonality period for seasonal differencing.

    Returns:
        - differencing_results: dict
            A dictionary containing the differencing methods and their results.
    """
    differencing_results = {}

    # Convert the column to numeric type
    df[column] = pd.to_numeric(df[column], errors='coerce')

    # Original data
    differencing_results['Original'] = df[column]

    # First-order differencing
    differencing_results['First Order'] = df[column].diff().dropna()

    # Seasonal differencing
    if seasonality is not None:
        differencing_results['Seasonal'] = df[column].diff(periods=seasonality).dropna()

    # Logarithmic differencing
    differencing_results['Log'] = np.log(df[column]).diff().dropna()

    # Plot differencing results
    fig, axes = plt.subplots(len(differencing_results), 1, figsize=(15, 5 * len(differencing_results)))
    for i, (method, series) in enumerate(differencing_results.items()):
        axes[i].plot(series)
        axes[i].set_ylabel(column)
        axes[i].set_xlabel('Date')
        axes[i].set_title(f'Differencing Method: {method}')
    plt.tight_layout()
    plt.show()

    # Perform ADF and KPSS tests
    for method, series in differencing_results.items():
        print(f'\n##### {method} Differencing #####')
        print('\nADF Test:')
        perform_adf_test(series)
        print('\nKPSS Test:')
        perform_kpss_test(series)
        print('\nOptimal ARMA Model:')
        find_optimal_arma(series, method)

    return differencing_results

def perform_adf_test(series):
    result = adfuller(series)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    if result[1] < 0.05:
        print("ADF Test: The series appears to be stationary")
    else:
        print("ADF Test: The series appears to be non-stationary")

def perform_kpss_test(series):
    statistic, p_value, n_lags, critical_values = kpss(series)
    print(f'KPSS Statistic: {statistic}')
    print(f'p-value: {p_value}')
    print(f'num lags: {n_lags}')
    if p_value < 0.05:
        print("KPSS Test: The series appears to be non-stationary")
    else:
        print("KPSS Test: The series appears to be stationary")

def find_optimal_arma(series, method):
    stepwise_model = auto_arima(series, start_p=0, start_q=0,
                                max_p=3, max_q=3, m=1,
                                seasonal=False,
                                suppress_warnings=True,
                                trace=False,
                                error_action='ignore',
                                stepwise=True)
    best_order = stepwise_model.order
    print(f'{method} Differencing - Optimal ARMA(p,q): {best_order}')
# Load the dataset
df = pd.read_csv('Crude_Oil_WTI.csv', index_col='Date', parse_dates=True)
# Apply differencing methods
differencing_results = apply_differencing(df, column='Crude_Oil_WTI_($/bbl)', seasonality=12)
After applying appropriate differencing to make the dataset stationary, the next step in our time series analysis is to plot the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots. These plots help us identify the optimal values for the parameters 'p' and 'q' in the ARIMA model.
To make this part interactive and engaging, I encourage you to participate by posting your observations and thoughts on the ACF and PACF plots in the comment section below. Together, we can confirm the most suitable values for 'p' and 'q' and continue our exciting journey into time series forecasting.
Exploring the ARIMA Spectrum: AR | MA | ARMA | ARIMA | SARIMA and SARIMAX
ARIMA modeling is a key technique in time series analysis. Founded on mid-20th-century statistical theory, it's designed to predict future data points by understanding the past. ARIMA models are built on ideas of stationarity, differencing, and autocorrelation, aiming to identify and understand any trends or patterns in the time series data
This makes them great for working on data with clear trends or seasonality, like stock prices or weather patterns. Because of their adaptability, ARIMA models have found use in a wide range of fields, from finance to engineering, making them an important tool in data analysis. This versatile and dependable modeling approach continues to be popular in the data science community
Autoregressive (AR) Models
AR models are based on the concept of autocorrelation. Autocorrelation is a statistical property where a variable is correlated with a lagged version of itself. In AR models, we assume that the current value of our time series can be predicted from its previous values, or lags. The number of lags used as predictors is called the order of the AR model. For example, an AR(1) model would predict the current value based on the immediately preceding value, while an AR(2) model would use the last two values, and so forth.
Moving Average (MA) Models
MA models, on the other hand, do not use past values of the time series as predictors. Instead, they use a series of past forecast errors in a "moving window" as input variables. The idea is that if a random disturbance causes a fluctuation at one point, it's likely to create similar disturbances in the near future. A model order is again defined for an MA model - for instance, MA(1) uses the error at the immediately preceding time step to predict the current value.
Autoregressive Moving Average (ARMA) Models
ARMA models are simply a combination of AR and MA models. They use both past observations of the time series and past forecast errors to predict future values. In other words, they blend the idea that the present is a function of the past (AR) and that errors made in the past propagate into the future (MA).
Autoregressive Integrated Moving Average (ARIMA) Models
ARIMA models extend ARMA models by considering the time series' order of differencing to make it stationary (i.e., removing trends and seasonality from the data). The "integrated" in ARIMA refers to this differencing process, indicating that the model uses differences of the time series at different lags to predict future values. An ARIMA model is usually noted as ARIMA(p,d,q), where p is the order of the AR part, d is the order of differencing, and q is the order of the MA part.
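As a minimal sketch of how these model families map onto orders in statsmodels (y is assumed to be a monthly price series; the orders shown are purely illustrative, not tuned for the datasets in this article):
from statsmodels.tsa.arima.model import ARIMA

ar_model    = ARIMA(y, order=(2, 0, 0)).fit()   # AR(2): two autoregressive lags
ma_model    = ARIMA(y, order=(0, 0, 3)).fit()   # MA(3): three lagged forecast errors
arma_model  = ARIMA(y, order=(2, 0, 3)).fit()   # ARMA(2,3): both components, no differencing
arima_model = ARIMA(y, order=(2, 1, 3)).fit()   # ARIMA(2,1,3): one order of differencing added
print(arima_model.summary())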
Seasonal Autoregressive Integrated Moving Average (SARIMA) Models
SARIMA models extend ARIMA by adding a seasonal difference to handle seasonal components in the time series. In other words, besides considering dependencies at previous time steps and past forecast errors (like ARIMA), they also account for patterns that repeat seasonally. These models are often used for series that show a clear seasonal pattern, such as electricity consumption, tourist arrivals, and monthly sales data.
Seasonal Autoregressive Integrated Moving Average with Explanatory Variable (SARIMAX) Models
Finally, SARIMAX extends SARIMA by incorporating an external or exogenous variable, which might improve the model's performance. Exogenous variables are those that are not part of the time series but have a relationship with it. For example, in predicting the sales of a soda company, the average temperature can be an exogenous variable.
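A minimal sketch of a SARIMA versus SARIMAX specification in statsmodels, assuming a monthly DataFrame df with the gold price and FED rate columns used later in this article (the orders are illustrative, not tuned):
from statsmodels.tsa.statespace.sarimax import SARIMAX

sarima = SARIMAX(df['Gold Price (oz/$)'],
                 order=(1, 1, 1),               # non-seasonal (p, d, q)
                 seasonal_order=(1, 1, 1, 12))  # seasonal (P, D, Q, m) with m = 12 months

sarimax = SARIMAX(df['Gold Price (oz/$)'],
                  exog=df['FED Rate'],          # exogenous regressor turns SARIMA into SARIMAX
                  order=(1, 1, 1),
                  seasonal_order=(1, 1, 1, 12))
sarimax_fit = sarimax.fit(disp=False)
print(sarimax_fit.summary())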
Facebook's Prophet Model | Open Source Library for Time Series Forecasting
Facebook Prophet was developed by Facebook's Core Data Science team and open-sourced in 2017. It is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. The model was designed to handle the common features seen in business time series, such as seasonality, trends, and holidays.
Prophet is robust to missing data and shifts in the trend, and typically handles outliers well, making it more flexible in handling various kinds of 'real-world' data. Its flexibility and ease of use — it doesn't require complex configuration and has automatic feature engineering for time-based components — have led to its increasing popularity in time series forecasting.
Components | Trend, Seasonality, and Holidays in Prophet Architecture
The Prophet model uses an additive regression model, incorporating four primary components: a piecewise linear or logistic growth curve trend, a yearly seasonal component modeled using Fourier series, a weekly seasonal component using dummy variables, and a user-provided list of important holidays.
Changes in trends are automatically detected and defined as change points in the data, while the seasonal components are modeled with Fourier series to allow complex seasonal patterns. Holidays are modeled with individual dummy variables, allowing for versatile modeling of different holiday effects.
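A minimal sketch of wiring these components together in Prophet; the holiday dates and settings below are illustrative assumptions, not values taken from the article's models:
import pandas as pd
from prophet import Prophet

# Hypothetical holiday/event calendar with an effect window of one extra day
holidays = pd.DataFrame({
    'holiday': 'fomc_meeting',
    'ds': pd.to_datetime(['2022-03-16', '2022-06-15', '2022-09-21', '2022-12-14']),
    'lower_window': 0,
    'upper_window': 1,
})

m = Prophet(growth='linear',               # piecewise linear trend with automatic changepoints
            changepoint_prior_scale=0.05,  # how flexible the trend is at the changepoints
            yearly_seasonality=10,         # Fourier order of the yearly seasonal component
            weekly_seasonality=False,
            holidays=holidays)
m.add_regressor('FED Rate')                # extra regressor, as used later in this article
# m.fit(df_prophet)                        # df_prophet must contain 'ds', 'y' and 'FED Rate'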
Optimization Techniques in Prophet: Newton-Raphson and L-BFGS
Optimization in Prophet is executed to fit the model parameters, including the trend and seasonality components. Two significant optimizers employed in Prophet are the Newton-Raphson and Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) methods.
Newton-Raphson: a second-order method that uses both gradient and curvature (Hessian) information to locate the optimum, typically converging in very few iterations when the number of parameters is modest.
L-BFGS: Complementing the Newton-Raphson method in Prophet is the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimizer. This method is a memory-efficient variant of the standard BFGS algorithm, using gradient information to identify the minimum of the function. L-BFGS becomes particularly beneficial when dealing with a large number of parameters, as it significantly reduces both computational complexity and memory requirements, thereby enhancing the model's efficiency and scalability.
Data Preprocessing: Handling Outliers, Missing Data and Scaling in Prophet
Prophet has been designed to handle missing data and outliers. For missing data, as long as there are not large gaps, Prophet will provide a forecast by fitting the trend and seasonality.
Outliers can be removed from the training data, or you can adjust their values before fitting the model. However, due to the robustness of the Prophet model, often these steps might not be necessary.
In terms of scaling, Prophet does not require the input data to be stationary, and it also does not require any specific scaling of the input data. This makes Prophet easy to use on many types of time series data with minimal preprocessing.
Data Forecasting: Intervals, Horizon and Model Evaluation in Prophet
In Prophet, you can set the forecasting horizon by specifying the number of periods and the frequency (e.g., days, weeks) of the forecasts. This provides flexibility in generating forecasts for different future periods.
Prophet also provides functionality to include uncertainty intervals in the forecasts, which provide a range for the future values instead of a single prediction. These intervals are based on the uncertainty in the trend and seasonality, and do not include any uncertainty from additional regressors.
To evaluate the forecasting performance, Prophet provides functionality to perform cross-validation to assess the accuracy of the model's forecasts. This is done by selecting cutoff points in the history, and for each of them, fitting the model using data only up to that cutoff point, and then comparing the forecast to the actual values.
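A minimal sketch of setting the horizon and running Prophet's built-in cross-validation, assuming m is an already fitted Prophet model without extra regressors (models with added regressors also need those regressor columns in the future dataframe):
from prophet.diagnostics import cross_validation, performance_metrics

future = m.make_future_dataframe(periods=18, freq='MS')  # 18 monthly steps ahead
forecast = m.predict(future)                             # yhat, yhat_lower, yhat_upper

# Rolling-origin evaluation: fit up to each cutoff, then forecast one year ahead
cv_results = cross_validation(m, initial='3650 days', period='180 days', horizon='365 days')
print(performance_metrics(cv_results)[['horizon', 'mae', 'rmse']].head())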
Long Short-Term Memory (LSTM) | Deep Learning Neural Network
Long Short-Term Memory (LSTM) was introduced by Sepp Hochreiter and Jurgen Schmidhuber in 1997 as a solution to the vanishing gradient problem in training traditional Recurrent Neural Networks (RNNs). The vanishing gradient problem occurs during training when gradients, computed during the backpropagation, get smaller and smaller as they are passed backwards through time. As a result, the weights in the earlier layers of the network are updated to a lesser extent, hindering the model's learning process.
Traditional RNNs excel in tasks where input and output sequences are synchronous and of the same length, such as labelling each element of a sequence. However, when dealing with tasks that involve longer sequences with possible long-term dependencies, the performance of traditional RNNs significantly degrades, mainly due to the aforementioned vanishing gradient problem.
LSTM was designed to remember information for long periods, thus overcoming the limitations of traditional RNNs, making it more suitable for tasks involving longer sequences, like time series forecasting, text generation, and more.
Cell State | Forget, Input and Output Gates of LSTM Architecture
LSTMs contain information outside the normal flow of the recurrent network in a 'cell state'. Much like computer memory, the cell state stores and recalls information. It can read, write, and erase information, with the amount of information to retain or forget determined by structures called 'gates'.
Gates are a way to optionally let information through. They are composed of a sigmoid activation layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.
An LSTM has three of these gates: a forget gate, which decides what information to discard from the cell state; an input gate, which decides what new information to write into the cell state; and an output gate, which decides what part of the cell state to expose as the hidden state.
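As a rough NumPy sketch of the standard LSTM cell equations (a conceptual illustration only, not the implementation used by Keras later in this article):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W and b are dicts holding the weights/biases of the
    forget ('f'), input ('i'), candidate ('c') and output ('o') layers, each
    acting on the concatenation of the current input and the previous hidden state."""
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(W['f'] @ z + b['f'])    # forget gate: how much of the old cell state to keep
    i_t = sigmoid(W['i'] @ z + b['i'])    # input gate: how much new information to write
    c_hat = np.tanh(W['c'] @ z + b['c'])  # candidate values for the cell state
    c_t = f_t * c_prev + i_t * c_hat      # updated cell state (the LSTM's long-term memory)
    o_t = sigmoid(W['o'] @ z + b['o'])    # output gate: how much of the cell state to expose
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t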
Anatomy of a Neuron: Structure and Function
Neurons are fundamental units of computation in artificial neural networks, including LSTM (Long Short-Term Memory) models used for time series forecasting. Understanding the components of a neuron and their role in LSTM can provide valuable insights into how these models process sequential data.
In LSTM models, the dendrites receive sequential input data, the cell body processes and integrates these inputs, and the axon carries the output signals. This sequential processing capability of LSTM networks makes them well-suited for capturing and analyzing temporal patterns in time series data. By effectively leveraging the interactions between dendrites, the cell body, and the axon, LSTM models can make accurate predictions and forecasts in time series forecasting tasks.
Optimizers in Neural Networks: Adam and RMSprop
The optimizer's role is crucial as it shapes and molds your model into its most accurate possible form by manipulating the weights. The loss function is the guide to the terrain, telling the optimizer when it’s moving in the right or wrong direction.
In terms of when to use which, it usually depends on the specific problem and the nature of the dataset. Adam is often a good starting point as it combines the advantages of other optimizers.
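A minimal Keras sketch of switching between the two optimizers, assuming model is a Sequential network like the one built later in this article (the learning rates shown are the Keras defaults, not tuned values):
from tensorflow.keras.optimizers import Adam, RMSprop

model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
# or, to try RMSprop instead:
# model.compile(optimizer=RMSprop(learning_rate=0.001), loss='mean_squared_error')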
Dropout Rate, Batch Size and Epochs in LSTM Modelling
Dropout is a technique used to prevent overfitting. Dropout works by probabilistically removing, or "dropping out," inputs to a layer, which may be input variables in the data sample or activations from a previous layer. It has the effect of simulating a large number of networks with very different network structures and, in turn, making nodes in the network generally more robust to the inputs.
The batch size determines the number of examples you look at before making a weight update. The lower it is, the noisier the training signal is going to be; the higher it is, the longer it will take to compute the gradient for each step.
An epoch is a term used in machine learning and indicates the number of passes of the entire training dataset the machine learning algorithm has completed. If the batch size is the whole dataset, then each epoch is equivalent to one iteration. The number of epochs is a hyperparameter you can tune. If you set it too small, your model might still have room to improve. If you set it too large, then you're just wasting computational resources because your model has already converged.
Decoding the Correlation: Gold Prices and the Federal Reserve Rate
Gold, a prominent symbol of wealth and prosperity, has been at the heart of global economics for centuries. Its price, which has seen various fluctuations over time, is influenced by a myriad of factors, one being the Federal Reserve (Fed) rate. The association between gold prices and the Fed rate has morphed over time, particularly after the United States decided to sever the link between the dollar and gold.
The U.S. dollar was tied to gold under the gold standard system. In this framework, the dollar's value was directly connected to a specified amount of gold. The gold standard had a critical influence on gold prices, essentially stabilizing them.
However, in 1971, President Richard Nixon made a historic decision to end the gold standard. This event, known as the 'Nixon Shock,' marked the decoupling of the dollar from gold. The primary reason for this shift was to curb the outflow of gold and stabilize the U.S. economy, which was grappling with increasing inflation and unemployment.
The decoupling spurred significant changes in the global economic landscape. For gold, it allowed prices to float freely in the market, influenced by supply and demand. Consequently, gold transitioned from a relatively stable asset to a more volatile one. The event ignited a bullish market for gold that lasted for almost a decade, with gold prices soaring from $35 an ounce to a peak of $850 in 1980.
In times of economic stability and growth, when the Fed raises rates, the returns on bonds and other fixed-income investments tend to rise. This scenario makes gold, a non-interest bearing asset, less attractive, leading to a drop in its price. Conversely, in an economic downturn, when the Fed slashes rates, gold becomes an appealing investment, driving up its price.
However, other factors, such as geopolitical events, inflation expectations, and global market volatility, can also significantly impact gold prices. For instance, even in a high-interest-rate environment, gold prices might surge due to geopolitical tensions, as investors flock to gold as a 'safe-haven' asset.
Although gold prices currently depend on many factors, primarily geopolitical events and economic uncertainty, from a Machine Learning Modeling perspective the single most suitable explanatory parameter is still the Federal Reserve (Fed) rate, ceteris paribus.
Petro-Dollar and WTI Crude Oil Prices
The interplay between WTI crude oil prices and the Dollar Index, an indicator of the U.S. dollar's value against major currencies, carries profound implications in the global energy market. Understanding the historical context and the significance of the petro-dollar system sheds light on the complex relationship between these two influential factors.
The petro-dollar system, born in the 1970s, solidified the link between WTI crude oil prices and the U.S. dollar. Following the oil crisis, major oil-producing nations agreed to price oil exclusively in U.S. dollars, cementing the dollar's dominance as the preferred currency for global oil transactions. This arrangement created a symbiotic relationship, as countries needing oil had to maintain a steady supply of dollars, strengthening the currency's value.
One key consequence of the petro-dollar system is the inverse relationship between WTI crude oil prices and the Dollar Index. When the Dollar Index strengthens, reflecting a robust U.S. economy, WTI crude oil prices tend to face downward pressure. This occurs because a stronger dollar makes oil more expensive for buyers using other currencies, dampening demand and contributing to lower prices.
While the petro-dollar system and the Dollar Index play a significant role in WTI crude oil price movements, it is crucial to acknowledge that other factors influence oil prices as well. Geopolitical events, global supply and demand dynamics, weather disruptions, and production fluctuations are just a few examples of additional factors that can exert independent effects on WTI crude oil prices.
In the realm of Machine Learning Modeling, the Dollar Index remains a vital parameter for analyzing and forecasting WTI crude oil prices. By incorporating the Dollar Index as a feature in predictive models, analysts can enhance their ability to capture the nuanced relationship between these variables and improve the accuracy of oil price predictions.
Gold Price Forecasting 2023-2024 | Python SARIMAX , Facebook Prophet and LSTM Deep Learning Modelling
Python SARIMAX ML Code, Gold Price Forecasting 2023-2024
Let's first visualise the input dataframe
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import ParameterSampler
import matplotlib.cm as cm
import matplotlib.gridspec as gridspec
warnings.filterwarnings('ignore')  # Ignore warning messages
# Load and preprocess the data
df = pd.read_csv('Historical_Data_XAU_USD.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df['Gold Price (oz/$)'] = df['Gold Price (oz/$)'].str.replace(',', '').astype(float)
df['FED Rate'] = df['FED Rate'].astype(float)
# Define a range of hyperparameters to test
param_dist = {"order": [(p, d, q) for p in range(3) for d in [0, 1, 2] for q in range(3)],
? ? "seasonal_order": [(P, D, Q, 12) for P in range(3) for D in [0, 1, 2] for Q in range(3)]}
# Initialize a random parameter sampler
param_list = list(ParameterSampler(param_dist, n_iter=50))
# Define Time Series cross-validator
tscv = TimeSeriesSplit(n_splits=5)
best_aic = np.inf
best_bic = np.inf
best_params = None
# Cross-validation and hyperparameter tuning
for params in param_list:
    aic_scores = []
    bic_scores = []

    # Cross-validation loop
    for train_index, test_index in tscv.split(df):
        cv_train, cv_test = df.iloc[train_index], df.iloc[test_index]

        # Fit SARIMAX model
        sarima = SARIMAX(cv_train['Gold Price (oz/$)'], exog=cv_train['FED Rate'],
                         order=params['order'], seasonal_order=params['seasonal_order'])
        sarima_results = sarima.fit()

        # Save AIC and BIC scores
        aic_scores.append(sarima_results.aic)
        bic_scores.append(sarima_results.bic)

    # Average AIC and BIC scores
    mean_aic = np.mean(aic_scores)
    mean_bic = np.mean(bic_scores)

    # Update best parameters if current model has a lower AIC and BIC
    if mean_aic < best_aic and mean_bic < best_bic:
        best_aic = mean_aic
        best_bic = mean_bic
        best_params = params
# Print the best parameters and the corresponding AIC and BIC
print('Best parameters:', best_params)
print('Best AIC:', best_aic)
print('Best BIC:', best_bic)
# Fit the best model
try:
    best_model = SARIMAX(df['Gold Price (oz/$)'], exog=df['FED Rate'],
                         order=best_params['order'], seasonal_order=best_params['seasonal_order'])
    best_model_fit = best_model.fit()
except np.linalg.LinAlgError:
    print(f"Best model parameters: {best_params} cause a LinAlgError")
# Define future FED Rates
future_fed_rates = [df['FED Rate'].iloc[-1]] + [5.3]*6 + [5.6]*6 + [5.3]*3 + [5.06]*2
# Create forecast object
forecast_object = best_model_fit.get_forecast(steps=18, exog=np.array(future_fed_rates).reshape(-1, 1))
# Extract prediction mean
mean = forecast_object.predicted_mean
# Extract the forecast dates
datescf = pd.date_range(start=df.index[-1], periods=18, freq='MS')
# Predictions on historical data
in_sample_forecast = best_model_fit.get_prediction(start=pd.to_datetime('1970-03-01'),
                                                   end=df.index[-1],
                                                   dynamic=False,
                                                   exog=df['FED Rate'][df.index >= pd.to_datetime('1970-03-01')])
in_sample_mean = in_sample_forecast.predicted_mean
# Confidence intervals for the in-sample data
in_sample_conf_int = in_sample_forecast.conf_int()
# Correct any negative lower confidence levels in the in-sample data
in_sample_conf_int[in_sample_conf_int.iloc[:, 0] < 0] = 0
# Confidence intervals for the out-of-sample data
conf_int = best_model_fit.get_forecast(steps=len(future_fed_rates), exog=future_fed_rates).conf_int()
# Correct any negative lower confidence levels in the out-of-sample data
conf_int[conf_int.iloc[:, 0] < 0] = 0
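The snippet above computes the in-sample and out-of-sample predictions but stops short of plotting them; a minimal plotting sketch using the objects defined above (df, in_sample_mean, mean, datescf, conf_int) might look like the following, with all styling choices purely illustrative:
plt.figure(figsize=(14, 6))
plt.plot(df.index, df['Gold Price (oz/$)'], label='Actual Gold Price')
plt.plot(in_sample_mean.index, in_sample_mean, label='In-sample prediction', alpha=0.7)
plt.plot(datescf, mean, label='Forecast 2023-2024', color='red')
plt.fill_between(datescf, conf_int.iloc[:, 0], conf_int.iloc[:, 1],
                 color='red', alpha=0.2, label='Forecast confidence interval')
plt.title('Gold Price Forecast with SARIMAX')
plt.xlabel('Date')
plt.ylabel('Gold Price (oz/$)')
plt.legend()
plt.show()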
Python SARIMAX ML Result, Gold Price Forecasting 2023-2024
Python FB Prophet Code, Gold Price Forecasting 2023-2024
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import ParameterSampler
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics
import matplotlib.pyplot as plt
from scipy.stats import uniform
df = pd.read_csv('Historical_Data_XAU_USD.csv', index_col='Date', parse_dates=True)
# Reset index to have Date as a column
df.reset_index(inplace=True)
# Convert data types
df['Gold Price (oz/$)'] = df['Gold Price (oz/$)'].str.replace(',', '').astype(float)
df['FED Rate'] = df['FED Rate'].astype(float)
# Make Gold Price stationary
df['Gold Price (oz/$)_diff'] = df['Gold Price (oz/$)'].diff()
df.dropna(inplace=True)
# Create a scaler for Gold Price
price_scaler = MinMaxScaler()
df['Gold Price (oz/$)_diff_scaled'] = price_scaler.fit_transform(df[['Gold Price (oz/$)_diff']])
# Normalize the FED Rate
scaler = MinMaxScaler()
df['FED Rate'] = scaler.fit_transform(df[['FED Rate']])
# Prepare dataframe for Prophet
df_prophet = pd.DataFrame({
    'ds': df['Date'],
    'y': df['Gold Price (oz/$)_diff_scaled'].values,
    'FED Rate': df['FED Rate'].values
})
# Define the parameter space
param_space = {
    'changepoint_prior_scale': uniform(0.01, 0.5),
    'seasonality_prior_scale': uniform(0.1, 10.0),
    'changepoint_range': uniform(0.8, 0.1),
    'seasonality_mode': ['additive', 'multiplicative'],
    'daily_seasonality': [True, False],
    'weekly_seasonality': [True, False],
    'yearly_seasonality': [True, False]
}
# Define number of iterations for random search
n_iter = 50
# Perform random search
tuning_results = []
for params in ParameterSampler(param_space, n_iter=n_iter):
    model = Prophet(**params)
    model.add_regressor('FED Rate')
    model.fit(df_prophet)

    # Perform cross-validation
    cv_results = cross_validation(model, initial='3650 days', period='180 days', horizon='365 days')
    df_p = performance_metrics(cv_results)
    metrics = {}
    try:
        metrics['mape'] = df_p['mape'].mean()
    except KeyError:
        pass
    try:
        metrics['mae'] = df_p['mae'].mean()
    except KeyError:
        pass
    try:
        metrics['rmse'] = df_p['rmse'].mean()
    except KeyError:
        pass
    tuning_results.append({
        'params': params,
        'metrics': metrics,
    })
# Find the best parameters
best_params = min(tuning_results, key=lambda x: x['metrics'].get('mape', float('inf')))['params']
# Print the best parameters
print('Best Parameters:')
print(best_params)
# Use best parameters to predict future values
model = Prophet(**best_params)
model.add_regressor('FED Rate')
model.fit(df_prophet)
# Prepare future data
future_data = pd.DataFrame({
    'ds': pd.date_range(start='2023-08-01', periods=17, freq='M'),
    'FED Rate': [5.3, 5.3, 5.3, 5.3, 5.3, 5.3, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.3, 5.3, 5.3, 5.06, 5.06]})
# Normalize FED Rate
future_data['FED Rate'] = scaler.transform(future_data[['FED Rate']])
# Predict
forecast = model.predict(future_data)
# Denormalize 'yhat'
forecast['yhat'] = price_scaler.inverse_transform(forecast[['yhat']])
forecast['yhat_lower'] = price_scaler.inverse_transform(forecast[['yhat_lower']])
forecast['yhat_upper'] = price_scaler.inverse_transform(forecast[['yhat_upper']])
# Undifference the predictions
last_value = df.iloc[-1]['Gold Price (oz/$)']
forecast['yhat'] = np.r_[last_value, forecast['yhat'].values].cumsum()[1:]
forecast['yhat_lower'] = np.r_[last_value, forecast['yhat_lower'].values].cumsum()[1:]
forecast['yhat_upper'] = np.r_[last_value, forecast['yhat_upper'].values].cumsum()[1:]
# Print predicted gold prices
print("Predicted Gold Prices:")
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
Python FB Prophet Result, Gold Price Forecasting 2023-2024
Python LSTM DL Code, Gold Price Forecasting 2023-2024
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
df = pd.read_csv('Historical_Data_XAU_USD.csv', index_col='Date', parse_dates=True)
# Reset index to have Date as a column
df.reset_index(inplace=True)
# Engineer features from 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
# Convert data types
df['Gold Price (oz/$)'] = df['Gold Price (oz/$)'].str.replace(',', '').astype(float)
df['FED Rate'] = df['FED Rate'].astype(float)
# Create a separate scaler for Gold Price
price_scaler = MinMaxScaler()
df['Gold Price (oz/$)_diff'] = df['Gold Price (oz/$)'].diff()
df.dropna(inplace=True)
df['Gold Price (oz/$)_diff_scaled'] = price_scaler.fit_transform(df[['Gold Price (oz/$)_diff']])
# Normalize the other data
scaler = MinMaxScaler()
df[['Year', 'Month', 'FED Rate']] = scaler.fit_transform(df[['Year', 'Month', 'FED Rate']])
# Prepare the data for LSTM
window_size = 10
features_count = df[['Year', 'Month', 'FED Rate']].shape[1]
X = []
y = []
for i in range(len(df) - window_size):
    X.append(df.iloc[i:i+window_size, [3, 4, 5]].values)
    y.append(df.iloc[i+window_size]['Gold Price (oz/$)_diff_scaled'])
X = np.array(X)
y = np.array(y)
# Split the data into train and test sets
train_size = int(0.8 * len(X))
train_X, test_X = X[:train_size], X[train_size:]
train_y, test_y = y[:train_size], y[train_size:]
# Define the LSTM model
def create_model(neurons=50, optimizer='adam', dropout_rate=0.0):
    model = Sequential()
    model.add(LSTM(neurons, activation='relu', return_sequences=True, input_shape=(window_size, features_count)))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(neurons, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1))
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model
# Create KerasRegressor
model = KerasRegressor(build_fn=create_model, verbose=0)
# Define hyperparameters to tune
param_dist = {
    'neurons': [50, 100, 150, 200],
    'optimizer': ['adam', 'rmsprop'],
    'dropout_rate': [0.0, 0.1, 0.2, 0.3],
    'batch_size': [16, 32, 64, 128],
    'epochs': [10, 20, 30]
}
# Perform random search
random = RandomizedSearchCV(estimator=model, param_distributions=param_dist, scoring='neg_mean_squared_error', n_iter=10)
random_result = random.fit(train_X, train_y)
# Get the best model
best_model = random_result.best_estimator_
best_params = random_result.best_params_
print("Best Model Parameters:")
print(best_params)
# Predictions on test set and MSE calculation
test_y_pred = random_result.predict(test_X)
print('Test MSE:', mean_squared_error(test_y, test_y_pred))
# denormalize the 'FED Rate'
df['FED Rate'] = scaler.inverse_transform(df[['Year', 'Month', 'FED Rate']])[:, 2]
# Prepare future data
future_data = np.array([
    [2023, 8, 5.3],
    [2023, 9, 5.3],
    [2023, 10, 5.3],
    [2023, 11, 5.3],
    [2023, 12, 5.3],
    [2024, 1, 5.6],
    [2024, 2, 5.6],
    [2024, 3, 5.6],
    [2024, 4, 5.6],
    [2024, 5, 5.6],
    [2024, 6, 5.6],
    [2024, 7, 5.3],
    [2024, 8, 5.3],
    [2024, 9, 5.3],
    [2024, 10, 5.06],
    [2024, 11, 5.06],
    [2024, 12, 5.06]
])
from datetime import timedelta
# Assuming that last_date is the last date in your original data
last_date = df['Date'].iloc[-1]
# Generate future dates
future_dates = pd.date_range(last_date, periods=len(future_data[:, 2]) + 1, freq='MS')[1:]
# Normalize future data with the same scaler used for training data
future_data_scaled = scaler.transform(future_data)
# Prepare sequences for prediction
future_X = []
for i in range(len(future_data_scaled)):
    sequence = np.concatenate((df.iloc[-window_size + i:, [3, 4, 5]].values, future_data_scaled[:i + 1, :]), axis=0)
    future_X.append(sequence[-window_size:])
future_X = np.array(future_X)
# Make predictions on the future data
future_y_pred = best_model.predict(future_X)
# Transform future_y_pred to original scale
future_y_pred = price_scaler.inverse_transform(future_y_pred.reshape(-1, 1))
# Undifference the predictions
last_value = df.iloc[-1]['Gold Price (oz/$)']
future_y_pred = np.r_[last_value, future_y_pred.flatten()].cumsum()[1:]
# Print predicted gold prices
print("Predicted Gold Prices:")
print(future_y_pred)
Python LSTM DL Result, Gold Price Forecasting 2023-2024
WTI Crude Oil Price Forecasting 2023-2024 | Python SARIMAX , Facebook Prophet and LSTM Deep Learning Modelling
Python SARIMAX ML Code, Crude Oil Forecasting 2023-2024
Let's first visualise the input dataframe
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import ParameterSampler
import matplotlib.cm as cm
import matplotlib.gridspec as gridspec
warnings.filterwarnings('ignore')  # Ignore warning messages
# Load and preprocess the data
df = pd.read_csv('Crude_Oil_WTI.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Define a range of hyperparameters to test
param_dist = {"order": [(p, d, q) for p in range(3) for d in [0, 1, 2] for q in range(3)],
? ? "seasonal_order": [(P, D, Q, 12) for P in range(3) for D in [0, 1, 2] for Q in range(3)]}
# Initialize a random parameter sampler
param_list = list(ParameterSampler(param_dist, n_iter=50))
# Define Time Series cross-validator
tscv = TimeSeriesSplit(n_splits=5)
best_aic = np.inf
best_bic = np.inf
best_params = None
# Cross-validation and hyperparameter tuning
for params in param_list:
    aic_scores = []
    bic_scores = []

    # Cross-validation loop
    for train_index, test_index in tscv.split(df):
        cv_train, cv_test = df.iloc[train_index], df.iloc[test_index]

        # Fit SARIMAX model
        sarima = SARIMAX(cv_train['Crude_Oil_WTI_($/bbl)'], exog=cv_train['US Dollar Index'],
                         order=params['order'], seasonal_order=params['seasonal_order'])
        sarima_results = sarima.fit()

        # Save AIC and BIC scores
        aic_scores.append(sarima_results.aic)
        bic_scores.append(sarima_results.bic)

    # Average AIC and BIC scores
    mean_aic = np.mean(aic_scores)
    mean_bic = np.mean(bic_scores)

    # Update best parameters if current model has a lower AIC and BIC
    if mean_aic < best_aic and mean_bic < best_bic:
        best_aic = mean_aic
        best_bic = mean_bic
        best_params = params
# Print the best parameters and the corresponding AIC and BIC
print('Best parameters:', best_params)
print('Best AIC:', best_aic)
print('Best BIC:', best_bic)
# Fit the best model
best_model = SARIMAX(df['Crude_Oil_WTI_($/bbl)'], exog=df['US Dollar Index'],
                     order=best_params['order'], seasonal_order=best_params['seasonal_order'])
best_model_fit = best_model.fit()
# Define future US Dollar Index
future_us_dollar_index = [101.64, 103.06, 103.66, 104.33, 103.57, 102.59, 99.42, 99.48, 101.97, 102.85, 103.33, 102.4,
                          101.23, 100.08, 99.78, 98.72, 97.62, 98.34]
# Create forecast object
forecast_object = best_model_fit.get_forecast(steps=18, exog=np.array(future_us_dollar_index).reshape(-1, 1),alpha=0.2)
# Extract prediction mean
mean = forecast_object.predicted_mean
# Extract the forecast dates
datescf = pd.date_range(start=df.index[-1], periods=18, freq='MS')
# Predictions on historical data
in_sample_forecast = best_model_fit.get_prediction(start=pd.to_datetime('1983-05-01'),
                                                   end=df.index[-1],
                                                   dynamic=False,
                                                   exog=df['US Dollar Index'][df.index >= pd.to_datetime('1983-04-01')])
in_sample_mean = in_sample_forecast.predicted_mean
# Replace negative predicted values with 0
in_sample_mean[in_sample_mean < 0] = 0
# Confidence intervals for the in-sample data
in_sample_conf_int = in_sample_forecast.conf_int()
# Correct any negative lower confidence levels in the in-sample data
in_sample_conf_int[in_sample_conf_int.iloc[:, 0] < 0] = 0
# Confidence intervals for the out-of-sample data
conf_int = best_model_fit.get_forecast(steps=len(future_us_dollar_index), exog=future_us_dollar_index).conf_int()
# Correct any negative lower confidence levels in the out-of-sample data
conf_int[conf_int.iloc[:, 0] < 0] = 0
Python SARIMAX ML Result, Crude Oil Forecasting 2023-2024
Python FB Prophet Code, Crude Oil Forecasting 2023-2024
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import ParameterSampler
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics
import matplotlib.pyplot as plt
from scipy.stats import uniform
df = pd.read_csv('Crude_Oil_WTI.csv', index_col='Date', parse_dates=True)
# Reset index to have Date as a column
df.reset_index(inplace=True)
# Make Crude Oil Price stationary
df['Crude_Oil_WTI_($/bbl)_diff'] = df['Crude_Oil_WTI_($/bbl)'].diff()
df.dropna(inplace=True)
# Create a scaler for Crude Oil Price
price_scaler = MinMaxScaler()
df['Crude_Oil_WTI_($/bbl)_diff_scaled'] = price_scaler.fit_transform(df[['Crude_Oil_WTI_($/bbl)_diff']])
# Normalize the US Dollar Index
scaler = MinMaxScaler()
df['US Dollar Index'] = scaler.fit_transform(df[['US Dollar Index']])
# Prepare dataframe for Prophet
df_prophet = pd.DataFrame({
    'ds': df['Date'],
    'y': df['Crude_Oil_WTI_($/bbl)_diff_scaled'].values,
    'US Dollar Index': df['US Dollar Index'].values
})
# Define the parameter space
param_space = {
    'changepoint_prior_scale': uniform(0.01, 0.5),
    'seasonality_prior_scale': uniform(0.1, 10.0),
    'changepoint_range': uniform(0.8, 0.1),
    'seasonality_mode': ['additive', 'multiplicative'],
    'daily_seasonality': [True, False],
    'weekly_seasonality': [True, False],
    'yearly_seasonality': [True, False]
}
# Define number of iterations for random search
n_iter = 50
# Perform random search
tuning_results = []
for params in ParameterSampler(param_space, n_iter=n_iter):
    model = Prophet(**params)
    model.add_regressor('US Dollar Index')
    model.fit(df_prophet)

    # Perform cross-validation
    cv_results = cross_validation(model, initial='3650 days', period='180 days', horizon='365 days')
    df_p = performance_metrics(cv_results)
    metrics = {}
    try:
        metrics['mape'] = df_p['mape'].mean()
    except KeyError:
        pass
    try:
        metrics['mae'] = df_p['mae'].mean()
    except KeyError:
        pass
    try:
        metrics['rmse'] = df_p['rmse'].mean()
    except KeyError:
        pass
    tuning_results.append({
        'params': params,
        'metrics': metrics,
    })
# Find the best parameters
best_params = min(tuning_results, key=lambda x: x['metrics'].get('mape', float('inf')))['params']
# Print the best parameters
print('Best Parameters:')
print(best_params)
# Use best parameters to predict future values
model = Prophet(**best_params)
model.add_regressor('US Dollar Index')
model.fit(df_prophet)
# Prepare future data
future_data = pd.DataFrame({
    'ds': pd.date_range(start='2023-08-01', periods=17, freq='M'),
    'US Dollar Index': [103.06, 103.66, 104.33, 103.57, 102.59, 99.42, 99.48, 101.97, 102.85, 103.33, 102.4,
                        101.23, 100.08, 99.78, 98.72, 97.62, 98.34]})
# Normalize US Dollar Index
future_data['US Dollar Index'] = scaler.transform(future_data[['US Dollar Index']])
# Predict
forecast = model.predict(future_data)
# Denormalize 'yhat'
forecast['yhat'] = price_scaler.inverse_transform(forecast[['yhat']])
forecast['yhat_lower'] = price_scaler.inverse_transform(forecast[['yhat_lower']])
forecast['yhat_upper'] = price_scaler.inverse_transform(forecast[['yhat_upper']])
# Undifference the predictions
last_value = df.iloc[-1]['Crude_Oil_WTI_($/bbl)']
forecast['yhat'] = np.r_[last_value, forecast['yhat'].values].cumsum()[1:]
forecast['yhat_lower'] = np.r_[last_value, forecast['yhat_lower'].values].cumsum()[1:]
forecast['yhat_upper'] = np.r_[last_value, forecast['yhat_upper'].values].cumsum()[1:]
# Print predicted Crude Oil prices
print("Predicted Crude Oil Prices:")
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
Python FB Prophet Result, Crude Oil Forecasting 2023-2024
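As in the SARIMAX section, the result chart is embedded as an image. A minimal plotting sketch (not in the original article), assuming the df and forecast dataframes produced by the code above and the matplotlib.pyplot import already made there, could be:
fig, ax = plt.subplots(figsize=(14, 6))
# Observed prices
ax.plot(df['Date'], df['Crude_Oil_WTI_($/bbl)'], label='Observed WTI price')
# Undifferenced Prophet forecast with its uncertainty band
ax.plot(forecast['ds'], forecast['yhat'], label='Prophet forecast 2023-2024', color='green')
ax.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'],
                color='green', alpha=0.2)
ax.set_xlabel('Date')
ax.set_ylabel('Crude Oil WTI ($/bbl)')
ax.legend()
plt.show()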
Python LSTM DL Code, Crude Oil Forecasting 2023-2024
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor  # note: removed in newer TensorFlow releases; the scikeras package provides a compatible KerasRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
df = pd.read_csv('Crude_Oil_WTI.csv', index_col='Date', parse_dates=True)
# Reset index to have Date as a column
df.reset_index(inplace=True)
# Engineer features from 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
# Create a separate scaler for Crude Oil Price
price_scaler = MinMaxScaler()
df['Crude_Oil_WTI_($/bbl)_diff'] = df['Crude_Oil_WTI_($/bbl)'].diff()
df.dropna(inplace=True)
df['Crude_Oil_WTI_($/bbl)_diff_scaled'] = price_scaler.fit_transform(df[['Crude_Oil_WTI_($/bbl)_diff']])
# Normalize the other data
scaler = MinMaxScaler()
df[['Year', 'Month', 'US Dollar Index']] = scaler.fit_transform(df[['Year', 'Month', 'US Dollar Index']])
# Prepare the data for LSTM
window_size = 10
features_count = df[['Year', 'Month', 'US Dollar Index']].shape[1]
X = []
y = []
for i in range(len(df) - window_size):
    # Select the feature columns by name; positional indices [3, 4, 5] would pick up the diff column instead of the US Dollar Index
    X.append(df[['Year', 'Month', 'US Dollar Index']].iloc[i:i+window_size].values)
    y.append(df.iloc[i+window_size]['Crude_Oil_WTI_($/bbl)_diff_scaled'])
X = np.array(X)
y = np.array(y)
# Split the data into train and test sets
train_size = int(0.8 * len(X))
train_X, test_X = X[:train_size], X[train_size:]
train_y, test_y = y[:train_size], y[train_size:]
# Define the LSTM model
def create_model(neurons=50, optimizer='adam', dropout_rate=0.0):
    model = Sequential()
    model.add(LSTM(neurons, activation='relu', return_sequences=True, input_shape=(window_size, features_count)))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(neurons, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1))
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model
# Create KerasRegressor
model = KerasRegressor(build_fn=create_model, verbose=0)
# Define hyperparameters to tune
param_dist = {
    'neurons': [50, 100, 150, 200],
    'optimizer': ['adam', 'rmsprop'],
    'dropout_rate': [0.0, 0.1, 0.2, 0.3],
    'batch_size': [16, 32, 64, 128],
    'epochs': [10, 20, 30]
}
# Perform random search
random = RandomizedSearchCV(estimator=model, param_distributions=param_dist, scoring='neg_mean_squared_error', n_iter=10)
random_result = random.fit(train_X, train_y)
# Get the best model
best_model = random_result.best_estimator_
best_params = random_result.best_params_
print("Best Model Parameters:")
print(best_params)
# Predictions on test set and MSE calculation
test_y_pred = random_result.predict(test_X)
print('Test MSE:', mean_squared_error(test_y, test_y_pred))
# Keep a denormalized copy of the US Dollar Index for reference; the scaled column is still needed to build the prediction sequences below
df['US Dollar Index (raw)'] = scaler.inverse_transform(df[['Year', 'Month', 'US Dollar Index']])[:, 2]
# Prepare future data (Year, Month, placeholder third column that is replaced with the US Dollar Index below)
future_data = np.array([
    [2023, 8, 5.3],
    [2023, 9, 5.3],
    [2023, 10, 5.3],
    [2023, 11, 5.3],
    [2023, 12, 5.3],
    [2024, 1, 5.6],
    [2024, 2, 5.6],
    [2024, 3, 5.6],
    [2024, 4, 5.6],
    [2024, 5, 5.6],
    [2024, 6, 5.6],
    [2024, 7, 5.3],
    [2024, 8, 5.3],
    [2024, 9, 5.3],
    [2024, 10, 5.06],
    [2024, 11, 5.06],
    [2024, 12, 5.06]
])
future_us_dollar_index = np.array([103.06, 103.66, 104.33, 103.57, 102.59, 99.42, 99.48, 101.97, 102.85, 103.33, 102.4,
                                   101.23, 100.08, 99.78, 98.72, 97.62, 98.34])
# Replace the third column with the future_us_dollar_index
future_data[:, 2] = future_us_dollar_index
from datetime import timedelta
# Assuming that last_date is the last date in your original data
last_date = df['Date'].iloc[-1]
# Generate future dates
future_dates = pd.date_range(last_date, periods=len(future_data[:, 2]) + 1, freq='MS')[1:]
# Normalize future data with the same scaler used for training data
future_data_scaled = scaler.transform(future_data)
# Prepare sequences for prediction
future_X = []
for i in range(len(future_data_scaled)):
    # Use the named (still scaled) feature columns so historical and future rows are on the same scale
    sequence = np.concatenate((df[['Year', 'Month', 'US Dollar Index']].iloc[-window_size + i:].values,
                               future_data_scaled[:i + 1, :]), axis=0)
    future_X.append(sequence[-window_size:])
future_X = np.array(future_X)
# Make predictions on the future data
future_y_pred = best_model.predict(future_X)
# Transform future_y_pred to original scale
future_y_pred = price_scaler.inverse_transform(future_y_pred.reshape(-1, 1))
# Undifference the predictions
last_value = df.iloc[-1]['Crude_Oil_WTI_($/bbl)']
future_y_pred = np.r_[last_value, future_y_pred.flatten()].cumsum()[1:]
# Print predicted Crude Oil prices
print("Predicted Crude Oil Prices:")
print(future_y_pred)
# Concatenate the historical and predicted dates
total_dates = np.concatenate((df['Date'].values, future_dates))
Python LSTM DL Result, Crude Oil Forecasting 2023-2024
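The LSTM result is likewise shown as an image, and total_dates computed at the end of the code above gives the natural x-axis range for such a chart. A minimal sketch (not in the original article), assuming plt, df, future_dates, future_y_pred, and total_dates from the code above:
fig, ax = plt.subplots(figsize=(14, 6))
# Observed prices
ax.plot(df['Date'], df['Crude_Oil_WTI_($/bbl)'], label='Observed WTI price')
# Undifferenced LSTM predictions for 2023-2024
ax.plot(future_dates, future_y_pred, label='LSTM forecast 2023-2024', color='orange')
# Show the full historical plus forecast horizon
ax.set_xlim(pd.Timestamp(total_dates[0]), pd.Timestamp(total_dates[-1]))
ax.set_xlabel('Date')
ax.set_ylabel('Crude Oil WTI ($/bbl)')
ax.legend()
plt.show()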
Author Notes: Beyond the Text