Time Series Episode 2: What happens with strong seasonality
Vasilis Kalyvas
Senior Data Scientist at Coca-Cola HBC | AI/ML online articles & tutorials
Introduction
Hi there! Happy to see you again in this series of articles, where we discuss Time Series theory and examples.
In the first article (Episode 0) I presented a brief theory of the ARIMA-family models for forecasting, as well as some general guidelines on how to select correct parameters for these types of models, based on my experience from multiple projects so far. Have a look here to remind yourself of a bit of theory and tips-n-tricks:
Time Series Episode 0: Familiarize with ARIMA and select parameters | by Vasilis Kalyvas | Nov, 2023 | Python in Plain English (medium.com)
In the second article (Episode 1) I presented a working example in Python and put our theory and tips in action with a real world dataset. Have a look here, as well:
Time Series Episode 1: How to select the correct SARIMA parameters | by Vasilis Kalyvas | Nov, 2023 | Python in Plain English (medium.com)
In this article we will work together on an example with the “monthly-sunspots” dataset, which you may have encountered before in tutorials.
Once again, I will show you how to interpret statistical tests and ACF/PACF plots to identify the correct seasonal parameters.
Let’s start!
Step-by-Step Working Example
In order to follow my steps, the dataset is here:
Generally, Sunspots are “phenomena on the Sun’s photosphere that appear as temporary spots that are darker than the surrounding areas”. You can read more here.
Bring and inspect the data:
We have monthly number of sunspots (2,820 datapoints in total), starting from January 1749 until December 1983.
First, we should convert the “Month” variable to datetime and set it as the index, so that the dataset’s frequency can be inferred:
# import pandas, convert Month to datetime and set as index
import pandas as pd
df['Month'] = pd.to_datetime(df['Month'])
df = df.set_index('Month')
df = df.asfreq(pd.infer_freq(df.index))
Now let’s view the data:
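The view itself is a plain line plot. Here is a minimal, self-contained sketch of that call; a short synthetic series stands in for the real sunspots data, and the `Agg` backend and output filename are assumptions for running it headless:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt

# synthetic stand-in for the sunspots series (the real one has 2,820 monthly rows)
idx = pd.date_range('1749-01-01', periods=240, freq='MS')
values = 60 + 50 * np.sin(2 * np.pi * np.arange(240) / 120)  # rough 120-month cycle
df = pd.DataFrame({'Sunspots': values}, index=idx)

# plot the series against its datetime index
plt.figure(figsize=(16, 4))
plt.plot(df['Sunspots'])
plt.title('Monthly sunspots')
plt.xlabel('Month')
plt.ylabel('Sunspots')
plt.savefig('sunspots.png')
```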
There is no clear upward or downward trend. However, there is a clear seasonality nearly every 10 years. Let’s follow the guide in my previous article and find out more in detail:
Step 1: Examine the stationarity.
ARIMA models work best with stationary data. This is because these models consider that the statistical properties of the time series do not change over time.
A Time Series is stationary when it has:
· a constant mean over time,
· a constant variance over time,
· an autocorrelation structure that does not depend on time.
We saw from Image 1 that there is no clear trend, so the Time Series might not need differencing (i.e. subtracting from every value its previous value or its seasonal lag).
We can confirm this with a popular statistical test, the Augmented Dickey-Fuller (ADF) Test. Its null hypothesis is that the series is non-stationary; if the p-value is below 0.05, we reject the null hypothesis and treat the series as stationary.
# import the ADF test
from statsmodels.tsa.stattools import adfuller
# create a function that returns the necessary metrics to test stationarity
def test_stationarity(timeseries):
    dftest_initial = adfuller(timeseries)
    dfoutput_initial = pd.Series(dftest_initial[0:4],
                                 index=['Statistical Test',
                                        'p-value',
                                        '#Lags used',
                                        'Number of observations'
                                        ])
    for key, value in dftest_initial[4].items():
        dfoutput_initial['Critical value ' + key] = value
    print(dfoutput_initial)
    print('\n')
Let’s test the stationarity:
We see that the p-value is below the 0.05 threshold. It seems the Time Series is already stationary, which also matches the regular seasonal ups and downs in Image 2.
Let’s now decompose the original Time Series and view all the components:
# import the necessary modules
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# decompose the time series into its trend, seasonal and residuals components
result_decompose = seasonal_decompose(df, model='additive')
trend = result_decompose.trend
seasonal = result_decompose.seasonal
residuals = result_decompose.resid
# plot every component
plt.figure(figsize=(20,10))
plt.subplot(311)
plt.plot(trend)
plt.title('trend')
plt.subplot(312)
plt.plot(seasonal)
plt.title('seasonality')
plt.subplot(313)
plt.plot(residuals)
plt.title('residuals')
The trend component oscillates, which is consistent with the stationarity and the seasonality of the Time Series.
Look closely and you can also see that the pattern is repeated every 10 years (for example, you can count 4 peaks between 1840 and 1880), i.e. every 120 months.
We will try to predict the last 12 months based on all previous data. Then we will compare actual values against our predictions. Let’s define our train and test sets:
testing_timeframe = 12
train = df[:-testing_timeframe]
test = df[-testing_timeframe:]
print('training set (past data): ', len(train))
print('test set (months to be forecasted ahead): ', len(test))
training set (past data): 2808
test set (months to be forecasted ahead): 12
Step 2: Calculate the ACF & PACF plots on the stationary data
# import necessary modules
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# define function that returns the ACF and PACF plots for a given time series
def autocorrelation_plots(timeseries, description, n_lags):
    plt.figure(figsize=(15,5))
    plt.subplot(121)
    plot_acf(timeseries, ax=plt.gca(), lags=n_lags)
    plt.title('Autocorrelation ({})'.format(description))
    plt.xlabel('Number of lags')
    plt.ylabel('correlation')
    plt.subplot(122)
    plot_pacf(timeseries, ax=plt.gca(), lags=n_lags)
    plt.title('Partial Autocorrelation ({})'.format(description))
    plt.xlabel('Number of lags')
    plt.ylabel('correlation')
    plt.show()
Now we see something that contradicts our initial thoughts. The ACF plot of the original train data has a sinusoidal form, while the PACF cuts off sharply, and that combination is actually a way to identify non-stationarity.
But the ADF test said that the Time Series is stationary! Something is off.
Well, yes and no. When you look at the Time Series from a more general, high-level view, it seems stationary, with a pattern repeated every 120 months. But when you look at it over shorter horizons, it can be regarded as non-stationary.
Since we are trying to forecast the last 12 months (1 year), we can indeed focus on the shorter-horizon view, so let's treat the Time Series as non-stationary and take the first difference.
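The differencing itself is a one-liner in pandas; here is a tiny self-contained sketch with made-up numbers (on the real data you would call `train.diff().dropna()` and pass the result to `autocorrelation_plots`):

```python
import pandas as pd

# tiny synthetic stand-in for the training series
idx = pd.date_range('1749-01-01', periods=6, freq='MS')
train = pd.DataFrame({'Sunspots': [10.0, 12.0, 15.0, 11.0, 9.0, 14.0]}, index=idx)

# first difference: each value minus the previous one; the first row becomes NaN and is dropped
train_diff = train.diff().dropna()
print(train_diff['Sunspots'].tolist())  # [2.0, 3.0, -4.0, -2.0, 5.0]
```

Note that the differenced series is one observation shorter than the original.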
Step 3: Identify the order of AR and MA components
We learnt that:
· p = last lag where the PACF value is out of the significance band (displayed by the confidence interval).
· q = last lag where the ACF value is out of the significance band (displayed by the confidence interval).
Apply these rules to the 1st-order differenced Time Series:
So:
→ p = 4 (from PACF plot)
→ d = 1 (because we differenced the Time Series only once)
→ q = 2 (from ACF plot)
Now let’s train an ARIMA model.
# import the necessary modules
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
# create and fit the model
model_fit = ARIMA(train,
order = (4,1,2)
).fit()
print(model_fit.summary())
print('\n')
#####################################################################################
# create forecasts on training set (to evaluate how the model behaves to known-training data)
forecasts_on_train = model_fit.predict()
# create forecasts on test set (to evaluate how the model behaves to unknown-test data)
forecasts_on_test = model_fit.forecast(len(test))
# calculate the root mean squared error on the test set
RMSE = np.sqrt(mean_squared_error(test['Sunspots'], forecasts_on_test))
# print the AIC and RMSE
print('AIC: ' , model_fit.aic)
print('RMSE: ', RMSE)
#####################################################################################
# plot the train and test data against their corresponding forecasts
# on train data
plt.figure(figsize=(16,4))
plt.plot(train['Sunspots'], label="Actual")
plt.plot(forecasts_on_train, label="Predicted")
plt.legend()
# on test data
plt.figure(figsize=(16,4))
plt.plot(test['Sunspots'], label="Actual")
plt.plot(forecasts_on_test, label="Predicted")
plt.legend()
Results:
In the above Image, note that:
· ar.L1 → 1st lag of PACF (“ar” from “autoregressive”)
· ar.L2 → 2nd lag of PACF
· ma.L1 → 1st lag of ACF (“ma” from “moving average”)
and similarly for the other components.
Forecasts against Training/Test sets:
The forecasts on the train data are acceptable, although they mostly look like a lagged copy of the Time Series. The forecasts on the test set, however, are poor: they form almost a straight line, with an average error of around 42 sunspots per month (i.e. the RMSE).
We want reasonable forecasts on the test set, and we want to minimize the AIC and RMSE.
Let's consider this model, ARIMA(4,1,2), as our baseline and re-iterate with new parameters.
Re-iteration
We can see in the PACF of Image 6 that, apart from the first few lags, the others do not have high correlation coefficients.
Maybe we can try with p = 2 and q = 2 instead:
Again, the forecasts on the test set are practically a straight line, so no patterns are captured. The AIC and RMSE have not decreased either.
It seems that trying different values of p, d, q won't give a drastic improvement. For simplicity, let's keep the last model, ARIMA(2,1,2), and experiment with the seasonal components instead.
Step 4: Identify the seasonal components of the model
There is a strong indication that seasonality exists, mostly by looking at Image 4 (the seasonality component), which shows a stable repeating pattern. According to our previous theoretical tutorial: “D=1 if the series has a stable seasonal pattern over time”. As such, we select D = 1.
The data is on a monthly level, and according to our theoretical tutorial: “S is equal to the ACF lag with the highest value (typically at a high lag)”. Looking at Image 6, we see a significant spike at lag 12.
All these indicate that S = 12.
Let's now look at what we said about P and Q (remember, on the differenced Time Series!):
· P≥1 if the ACF is positive at lag S, else P=0.
· Q≥1 if the ACF is negative at lag S, else Q=0.
· Rule of thumb: P+Q≤2
These rules make us select P = 1 and Q = 0, so let's try the following model: SARIMA(2,1,2)(1,1,0,12)
# import the necessary modules
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
# create and fit the model
model_fit = ARIMA(train,
order = (2,1,2)
, seasonal_order = (1,1,0,12)
).fit()
print(model_fit.summary())
print('\n')
#####################################################################################
# create forecasts on training set (to evaluate how the model behaves to known-training data)
forecasts_on_train = model_fit.predict()
# create forecasts on test set (to evaluate how the model behaves to unknown-test data)
forecasts_on_test = model_fit.forecast(len(test))
# calculate the root mean squared error on the test set
RMSE = np.sqrt(mean_squared_error(test['Sunspots'], forecasts_on_test))
# print the AIC and RMSE
print('AIC: ' , model_fit.aic)
print('RMSE: ', RMSE)
#####################################################################################
# plot the train and test data against their corresponding forecasts
# on train data
plt.figure(figsize=(16,4))
plt.plot(train['Sunspots'], label="Actual")
plt.plot(forecasts_on_train, label="Predicted")
plt.legend()
# on test data
plt.figure(figsize=(16,4))
plt.plot(test['Sunspots'], label="Actual")
plt.plot(forecasts_on_test, label="Predicted")
plt.legend()
Note that the only thing that changes is the “seasonal_order” parameters in the model training.
Results:
Forecasts on Train and Test set:
We decreased the RMSE, and the forecasts are no longer a straight line.
However, the results are still not acceptable. It seems we accounted for the seasonality component, but in the wrong way. What could have gone wrong?
Let’s look closely at Image 6 again:
When trying to identify the seasonal parameters from the AFC/PACF plots, we made another point in our initial theoretical article:
P and Q can be identified when there is a significant spike at lag S in the PACF and the ACF, respectively.
We can see in the above image that, although we have monthly data indicating S = 12, the furthest significant lag is at 15. That means there might be a seasonality of 15 months in the dataset.
This may well be true, especially when we look at the decomposition plots in Image 4, where it is not entirely clear that the seasonality repeats every 12 months.
Let’s try this aspect as well, with the model SARIMA(2,1,2)(1,1,0,15) :
Now that’s for sure an improvement!
We decreased the RMSE by 11 sunspots per month and the forecasts now seem to follow the actual values. It seems that thinking a bit out of the box with the seasonal parameters did the trick.
All the parameters seem to be statistically significant, as well, by looking at the Results table.
Feel free to experiment more and improve the model, but we will stay with this for now.
Keep in mind that ARIMA models tend to perform poorly for larger horizons ahead. The 12 months in our example might be “too much”.
Step 5: Model Evaluation
As a last step, let’s create the Diagnostics Plot for our best model so far, SARIMA(2,1,2)(1,1,0,15):
model_fit.plot_diagnostics(figsize=(14,7))
plt.show()
The “plot_diagnostics” figure contains four plots: the standardized residuals over time, their histogram with an estimated density, a normal Q-Q plot, and a correlogram of the residuals.
Our model is good. Well done!
Conclusion
In this article we worked on an example of how to actively and manually select parameters for your (S)ARIMA model. We followed guidelines from my previous articles and discovered that they can be successfully applied in order to forecast an unknown Time Series.
What’s more, we discovered that sometimes we need to think out of the box and be more creative (and observative, at the same time) with the model’s parameters. A simple detail, like the S, might guide you to the solution.
Please keep in mind that these guidelines usually serve as a starting point in your forecasting project. You can (and should) re-iterate and try more combinations of parameters to achieve the best results.
In the next article, we will focus on a more difficult example with exogenous variables. Have a look!
You can find the Jupyter notebook here: https://www.kaggle.com/code/billykal/notebookfa7b71247e
You can also read my original article published on Medium by In Plain English:
Time Series Episode 2: What happens with strong seasonality | by Vasilis Kalyvas | Dec, 2023 | Python in Plain English (medium.com)
Thanks for reading!