Time Series Episode 2: What happens with strong seasonality

Introduction

Hi there! Happy to see you again in this series of articles, where we discuss Time Series theory and examples.

In the first article (Episode 0) I presented a brief theory of the ARIMA family of forecasting models, as well as some general guidelines on how to select the correct parameters for these models, based on my experience from multiple projects so far. Have a look here to refresh the theory and tips-n-tricks:

on LinkedIn:

https://www.dhirubhai.net/pulse/time-series-episode-0-familiarize-arima-its-vasilis-kalyvas-sbw1f/?trackingId=ioYylXKcTEKDpmyN039zPQ%3D%3D

or on Medium:

Time Series Episode 0: Familiarize with ARIMA and select parameters | by Vasilis Kalyvas | Nov, 2023 | Python in Plain English (medium.com)


In the second article (Episode 1) I presented a working example in Python and put our theory and tips in action with a real world dataset. Have a look here, as well:

on LinkedIn:

https://www.dhirubhai.net/pulse/time-series-episode-1-how-select-correct-sarima-vasilis-kalyvas-jqcjf/?trackingId=415ekgQ6TMaMdXPJG9Xv6w%3D%3D

or on Medium:

Time Series Episode 1: How to select the correct SARIMA parameters | by Vasilis Kalyvas | Nov, 2023 | Python in Plain English (medium.com)


In this article we will work together on an example with the “monthly-sunspots” dataset, which you may have encountered before in tutorials.

Once again, I will show you how to interpret statistical tests and ACF/PACF plots to identify the correct seasonal parameters.

Let’s start!


Step-by-Step Working Example

To follow along with my steps, you can find the dataset here:

https://www.kaggle.com/datasets/billykal/monthly-sunspots

Generally, Sunspots are “phenomena on the Sun’s photosphere that appear as temporary spots that are darker than the surrounding areas”. You can read more here.

Load and inspect the data:

Image 1 — Sample data

We have the monthly number of sunspots (2,820 datapoints in total), from January 1749 to December 1983.

First, we should convert the “Month” variable to datetime, set it as the index, and infer the dataset’s frequency:

# import pandas, convert month to datetime, and set it as index
import pandas as pd

df['Month'] = pd.to_datetime(df['Month'])
df = df.set_index('Month')
df = df.asfreq(pd.infer_freq(df.index))        

Now let’s view the data:

Image 2 — Time Series evolution

There is no clear upward or downward trend. However, there is clear seasonality, with a cycle of roughly 10 years. Let’s follow the guide from my previous article and investigate in more detail:

Step 1: Examine the stationarity.

ARIMA models work best with stationary data. This is because these models assume that the statistical properties of the time series do not change over time.

A Time Series is stationary when it has:

  • Constant Mean: The mean of the time series is constant over time.
  • Constant Variance: The variance of the time series is constant over time.
  • Constant Autocorrelation: The autocorrelation between the values of the time series at different time lags is constant.
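Before the formal test, a quick visual check is to compute rolling statistics: for a stationary series, the rolling mean and standard deviation stay roughly flat over time. A minimal sketch on a synthetic monthly series (in the notebook you would pass df['Sunspots'] instead):

```python
import numpy as np
import pandas as pd

# synthetic monthly series with a ~120-month cycle (stand-in for the sunspots data)
idx = pd.date_range("1749-01-01", periods=600, freq="MS")
ts = pd.Series(50 + 40 * np.sin(2 * np.pi * np.arange(600) / 120), index=idx)

# rolling mean and standard deviation over a 12-month window;
# for a stationary series both should stay roughly flat over time
rolling_mean = ts.rolling(window=12).mean()
rolling_std = ts.rolling(window=12).std()

print(rolling_mean.dropna().head())
print(rolling_std.dropna().head())
```

Plotting these two curves next to the original series (with matplotlib, for instance) makes drifts in level or variance easy to spot.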

We saw from Image 2 that there is no clear trend, so the Time Series might not need differencing (i.e. subtracting from each value its previous value or its seasonal lag).

We can confirm this with a popular statistical test, named Augmented Dickey-Fuller (ADF) Test, according to which:

  • Null Hypothesis (H0): The Time Series has a unit root (non-stationary).
  • Alternative Hypothesis (H1): The Time Series is stationary.
  • If the p-value is less than a chosen significance level (e.g. 0.05), you reject the null hypothesis and consider that Time Series is stationary.

# import pandas and the ADF test
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# create a function that returns the necessary metrics to test stationarity
def test_stationarity(timeseries):
    dftest_initial = adfuller(timeseries)
    dfoutput_initial = pd.Series(dftest_initial[0:4], 
          index=['Statistical Test', 
                 'p-value', 
                 '#Lags used', 
                 'Number of observations'
                 ])
    for key, value in dftest_initial[4].items():
        dfoutput_initial['Critical value ' + key] = value
    print(dfoutput_initial)
    print('\n')        

Let’s test the stationarity:

Image 3 — Stationarity

We see that:

  1. the original values of sunspots have a p-value of 2.333452e-16 ≪ 0.05 → stationary (that’s a very small number, with many zeros before 2.333452!)
  2. the first-order differenced (lagged) values of sunspots also have a p-value far below 0.05 → stationary (again a very small number).

It seems that the Time Series is already stationary, which is consistent with the regular seasonal ups and downs in Image 2.

Let’s now decompose the original Time Series and view all the components:

# import the necessary modules
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# decompose the time series into its trend, seasonal and residuals components
result_decompose = seasonal_decompose(df, model='additive')
trend     = result_decompose.trend
seasonal  = result_decompose.seasonal
residuals = result_decompose.resid
# plot every component
plt.figure(figsize=(20,10))
plt.subplot(311)
plt.plot(trend)
plt.title('trend')
plt.subplot(312)
plt.plot(seasonal)
plt.title('seasonality')
plt.subplot(313)
plt.plot(residuals)
plt.title('residuals')        
Image 4 — Time Series Decomposition

The trend component oscillates around a constant level, which is consistent with the stationarity and the seasonality of the Time Series.

Look closely and you can also see that the pattern repeats roughly every 10 years (for example, you can see 4 peaks between 1840 and 1880). That means every 120 months.

We will try to predict the last 12 months based on all previous data. Then we will compare actual values against our predictions. Let’s define our train and test sets:

testing_timeframe = 12

train = df[:-testing_timeframe]
test  = df[-testing_timeframe:]
print('training set (past data): ', len(train))
print('test set (months to be forecasted ahead): ', len(test))        
training set (past data):  2808
test set (months to be forecasted ahead):  12        


Step 2: Calculate the ACF & PACF plots on the stationary data

# import necessary modules
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# define function that returns the ACF and PACF plots for a given time series
def autocorrelation_plots(timeseries, description, n_lags):
    plt.figure(figsize=(15,5))
    plt.subplot(121)
    plot_acf(timeseries, ax=plt.gca(), lags=n_lags)
    plt.title('Autocorrelation ({})'.format(description))
    plt.xlabel('Number of lags')
    plt.ylabel('correlation')
    plt.subplot(122)
    plot_pacf(timeseries, ax=plt.gca(), lags=n_lags)
    plt.title('Partial Autocorrelation ({})'.format(description))
    plt.xlabel('Number of lags')
    plt.ylabel('correlation')
    plt.show()        
Image 5 — ACF/PACF Plots

Now we see something that contradicts our initial conclusion. The ACF plot of the original train data has a slowly decaying, sinusoidal form, while the PACF cuts off sharply, which is actually a classic way to identify non-stationarity.

But the ADF test said that the Time Series is stationary! Something is off.

Well, yes and no. When you look at the Time Series from a more general, high-level view, it seems stationary, with a pattern repeating every 120 months. But when you look at it over shorter horizons, it can be regarded as non-stationary.

Since we are forecasting the last 12 months (1 year), we can indeed focus on the shorter-horizon view, so let’s treat the Time Series as non-stationary and take the first difference.
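Taking the first difference is a one-liner in pandas; the only subtlety is dropping the NaN that appears at the first position. For example, on a tiny illustrative series:

```python
import pandas as pd

# tiny illustrative series (not the real sunspot numbers)
s = pd.Series([40.0, 62.9, 85.9, 61.2, 45.1, 43.1])

# first-order difference: each value minus the previous one;
# .dropna() removes the NaN created at the first position
s_diff = s.diff().dropna()
print(s_diff.tolist())
```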


Step 3: Identify the order of AR and MA components

We learnt that:

· p = last lag where the PACF value is out of the significance band (displayed by the confidence interval).

· q = last lag where the ACF value is out of the significance band (displayed by the confidence interval).

Apply these rules to the 1st-order differenced Time Series:

Image 6 — ACF/PACF Plots of 1st-order differenced

So:

p = 4 (from PACF plot)

d = 1 (because we differenced the Time Series only once)

q = 2 (from ACF plot)

Now let’s train an ARIMA model.

# import the necessary modules
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# create and fit the model
model_fit = ARIMA(train,  
                  order = (4,1,2)
                  ).fit()
print(model_fit.summary())
print('\n')

#####################################################################################

# create forecasts on training set (to evaluate how the model behaves to known-training data)
forecasts_on_train = model_fit.predict()
# create forecasts on test set (to evaluate how the model behaves to unknown-test data)
forecasts_on_test  = model_fit.forecast(len(test))
# calculate the root mean squared error on the test set
RMSE = np.sqrt(mean_squared_error(test['Sunspots'], forecasts_on_test))
# print the AIC and RMSE 
print('AIC: ' , model_fit.aic)
print('RMSE: ', RMSE)

#####################################################################################

# plot the train and test data against their corresponding forecasts
# on train data
plt.figure(figsize=(16,4))
plt.plot(train['Sunspots'], label="Actual")
plt.plot(forecasts_on_train, label="Predicted")
plt.legend()
# on test data
plt.figure(figsize=(16,4))
plt.plot(test['Sunspots'], label="Actual")
plt.plot(forecasts_on_test, label="Predicted")
plt.legend()        

Results:

Image 7 — Results of 1st ARIMA model (4,1,2)

In the above Image, note that:

· ar.L1 → coefficient of the 1st autoregressive lag (“ar” from “autoregressive”, identified via the PACF)

· ar.L2 → coefficient of the 2nd autoregressive lag

· ma.L1 → coefficient of the 1st moving-average lag (“ma” from “moving average”, identified via the ACF)

and similarly for the other components.

Forecasts against Training/Test sets:

Image 8 — Forecasts on Train set from 1st ARIMA model (4,1,2)
Image 9 — Forecasts on Test set from 1st ARIMA model (4,1,2)

The forecasts on the train data are kind of OK, looking mostly like a lagged Time Series. However, the forecasts on the test set are not good: they are mostly a straight line, with an average error of around 42 sunspots per month (i.e. the RMSE).

We want to create reasonable forecasts on the test set and minimize the AIC and RMSE.

Let’s consider this model, ARIMA(4,1,2), as our baseline and re-iterate with new parameters.


Re-iteration

We can see in the PACF of Image 6 that, apart from the first few lags, the rest don’t have high correlation coefficients.

Maybe we can try with p = 2 and q = 2 instead:

Image 10 — Results of 2nd ARIMA model (2,1,2)
Image 11 — Forecasts on Test set from 2nd ARIMA model (2,1,2)

Again, the forecasts on the test set are practically a straight line, so no patterns are captured. The AIC and RMSE have not decreased either.

It seems that trying different values of p, d, q alone won’t give a drastic improvement. For simplicity, let’s keep the last model, ARIMA(2,1,2), and experiment with the seasonal components instead.


Step 4: Identify the seasonal components of the model

There is a strong indication that seasonality exists, mostly from the seasonal component in Image 4, which shows a stable, repeating pattern. According to our previous theoretical tutorial: “D=1 if the series has a stable seasonal pattern over time”. As such, we select D = 1.

The data is at a monthly level, and according to our theoretical tutorial: “S is equal to the ACF lag with the highest value (typically at a high lag)”. Looking at Image 6, we see a significant spike at lag 12.

All these indicate that S = 12.

Let’s now look at what we said about P and Q (remember, on the differenced Time Series!):

· P≥1 if the ACF is positive at lag S, else P=0.

· Q≥1 if the ACF is negative at lag S, else Q=0.

· Rule of thumb: P+Q≤2

This makes us select P = 1, Q = 0, so let’s try the following model: SARIMA(2,1,2)(1,1,0,12)

# import the necessary modules
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# create and fit the model
model_fit = ARIMA(train,  
                  order = (2,1,2)
                  , seasonal_order = (1,1,0,12)
                  ).fit()
print(model_fit.summary())
print('\n')

#####################################################################################

# create forecasts on training set (to evaluate how the model behaves to known-training data)
forecasts_on_train = model_fit.predict()
# create forecasts on test set (to evaluate how the model behaves to unknown-test data)
forecasts_on_test  = model_fit.forecast(len(test))
# calculate the root mean squared error on the test set
RMSE = np.sqrt(mean_squared_error(test['Sunspots'], forecasts_on_test))
# print the AIC and RMSE 
print('AIC: ' , model_fit.aic)
print('RMSE: ', RMSE)

#####################################################################################

# plot the train and test data against their corresponding forecasts
# on train data
plt.figure(figsize=(16,4))
plt.plot(train['Sunspots'], label="Actual")
plt.plot(forecasts_on_train, label="Predicted")
plt.legend()
# on test data
plt.figure(figsize=(16,4))
plt.plot(test['Sunspots'], label="Actual")
plt.plot(forecasts_on_test, label="Predicted")
plt.legend()        

Note that the only thing that changes is the “seasonal_order” parameters in the model training.

Results:

Image 12 — Results of 1st SARIMA model (2,1,2)(1,1,0,12)


Forecasts on Train and Test set:

Image 13 — Forecasts on Train set from SARIMA (2,1,2)(1,1,0,12)
Image 14 — Forecasts on Test set from SARIMA (2,1,2)(1,1,0,12)


We decreased the RMSE, and the forecasts are no longer a straight line.

However, the results are still not acceptable. It seems we accounted for the seasonality component, but in the wrong way. What could have gone wrong?

Let’s look closely at Image 6 again:

Image 15 — Identification of S

When trying to identify the seasonal parameters from the ACF/PACF plots, we made another point in our initial theoretical article:

P and Q can be identified when there is a significant spike at the S-lag in the PACF and ACF (respectively).

We can see in the above image that, although the monthly data would suggest S = 12, the highest (and furthest) significant lag is at 15. That means there might be a seasonality of 15 months in the dataset.

This may be true, especially when we look at the decomposition plots in Image 4, where it is not 100% clear that the seasonality repeats every 12 months.

Let’s try this aspect as well, with the model SARIMA(2,1,2)(1,1,0,15) :

Image 16 — Results of SARIMA (2,1,2)(1,1,0,15)
Image 17 — Forecasts on Test set from SARIMA (2,1,2)(1,1,0,15)


Now that’s for sure an improvement!

We decreased the RMSE by about 11 sunspots per month, and the forecasts now seem to follow the actual values. Thinking a bit outside the box with the seasonal parameters did the trick.

All the parameters also appear to be statistically significant, judging by the results table.

Feel free to experiment more and improve the model, but we will stay with this for now.

Keep in mind that ARIMA models tend to perform poorly over longer horizons. The 12 months in our example might already be “too much”.


Step 5: Model Evaluation

As a last step, let’s create the Diagnostics Plot for our best model so far, SARIMA(2,1,2)(1,1,0,15):

model_fit.plot_diagnostics(figsize=(14,7))
plt.show()        
Image 18 — Diagnostic Plots for SARIMA (2,1,2)(1,1,0,15)

The “plot_diagnostics” output contains:

  1. Standardized residuals: there should be no obvious patterns, and here there are none.
  2. Histogram plus KDE estimate: the KDE curve should closely match the normal distribution.
  3. Normal Q-Q: most of the data points should lie on the straight line.
  4. Correlogram: correlations for lags greater than one should not be significant (at the 95% level).

Our model is good. Well done!


Conclusion

In this article we worked through an example of how to actively and manually select parameters for your (S)ARIMA model. We followed the guidelines from my previous articles and saw that they can be successfully applied to forecast an unfamiliar Time Series.

What’s more, we discovered that sometimes we need to think outside the box and be more creative (and observant, at the same time) with the model’s parameters. A simple detail, like S, might guide you to the solution.

Please keep in mind that these guidelines usually serve as a starting point in your forecasting project. You can (and should) re-iterate and try more combinations of parameters to achieve the best results.

In the next article, we will focus on a more difficult example with exogenous variables. Have a look!

You can find the Jupyter notebook here: https://www.kaggle.com/code/billykal/notebookfa7b71247e

You can also read my original article published on Medium by In Plain English:

Time Series Episode 2: What happens with strong seasonality | by Vasilis Kalyvas | Dec, 2023 | Python in Plain English (medium.com)

Thanks for reading!
