Time Series Episode 0: Familiarize with ARIMA and its parameters
Vasilis Kalyvas
Senior Data Scientist at Coca-Cola HBC | AI/ML online articles & tutorials
Introduction
A Time Series is a series of data points ordered in time. Simple as that.
Each data point in a Time Series corresponds to a specific moment in time and represents some numerical value. We analyze Time Series to understand patterns, trends, and behaviors over time.
We have all faced Time Series in our everyday lives. We measure our heart rate daily, or even multiple times within the same day; we track the price of an item during Black Friday; we watch the temperature forecast when planning a trip; or we follow the ups and downs of stock prices (ok, maybe that last one is not part of everyone's "everyday life").
Time Series are everywhere, and it is essential for a data professional to have a starting point for modeling and forecasting them. So, in this article, I will briefly describe two very popular statistical models, cover a bit of theory around them (I promise, no maths!), and focus on some tips and tricks for using them effectively in your next forecasting project.
1. ARIMA model
ARIMA stands for “AutoRegressive Integrated Moving Average” and is a popular time series forecasting model. It is used to predict future points in a time series based on past observations.
The ARIMA model combines three key components: AutoRegressive (AR), Integrated (I), and Moving Average (MA).
What about its parameters?
The ARIMA model is denoted as ARIMA(p, d, q), where:
· p — the order of the AutoRegressive (AR) part: the number of lagged observations included in the model.
· d — the degree of differencing (the "Integrated" part): the number of times the series is differenced to make it stationary.
· q — the order of the Moving Average (MA) part: the number of lagged forecast errors included in the model.
When no differencing is needed, d=0 (the "I" term drops out) and ARIMA reduces to an ARMA model.
Although ARIMA can handle data with a trend, it does not support time series with a seasonal component. An extension to ARIMA that supports the direct modeling of the seasonal component of the series is called SARIMA(X).
2. SARIMAX model
SARIMA stands for “Seasonal AutoRegressive Integrated Moving Average” and is an extension of the ARIMA model that incorporates seasonality for time series forecasting.
It is usually extended to SARIMAX to also incorporate external variables.
As a result, this model is designed to handle time series data that exhibits both non-seasonal and seasonal patterns while allowing for the inclusion of additional exogenous, or external, variables.
Apart from the ARIMA components described above (also known as "non-seasonal"), SARIMAX incorporates two additional elements:
· A seasonal component, with its own autoregressive, differencing, and moving-average terms applied at the seasonal lag.
· Exogenous variables (the "X"): external regressors, such as promotions or holidays, that may help explain the target series.
The SARIMAX model is denoted as SARIMAX(p, d, q) × (P, D, Q, S), where:
· p, d, q — the non-seasonal orders, exactly as in ARIMA.
· P — the order of the seasonal AutoRegressive part.
· D — the degree of seasonal differencing.
· Q — the order of the seasonal Moving Average part.
· S — the length of the seasonal cycle (e.g. S=12 for monthly data with a yearly pattern).
3. Stationarity
Stationarity is a key concept in Time Series analysis and forecasting because many models, including ARIMA, work properly when the data is stationary or can be transformed into a stationary series whose properties do not change over time.
Generally speaking, a Time Series is considered stationary when it has:
· a constant mean over time,
· a constant variance over time,
· an autocovariance that depends only on the lag between observations, not on the time at which it is computed.
Time Series with trends, or with seasonality, are not stationary; the trend and seasonality will affect the value of the Time Series at different times.
Real-world Time Series are almost certainly non-stationary. How to solve this when modeling with ARIMA?
→ Compute the differences between consecutive observations: differencing.
Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality. Then we can use statistical tests to confirm that the result is stationary.
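A minimal sketch of differencing with pandas, using a toy series with a linear trend:

```python
import numpy as np
import pandas as pd

# A toy series with a linear trend: clearly non-stationary in the mean.
s = pd.Series(np.arange(10, dtype=float) * 2 + 5)

diff1 = s.diff().dropna()  # first-order differencing (d=1) removes the linear trend
print(diff1.tolist())      # → [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

# Differencing is invertible: the cumulative sum plus the first value
# restores the original series (which is how ARIMA maps forecasts back).
restored = diff1.cumsum() + s.iloc[0]
```

The constant differenced series is exactly what "stabilizing the mean" means: the trend is gone, and only the (here, zero) noise remains.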
4. ACF — PACF plots
ACF (AutoCorrelation Function) and PACF (Partial AutoCorrelation Function) plots are essential tools in Time Series analysis, particularly when working with models like ARIMA. These plots help identify the appropriate orders of the AutoRegressive (AR) and Moving Average (MA) components in a Time Series model.
4.1. ACF (AutoCorrelation Function) Plot:
The ACF plot displays the correlation coefficients between a Time Series and its lagged values.
Each point on the ACF plot represents the correlation between the time series values at time t and the values at previous time steps (lags).
The horizontal axis represents the time lags at which the autocorrelation is calculated. Lag 0 corresponds to the correlation of the time series with itself, lag 1 corresponds to the correlation with the immediately preceding time point, and so on.
The vertical axis represents the correlation coefficients between the time series and its lagged values, ranging from -1 to 1.
Peaks or spikes in the ACF plot indicate significant autocorrelation at certain lags. The height of these spikes represents the strength of the correlation.
Statistical significance is often determined by confidence intervals, typically drawn as a shaded band on the plot. If the autocorrelation values fall outside the confidence band, in either direction, they are considered significant.
The ACF plot helps identify the q parameter (the MA order); I will show how later.
4.2. PACF (Partial AutoCorrelation Function) Plot:
The PACF plot displays the partial correlation coefficients between a time series and its lagged values, removing the effects of intermediate lags. It helps identify the direct relationship between observations at different lags.
The “direct relationship” is the key phrase here. Let’s work on a quick example:
Imagine you want to predict the monthly sales of a clothing shop and your first prediction month is November (t). You take into consideration the previous sales in October (t-1), September (t-2), etc. The sales of September might have an effect on the sales of October and, sequentially, October might have an effect on November. That is an indirect effect of September on November, and it is captured by the (Pearson) correlation coefficients behind the ACF.
So: t-2 affects t-1, and t-1 affects t.
However, September might also have a direct effect on November, too, because there could be some kind of seasonality, such as price promotions every two months (too good to be true!).
So, t-2 directly affects t, meaning that September directly impacts November’s sales.
Both direct and indirect effects are captured by ACF. However, PACF captures only the direct effect.
PACF can then be thought of as a linear regression of the current month's sales on the lagged months' sales; the direct impact of September is simply its coefficient in that regression.
The structure (x and y axes, spikes, confidence band, etc.) is similar to the ACF plot.
The PACF plot helps identify the p parameter (the AR order), as shown later.
5. Tips and tricks for selecting parameters
Now, the interesting stuff! You have a Time Series, how to model with ARIMA models?
In my experience, the following rules and sequence of steps usually work:
Step 1: Examine the stationarity.
Either with a statistical test, such as the Augmented Dickey-Fuller (ADF) test, or by inspecting the trend and seasonal components of the Time Series and/or the ACF & PACF plots (we will see this in detail in the next article).
· Stationary? Move on.
· Not Stationary? Perform differencing:
→ Is there an obvious upward or downward linear trend? Then first-order difference, meaning that d=1.
→ Is there a quadratic trend? Then second-order difference, meaning that d=2 (we rarely want to go much beyond two, in those cases, we might want to think about things like smoothing).
→ Is there a curved upward trend accompanied by increasing variance? Then transform the series with a logarithm or a square root before differencing.
Step 2: Calculate the ACF & PACF plots on the stationary data (e.g. the differenced Time Series)
Keep in mind that it is essential to plot them on stationary data; otherwise, the results will be misleading.
Step 3: Identify the order of AR and MA components
· p = last lag where the PACF value is out of the significance band (displayed by the confidence interval).
· q = last lag where the ACF value is out of the significance band (displayed by the confidence interval).
· Rules of Thumb:
→ If there is a sharp drop in the ACF plot after lag k and a significant spike in the PACF plot at lag k, consider p=k and q=0.
→ If there is a sharp drop in the PACF plot after lag k and a significant spike in the ACF plot at lag k, consider p=0 and q=k.
Step 4: Identify the seasonal components of the model
· S is equal to the ACF lag with the highest value (typically at a high lag).
· D=1 if the series has a stable seasonal pattern over time.
· D=0 if the series has an unstable seasonal pattern over time.
· Rule of thumb: d+D≤2
· P≥1 if the ACF is positive at lag S, else P=0.
· Q≥1 if the ACF is negative at lag S, else Q=0.
· Rule of thumb: P+Q≤2
· Also, P and Q can be identified from a significant spike at lag S in the PACF and ACF plots, respectively. For example, for monthly data S=12 (because there are 12 months in a year).
Step 5. Investigate exogenous variables
You can try adding exogenous variables when they are available and measure how they affect your model's performance. Keep in mind, though, that they will not always be useful.
Step 6. Model Evaluation
We can evaluate an ARIMA/SARIMAX model with Diagnostics for Residuals.
(a residual is the difference between the true and the predicted value)
A good forecasting method will yield residuals with the following properties:
· The residuals are uncorrelated; if correlations remain, there is information left in the residuals that the model should have used.
· The residuals have zero mean; otherwise the forecasts are biased.
· Ideally, the residuals also have constant variance and are approximately normally distributed, which makes prediction intervals more reliable.
Additional tip!
How to decide differencing when looking at the autocorrelation plots:
· If the ACF decays very slowly and stays positive over many lags, the series likely needs (further) differencing.
· If the lag-1 autocorrelation of the differenced series is zero or slightly negative, no further differencing is needed.
· If the lag-1 autocorrelation is strongly negative (say, below -0.5), the series may be over-differenced; reduce d.
Conclusion
In this article I briefly described some key concepts around ARIMA models for Time Series forecasting. We didn't go deep into mathematical details; instead, the goal was to provide some steps and "workarounds" that I have learnt after working on multiple projects. I hope you find them useful, too.
Remember that these workarounds should serve as a starting point for your analysis; most often you will need to try multiple combinations of parameters to find the right ones. And sometimes you may need to turn to another model entirely, because the ARIMA family won't work well.
In the next articles, I will present coding examples (in Python) of ARIMA models for Time Series, each of them focusing on different aspects of the modeling phase.
You can also read my original article published on Medium:
Time Series Episode 0: Familiarize with ARIMA and select parameters | by Vasilis Kalyvas | Nov, 2023 | Python in Plain English (medium.com)
Thanks for reading!