Time Series Episode 0: Familiarize with ARIMA and its parameters

Introduction

A Time Series is a series of data points ordered in time. Simple as that.

Each data point in a Time Series corresponds to a specific moment in time and represents some numerical value. We analyze these values to understand patterns, trends, and behaviors over time.

We have all encountered Time Series in our everyday lives. We measure our heart rate daily, or even multiple times within the same day; we track the price of an item during Black Friday; we watch the temperature when planning a trip; or we follow the ups and downs of stock prices (ok, maybe not part of everyone’s “everyday life”).

Time Series are everywhere, and it is essential for a data professional to have a starting point for modeling and forecasting them. So, in this article, I will briefly describe two very popular statistical models, cover a bit of theory around them (I promise, no maths!), and focus on some tips and tricks for using them effectively in your next forecasting project.

1. ARIMA model

ARIMA stands for “AutoRegressive Integrated Moving Average” and is a popular time series forecasting model. It is used to predict future points in a time series based on past observations.

The ARIMA model combines three key components: AutoRegressive (AR), Integrated (I), and Moving Average (MA).

  1. AutoRegressive (AR): This component refers to the relationship between an observation and its past values. The AR term captures the idea that the value at any given time point is dependent on its previous values.
  2. Integrated (I): The I component represents differencing, which involves taking the difference between consecutive observations to make the time series stationary. Stationarity is important for many time series models, including ARIMA.
  3. Moving Average (MA): The MA term involves modeling the error term as a linear combination of past error terms. It takes into account the influence of past forecast errors on future values.

What about its parameters?

The ARIMA model is denoted as ARIMA(p, d, q), where:

  • p: order of the AutoRegressive (AR) component
  • d: degree of differencing required to make the time series stationary (I component)
  • q: order of the Moving Average (MA) component

When no differencing is needed, then d=0 (the “I” drops out) and ARIMA reduces to an ARMA model.
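As a minimal sketch of what fitting such a model looks like with Python’s statsmodels (the order is a placeholder, not a recommendation, and `series` is an assumed pandas Series with a time index):

```python
from statsmodels.tsa.arima.model import ARIMA

# `series` is assumed to be a pandas Series with a DatetimeIndex
model = ARIMA(series, order=(1, 1, 1))  # order=(p, d, q); (1, 1, 1) is a placeholder
result = model.fit()

print(result.summary())               # coefficients, AIC/BIC, etc.
forecast = result.forecast(steps=12)  # predict the next 12 periods
```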

Although ARIMA can handle data with a trend, it does not support time series with a seasonal component. An extension to ARIMA that supports the direct modeling of the seasonal component of the series is called SARIMA(X).

2. SARIMAX model

SARIMA stands for “Seasonal AutoRegressive Integrated Moving Average” and is an extension of the ARIMA model that incorporates seasonality for time series forecasting.

It is usually extended to SARIMAX to also incorporate external variables.

As a result, this model is designed to handle time series data that exhibits both non-seasonal and seasonal patterns while allowing for the inclusion of additional exogenous, or external, variables.

Apart from the ARIMA components described previously (also known as “non-seasonal”), SARIMAX incorporates two additional ones:

  1. Seasonal Component (S): SARIMAX includes parameters to capture the seasonality of the time series data. This is crucial for modeling and forecasting data that exhibits periodic patterns at regular intervals.
  2. Exogenous Variables (X): SARIMAX allows for the inclusion of external variables (exogenous factors) that may influence the time series. These additional variables can improve the accuracy of the model by accounting for external factors not inherent in the time series itself.

The SARIMAX model is denoted as SARIMAX(p, d, q) × (P, D, Q, S), where:

  • p, d, q: are the non-seasonal ARIMA parameters
  • P, D, Q, S: are the seasonal parameters, representing the seasonal ARIMA components (similar to non-seasonal but applied to the seasonal part)
  • X: represents the exogenous variables
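As a hedged sketch of fitting such a model with statsmodels (the orders are placeholders, and `series`, `exog`, and `exog_future` are assumed pandas objects aligned on the same index):

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(series,
                exog=exog,                     # the X: exogenous regressors
                order=(1, 1, 1),               # (p, d, q)
                seasonal_order=(1, 1, 1, 12))  # (P, D, Q, S); S=12 for monthly data
result = model.fit(disp=False)

# forecasting requires the *future* values of the exogenous variables
forecast = result.forecast(steps=12, exog=exog_future)
```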

3. Stationarity

Stationarity is a key concept in Time Series analysis and forecasting because many models, including ARIMA, work properly only when the data is stationary, or when it can be transformed into a stationary series whose statistical properties do not change over time.

Generally speaking, a Time Series is considered stationary when it has:

  • Constant Mean: The mean of the time series is constant over time.
  • Constant Variance: The variance of the time series is constant over time.
  • Constant Autocorrelation: The autocorrelation structure of the series depends only on the lag between observations, not on the point in time.

Time Series with trends, or with seasonality, are not stationary; the trend and seasonality will affect the value of the Time Series at different times.

Real-world Time Series are almost always non-stationary. How do we solve this when modeling with ARIMA?

→ Compute the differences between consecutive observations: differencing.

Differencing can help stabilize the mean of a time series by removing changes in its level, thereby eliminating (or reducing) trend and seasonality. We can then use statistical tests to confirm that the result is stationary.
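In pandas, differencing is a one-liner; a minimal sketch, assuming `series` is a pandas Series:

```python
# first-order (non-seasonal) differencing: y'_t = y_t - y_(t-1)
diffed = series.diff().dropna()

# seasonal differencing, e.g. lag 12 for monthly data with yearly seasonality
seasonal_diffed = series.diff(12).dropna()
```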

4. ACF — PACF plots

ACF (AutoCorrelation Function) and PACF (Partial AutoCorrelation Function) plots are essential tools in Time Series analysis, particularly when working with models like ARIMA. These plots help identify the appropriate orders of the AutoRegressive (AR) and Moving Average (MA) components in a Time Series model.

ACF (left) — PACF (right) plots. Image by Author.

4.1. ACF (AutoCorrelation Function) Plot:

The ACF plot displays the correlation coefficients between a Time Series and its lagged values.

Each point on the ACF plot represents the correlation between the time series values at time t and the values at previous time steps (lags).

The horizontal axis represents the time lags at which the autocorrelation is calculated. Lag 0 corresponds to the correlation of the time series with itself (always equal to 1), lag 1 corresponds to the correlation with the immediately preceding time point, and so on.

The vertical axis represents the correlation coefficients between the time series and its lagged values, ranging from -1 to 1.

Peaks or spikes in the ACF plot indicate significant autocorrelation at certain lags. The height of these spikes represents the strength of the correlation.

Statistical significance is often determined by confidence intervals (the blue area in the above image). If the autocorrelation values fall outside the confidence interval bands, either going up or down, they are considered significant.

The ACF plot helps identify the q parameter (MA); I will show how later.

4.2. PACF (Partial AutoCorrelation Function) Plot:

The PACF plot displays the partial correlation coefficients between a time series and its lagged values, removing the effects of intermediate lags. It helps identify the direct relationship between observations at different lags.

The “direct relationship” is the key phrase here. Let’s work on a quick example:

Imagine you want to predict the monthly sales of a clothing shop and your first prediction month is November (t). You take into consideration the previous sales in October (t-1), September (t-2), etc. The sales of September might affect the sales of October and, in turn, October might affect November. That is an indirect effect of September on November, and plain (Pearson) correlation captures it.

So: t-2 affects t-1 and t-1 affects t.

However, September might also have a direct effect on November, because there could be some kind of seasonality, such as price promotions every two months (too good to be true!).

So, t-2 directly affects t, meaning that September directly impacts November’s sales.

Both direct and indirect effects are captured by ACF. However, PACF captures only the direct effect.

PACF at lag 2 is then just the coefficient of September (t-2) in a linear regression of November’s sales on the sales of the two previous months; the indirect path through October is controlled for.
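To make the “just a regression coefficient” point concrete, here is a toy sketch (synthetic data, hypothetical names) that compares the OLS lag-2 coefficient with statsmodels’ lag-2 PACF; the two should closely match:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(42)
sales = np.cumsum(rng.normal(0, 1, 300)) + 100  # toy "monthly sales" series

# regress sales_t on sales_(t-1) and sales_(t-2)
y = sales[2:]
X = sm.add_constant(np.column_stack([sales[1:-1], sales[:-2]]))
lag2_coef = sm.OLS(y, X).fit().params[2]

# PACF at lag 2, computed by the same regression method
print(lag2_coef, pacf(sales, nlags=2, method="ols")[2])
```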

The structure (x and y axes, spikes, etc.) is similar to the ACF plot.

The PACF plot helps identify the p parameter (AR), as shown later.
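Both plots are typically produced with statsmodels; a minimal sketch, assuming `series` is a (stationary) pandas Series and 24 lags is an arbitrary window:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(series, lags=24, ax=axes[0])   # helps pick q (MA order)
plot_pacf(series, lags=24, ax=axes[1])  # helps pick p (AR order)
plt.show()
```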

5. Tips and tricks for selecting parameters

Now, the interesting stuff! You have a Time Series; how do you model it with ARIMA models?

From my experience, the following rules and sequence of steps usually work:

Step 1: Examine stationarity.

Either with the Augmented Dickey-Fuller (ADF) statistical test, or by inspecting the trend and seasonal components of the Time Series and/or the ACF & PACF plots (we will see this in detail in the next article).
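A minimal sketch of the ADF test (null hypothesis: the series has a unit root, i.e. it is non-stationary), assuming `series` is a pandas Series:

```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(series.dropna())
if p_value < 0.05:
    print(f"p={p_value:.3f}: reject H0 -> likely stationary")
else:
    print(f"p={p_value:.3f}: cannot reject H0 -> difference the series")
```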

· Stationary? Move on.

· Not Stationary? Perform differencing:

→ Is there an obvious upward or downward linear trend? Then first-order difference, meaning that d=1.

→ Is there a quadratic trend? Then second-order difference, meaning that d=2 (we rarely want to go much beyond two, in those cases, we might want to think about things like smoothing).

→ Is there a curved upward trend accompanied by increasing variance? Then transform the series with either a logarithm or a square root (a quick sketch of these transforms follows below).
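A minimal sketch of the variance-stabilizing transforms, again assuming `series` is a pandas Series:

```python
import numpy as np

# variance-stabilizing transforms; the log requires strictly positive values
log_series = np.log(series)
sqrt_series = np.sqrt(series)

# if a trend remains after transforming, difference the transformed series
log_diffed = np.log(series).diff().dropna()
```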

Step 2: Calculate the ACF & PACF plots on the stationary data (e.g. the differenced Time Series)

Keep in mind that it is essential to plot them on stationary data; otherwise the results will be wrong or misinterpreted.

Step 3: Identify the order of AR and MA components

· p = last lag where the PACF value is out of the significance band (displayed by the confidence interval).

· q = last lag where the ACF value is out of the significance band (displayed by the confidence interval).

Example:

Identify p,q from plots. Image by Author.
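The same “last lag outside the band” heuristic can be sketched programmatically. A rough sketch: `diffed` is assumed to be the stationary series from the earlier differencing step, and the 24-lag window is arbitrary.

```python
from statsmodels.tsa.stattools import acf, pacf

def last_significant_lag(values, conf_int):
    # a lag is significant when its confidence interval excludes zero
    sig = [k for k in range(1, len(values))
           if conf_int[k, 0] > 0 or conf_int[k, 1] < 0]
    return sig[-1] if sig else 0

acf_vals, acf_conf = acf(diffed, nlags=24, alpha=0.05)
pacf_vals, pacf_conf = pacf(diffed, nlags=24, alpha=0.05)

q_candidate = last_significant_lag(acf_vals, acf_conf)    # MA order
p_candidate = last_significant_lag(pacf_vals, pacf_conf)  # AR order
```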

· Rules of Thumb:

→ If there is a sharp drop in the ACF plot after lag k and a significant spike in the PACF plot at lag k, consider p=k and q=0.

→ If there is a sharp drop in the PACF plot after lag k and a significant spike in the ACF plot at lag k, consider p=0 and q=k.

Step 4: Identify the seasonal components of the model

· S is the seasonal period: on the ACF plot it appears as the lag with the highest recurring spike (typically at a high lag, e.g. 12 for monthly data with yearly seasonality).

· D=1 if the series has a stable seasonal pattern over time.

· D=0 if the series has an unstable seasonal pattern over time.

· Rule of thumb: d+D≤2

· P≥1 if the ACF is positive at lag S, else P=0.

· Q≥1 if the ACF is negative at lag S, else Q=0.

· Rule of thumb: P+Q≤2

· Also, P and Q can be identified when there is a significant spike at lag S in the PACF and ACF, respectively. Example for monthly data, where S=12 (because there are 12 months in a year):

Identify P,Q from plots. Image by Author.
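For completeness, here is a crude programmatic version of the S heuristic; visual inspection of the ACF plot is usually more reliable, and `series`, the 36-lag window, and the number of skipped initial lags are all assumptions:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

# look for the largest ACF spike beyond the first few lags
acf_vals = acf(series, nlags=36)
S = int(np.argmax(acf_vals[4:]) + 4)  # e.g. S=12 for monthly data with yearly seasonality
print("candidate seasonal period:", S)
```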

Step 5. Investigate exogenous variables

You can try to add more exogenous variables when they are available and measure the effectiveness of your model. Keep in mind, though, that they will not always be useful.

Step 6. Model Evaluation

We can evaluate an ARIMA/SARIMAX model with residual diagnostics (a residual is the difference between the true and the predicted value); see the code sketch after the list below.

A good forecasting method will yield residuals with the following properties:

  1. The residuals are uncorrelated. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.
  2. The residuals have zero mean. If the residuals have a mean other than zero, then there are underlying patterns in the data.
  3. The residuals have constant variance.
  4. The residuals are normally distributed.
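statsmodels makes these checks straightforward; a minimal sketch, assuming `result` is a fitted ARIMA/SARIMAX results object like the ones from the earlier sketches:

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

# standardized residuals, histogram + KDE, Q-Q plot, and correlogram in one figure
result.plot_diagnostics(figsize=(10, 8))

# Ljung-Box test; H0 = residuals are uncorrelated (we want a large p-value)
print(acorr_ljungbox(result.resid, lags=[10], return_df=True))
```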

Additional tip!

How to decide differencing when looking at the autocorrelation plots:

  1. If the ACF and PACF do not tail off, but instead have values that stay close to 1 over many lags, the series is non-stationary and differencing will be needed. Try a first difference and then look at the ACF and PACF of the differenced data.
  2. If the series autocorrelations are non-significant, then the series is random (white noise: the data are independent and identically distributed, so the ordering carries no information). We are ok at that point.
  3. If first differences were necessary and all the differenced autocorrelations are non-significant, then the original series is called a random walk and we are ok. The data is dependent and not identically distributed; in fact, the variance (and the mean, if there is drift) increases through time.

Conclusion

In this article I briefly described some key concepts around ARIMA models for Time Series forecasting. We didn’t get much into mathematical details; instead, the goal was to provide some steps and “workarounds” that I have learnt after working on multiple projects. I hope you find them useful, too.

Remember that these workarounds should serve as a starting point for your analysis; most often you need to try multiple combinations of parameters to find the right ones. And some other times, you may need to turn to another model entirely, because the ARIMA family won’t work well.

In the next articles, I will present coding examples (in Python) of ARIMA models for Time Series, each of them focusing on different aspects of the modeling phase.

You can also read my original article published on Medium:

Time Series Episode 0: Familiarize with ARIMA and select parameters | by Vasilis Kalyvas | Nov, 2023 | Python in Plain English (medium.com)

Thanks for reading!
