Time Series Episode 0: Familiarize with ARIMA and its parameters
Vasilis Kalyvas
Senior Data Scientist at Coca-Cola HBC | AI/ML online articles & tutorials
Introduction
A Time Series is a series of data points ordered in time. Simple as that.
Each data point in a Time Series corresponds to a specific moment in time and represents some numerical value. We analyze Time Series to understand patterns, trends, and behaviors over time.
We have all faced Time Series in our everyday lives. We measure our heart rate daily, or even multiple times within the same day; we track the price of an item during Black Friday; we watch the temperature forecast when planning a trip; or we follow the ups and downs of stock prices (ok, maybe that last one is not part of everyone's "everyday life").
Time Series are everywhere, and it is essential for a data professional to have a starting point for modeling and forecasting them. So, in this article, I will briefly describe two very popular statistical models, cover a bit of theory around them (I promise, no maths!), and focus on some tips and tricks for using them effectively in your next forecasting project.
1. ARIMA model
ARIMA stands for “AutoRegressive Integrated Moving Average” and is a popular time series forecasting model. It is used to predict future points in a time series based on past observations.
The ARIMA model combines three key components: AutoRegressive (AR), Integrated (I), and Moving Average (MA).
What about its parameters?
The ARIMA model is denoted as ARIMA(p, d, q), where:
· p — the order of the AutoRegressive (AR) part: the number of lagged observations included in the model.
· d — the degree of differencing (the "Integrated" part): the number of times the series is differenced to make it stationary.
· q — the order of the Moving Average (MA) part: the number of lagged forecast errors included in the model.
When no differencing is needed, d=0 (the "I" term drops out) and ARIMA reduces to an ARMA model.
Although ARIMA can handle data with a trend, it does not support time series with a seasonal component. An extension to ARIMA that supports the direct modeling of the seasonal component of the series is called SARIMA(X).
2. SARIMAX model
SARIMA stands for “Seasonal AutoRegressive Integrated Moving Average” and is an extension of the ARIMA model that incorporates seasonality for time series forecasting.
It is usually extended to SARIMAX to also incorporate external variables.
As a result, this model is designed to handle time series data that exhibits both non-seasonal and seasonal patterns while allowing for the inclusion of additional exogenous, or external, variables.
Apart from the ARIMA components described above (also known as "non-seasonal"), SARIMAX incorporates two additional elements:
· A seasonal component, with its own autoregressive, differencing, and moving-average terms applied at the seasonal lag.
· Exogenous variables (the "X"): external regressors, such as promotions or holidays, that may help explain the target series.
The SARIMAX model is denoted as SARIMAX(p, d, q) × (P, D, Q, S), where:
· p, d, q — the non-seasonal orders, exactly as in ARIMA.
· P — the order of the seasonal AutoRegressive part.
· D — the degree of seasonal differencing.
· Q — the order of the seasonal Moving Average part.
· S — the length of the seasonal cycle (e.g. S=12 for monthly data with a yearly pattern).
3. Stationarity
Stationarity is a key concept in Time Series analysis and forecasting because many models, including ARIMA, work properly when the data is stationary or can be transformed into a stationary series whose properties do not change over time.
Generally speaking, a Time Series is considered stationary when it has:
· a constant mean over time,
· a constant variance over time,
· an autocovariance that depends only on the lag between observations, not on the time at which it is computed.
Time Series with trends, or with seasonality, are not stationary; the trend and seasonality will affect the value of the Time Series at different times.
Real-world Time Series are almost certainly non-stationary. How to solve this when modeling with ARIMA?
→ Compute the differences between consecutive observations: differencing.
Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality. Then we can use statistical tests to confirm that the result is stationary.
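A minimal sketch of differencing with pandas, using a toy series with a linear trend:

```python
import numpy as np
import pandas as pd

# A toy series with a linear trend: clearly non-stationary in the mean.
s = pd.Series(np.arange(10, dtype=float) * 2 + 5)

diff1 = s.diff().dropna()  # first-order differencing (d=1) removes the linear trend
print(diff1.tolist())      # → [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

# Differencing is invertible: the cumulative sum plus the first value
# restores the original series (which is how ARIMA maps forecasts back).
restored = diff1.cumsum() + s.iloc[0]
```

The constant differenced series is exactly what "stabilizing the mean" means: the trend is gone, and only the (here, zero) noise remains.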
4. ACF — PACF plots
ACF (AutoCorrelation Function) and PACF (Partial AutoCorrelation Function) plots are essential tools in Time Series analysis, particularly when working with models like ARIMA. These plots help identify the appropriate orders of the AutoRegressive (AR) and Moving Average (MA) components in a Time Series model.
4.1. ACF (AutoCorrelation Function) Plot:
The ACF plot displays the correlation coefficients between a Time Series and its lagged values.
Each point on the ACF plot represents the correlation between the time series values at time t and the values at previous time steps (lags).
The horizontal axis represents the time lags at which the autocorrelation is calculated. Lag 0 corresponds to the correlation of the time series with itself, lag 1 corresponds to the correlation with the immediately preceding time point, and so on.
The vertical axis represents the correlation coefficients between the time series and its lagged values, ranging from -1 to 1.
Peaks or spikes in the ACF plot indicate significant autocorrelation at certain lags. The height of these spikes represents the strength of the correlation.
Statistical significance is often determined by confidence intervals, typically drawn as a shaded band on the plot. If the autocorrelation values fall outside the confidence band, in either direction, they are considered significant.
The ACF plot helps identify the q parameter (the MA order); I will show how later.
4.2. PACF (Partial AutoCorrelation Function) Plot:
The PACF plot displays the partial correlation coefficients between a time series and its lagged values, removing the effects of intermediate lags. It helps identify the direct relationship between observations at different lags.
The “direct relationship” is the key phrase here. Let’s work on a quick example:
Imagine you want to predict the monthly sales of a clothing shop and your first prediction month is November (t). You take into consideration the previous sales in October (t-1), September (t-2), etc. The sales of September might have an effect on the sales of October and, sequentially, October might have an effect on November. That is an indirect effect of September on November, and it is captured by the (Pearson) correlation coefficients behind the ACF.
So: t-2 affects t-1, and t-1 affects t.
However, September might also have a direct effect on November, too, because there could be some kind of seasonality, such as price promotions every two months (too good to be true!).
So, t-2 directly affects t, meaning that September directly impacts November’s sales.
Both direct and indirect effects are captured by ACF. However, PACF captures only the direct effect.
PACF can then be thought of as a linear regression of the current month's sales on the lagged months' sales; the direct impact of September is simply its coefficient in that regression.
The structure (x and y axes, spikes, confidence band, etc.) is similar to the ACF plot.
The PACF plot helps identify the p parameter (the AR order), as shown later.
5. Tips and tricks for selecting parameters
Now, the interesting stuff! You have a Time Series, how to model with ARIMA models?
In my experience, the following rules and sequence of steps usually work:
Step 1: Examine the stationarity.
Either with a statistical test, such as the Augmented Dickey-Fuller (ADF) test, or by inspecting the trend and seasonal components of the Time Series and/or the ACF & PACF plots (we will see this in detail in the next article).
· Stationary? Move on.
· Not Stationary? Perform differencing:
→ Is there an obvious upward or downward linear trend? Then first-order difference, meaning that d=1.
→ Is there a quadratic trend? Then second-order difference, meaning that d=2 (we rarely want to go much beyond two, in those cases, we might want to think about things like smoothing).
→ Is there a curved upward trend accompanied by increasing variance? Then transform the series with a logarithm or a square root before differencing.
Step 2: Calculate the ACF & PACF plots on the stationary data (e.g. the differenced Time Series)
Keep in mind that it is essential to plot them on stationary data; otherwise, the results will be misleading.
Step 3: Identify the order of AR and MA components
· p = last lag where the PACF value is out of the significance band (displayed by the confidence interval).
· q = last lag where the ACF value is out of the significance band (displayed by the confidence interval).
· Rules of Thumb:
→ If there is a sharp drop in the ACF plot after lag k and a significant spike in the PACF plot at lag k, consider p=k and q=0.
→ If there is a sharp drop in the PACF plot after lag k and a significant spike in the ACF plot at lag k, consider p=0 and q=k.
Step 4: Identify the seasonal components of the model
· S is equal to the ACF lag with the highest value (typically at a high lag).
· D=1 if the series has a stable seasonal pattern over time.
· D=0 if the series has an unstable seasonal pattern over time.
· Rule of thumb: d+D≤2
· P≥1 if the ACF is positive at lag S, else P=0.
· Q≥1 if the ACF is negative at lag S, else Q=0.
· Rule of thumb: P+Q≤2
· Also, P and Q can be identified from a significant spike at lag S in the PACF and ACF plots, respectively. For example, for monthly data S=12 (because there are 12 months in a year).
Step 5. Investigate exogenous variables
You can try adding exogenous variables when they are available and measure how they affect your model's performance. Keep in mind, though, that they will not always be useful.
Step 6. Model Evaluation
We can evaluate an ARIMA/SARIMAX model with Diagnostics for Residuals.
(a residual is the difference between the true and the predicted value)
A good forecasting method will yield residuals with the following properties:
· The residuals are uncorrelated; if correlations remain, there is information left in the residuals that the model should have used.
· The residuals have zero mean; otherwise the forecasts are biased.
· Ideally, the residuals also have constant variance and are approximately normally distributed, which makes prediction intervals more reliable.
Additional tip!
How to decide differencing when looking at the autocorrelation plots:
· If the ACF decays very slowly and stays positive over many lags, the series likely needs (further) differencing.
· If the lag-1 autocorrelation of the differenced series is zero or slightly negative, no further differencing is needed.
· If the lag-1 autocorrelation is strongly negative (say, below -0.5), the series may be over-differenced; reduce d.
Conclusion
In this article I briefly described some key concepts around ARIMA models for Time Series forecasting. We didn't go deep into mathematical details; instead, the goal was to provide some steps and "workarounds" that I have learnt after working on multiple projects. I hope you find them useful, too.
Remember that these workarounds should serve as a starting point for your analysis; most often you will need to try multiple combinations of parameters to find the right ones. And sometimes you may need to turn to another model entirely, because the ARIMA family won't work well.
In the next articles, I will present coding examples (in Python) of ARIMA models for Time Series, each of them focusing on different aspects of the modeling phase.
You can also read my original article published on Medium:
Time Series Episode 0: Familiarize with ARIMA and select parameters | by Vasilis Kalyvas | Nov, 2023 | Python in Plain English (medium.com)
Thanks for reading!