TIME SERIES FORECASTING APPROACH

Hi all!

I am Saurav Kumar, a data science student and ex-software engineer.

I am writing this article for the folks out there who are struggling to find a good approach to time series forecasting.

Everyone wants to know the future, and as data scientists we can get close. A time series is a sequence of observations ordered in time, where the interval can be months, days, hours, or seconds. Examples include weather forecasting models, stock market prices, and many more.

LET'S START!!!

What is Time Series Analysis?

It is the analysis of data collected over time. The data should be recorded at consistent intervals over a set period, rather than as points chosen intermittently or randomly. The choice between time series analysis and conventional predictive modelling (linear regression, decision trees, random forests, neural networks) depends on the characteristics of the available data: when it comes to capturing the complexities of temporal sequences, time series methods come out on top.

Before we dive into how to do time series analysis, I would like to bring your attention to a few key points about time series, which are as follows.

Components of Time Series

There are four components of a time series:

  • Trend: A long-term increase or decrease in the series over a period of time.
  • Seasonal: Regular and predictable changes that recur every calendar year.
  • Cyclical: Fluctuations that occur over a longer period of time, essentially more than a year.
  • Irregular: Unsystematic, short-term fluctuations. We will discuss later how to generate and identify these components.

Data Type of Time Series

There are two types:

  • Stationary - Data is stationary when its mean and variance are constant with respect to time.
  • Non-Stationary - When the mean and variance of the data points fluctuate over time, the data is non-stationary.

We will discuss later how to check which type any given time series is.
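To make the distinction concrete, here is a tiny illustrative sketch (with made-up numbers, not the Boston data) comparing a roughly stationary series with a trending, non-stationary one, by checking how much the mean shifts between the two halves of each series:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stationary-looking series: noise around a fixed mean of 10
stationary = rng.normal(loc=10, scale=1, size=200)

# Non-stationary series: the same noise plus a steady upward trend
non_stationary = stationary + 0.05 * np.arange(200)

def half_mean_shift(x):
    """Absolute shift in mean between the first and second half of a series."""
    return abs(x[:100].mean() - x[100:].mean())

print(half_mean_shift(stationary))      # small: the mean is roughly constant
print(half_mean_shift(non_stationary))  # large: the trend moves the mean over time
```

A formal version of this kind of check is the ADF test covered later in the article.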

How to Analyze?

From this point we will proceed step by step. I will be using the Economic Indicators data set obtained from https://data.boston.gov/

This data set contains columns that are considered important factors contributing to the economy of Boston.

Our aim is to understand a basic method for finding meaningful insights. It does not matter which column you choose, because every column (or factor) contributes to the Boston economy.

The process we will follow is:

  • Data Collection and cleaning
  • Visualization of data with respect to time
  • Check whether the data is stationary or non-stationary
  • Develop the ACF and PACF chart to understand the nature
  • Model Building
  • Conclusion from the insights

1) Data Collection and Cleaning

Import the data into a dataframe:

import pandas as pd

df = pd.read_csv('Economic.csv')

Create a new column using the datetime functions in pandas; you can refer to the pandas documentation linked in the references.

df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(DAY=1))        

At this point you can create a separate dataframe that holds the datetime column and the desired column in one place.
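As a sketch of that step (the column names follow this article, but the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the Boston economic-indicator data
df = pd.DataFrame({
    'Year':  [2019, 2019, 2019],
    'Month': [1, 2, 3],
    'hotel_avg_daily_rate': [181.6, 155.8, 164.6],
})
df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(DAY=1))

# Keep the datetime column and the desired column in one place,
# with Date as the index so time-series tools can use it directly
ts = df[['Date', 'hotel_avg_daily_rate']].set_index('Date')
print(ts)
```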

2) Visualization of data with respect to time

You can generate a simple plot of the series with respect to time.

plot_time = df.plot(x='Date', y='hotel_avg_daily_rate', figsize=(12,5))        
Fig- 1

I have used the "hotel_avg_daily_rate" column; you can use any column and check. This plot suggests that my data is seasonal and increasing. For more clarity we can plot a decomposition.

Decomposition - It breaks the series into the components we discussed above and gives a clearer picture.

from statsmodels.tsa.seasonal import seasonal_decompose

# seasonal_decompose needs a DatetimeIndex (or an explicit period, here monthly)
result = df.set_index('Date')[['hotel_avg_daily_rate']].copy()
decompose_result_mult = seasonal_decompose(result, model="multiplicative", period=12)
trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
residual = decompose_result_mult.resid
decompose_result_mult.plot();
Fig-2 Decomposition Graph

From the above plot it is clear that my data is seasonal with an increasing trend. We also get the residual points where the data deviates a little, but overall it looks fine.

3) Check whether the data is stationary or non-stationary

There are multiple ways to check whether a series is stationary or non-stationary; we will use the most widely used method, i.e.

  • Augmented Dickey-Fuller (ADF) Test

i) Augmented Dickey-Fuller (ADF) Test - This test checks the null hypothesis that a unit root is present in the time series data (i.e. that the series is non-stationary).

Let me simplify it for you. The test gives you a test statistic and a p-value. We check whether the p-value is less than 0.05, and then compare the test statistic with the critical values at 1%, 5% and 10%. If the test statistic is more negative than the critical value, you can reject the null hypothesis and conclude that the time series is stationary.

We can perform the ADF test with the piece of code below.

from statsmodels.tsa.stattools import adfuller

def adf_test(timeseries):
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

From the hotel example we got the result below.

Fig-3 ADF Test

This shows that the test statistic is more negative than the 1%, 5% and 10% critical values, and the p-value is also very low. Hence we can conclude that the given time series is stationary.

But what do we do if the data is non-stationary?

Let me remind you of the concept of non-stationary data: fluctuating mean and variance. So we will try to make them constant; in technical terms, we transform the data. There are multiple ways to make data stationary, and we will discuss the most widely used method, i.e.

  • Differencing Transformation

Differencing a time series in discrete time transforms the series into a new series whose values are the differences between consecutive values. The procedure can be applied more than once, giving rise to "first differences", "second differences", "third differences", and so on.

In simple mathematical words

d(x) = x(t) - x(t-1)        

We can use the code below, choosing the desired shift value:

df['x'] = df['x'] - df['x'].shift(1)        

Note - Repeated differencing is not recommended, because each differencing pass loses an observation, and over-differencing can distort the series.
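A minimal sketch of first differencing on a made-up trending series; pandas' `diff()` is equivalent to subtracting the shifted series:

```python
import pandas as pd

# Hypothetical series with a clear upward trend (non-stationary mean)
s = pd.Series([100, 110, 121, 133, 146, 160])

# First difference: d(x) = x(t) - x(t-1), same as s - s.shift(1)
diff1 = s.diff().dropna()  # dropna() removes the NaN created at the first position
print(diff1.tolist())      # [10.0, 11.0, 12.0, 13.0, 14.0]
```

Notice that the differenced series is one observation shorter than the original, which is why each extra pass of differencing costs data.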

So far we have prepared the dataset that we will use for time series analysis.

4) Develop the ACF and PACF chart to understand the nature

Now that the data is prepared, we are good to go and can plot two key graphs:

  • Autocorrelation Function (ACF)
  • Partial Autocorrelation Function (PACF)

We will discuss the main components of each plot, what we can conclude at surface level, and how we can use the information in further analysis.

1) Autocorrelation Function (ACF) - A graphical representation of how a time series correlates with itself at different lags. In simple terms, as we increase the lag from 1 to n, how does the series at each lag correlate with the series at lag 0? It helps identify the AR and MA parameters for an ARIMA model; in particular, the ACF plot usually gives an idea of the MA order. Let's see how to plot it and what it looks like.

We can produce the ACF plot with the piece of code below.

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

fig = plot_acf(df['hotel_avg_daily_rate'], lags=50)
plt.show()

In the code above you can change the number of lags based on how many observations you have and how many lags you want to display.

ACF plot will look like Fig-4

Fig- 4 ACF Plot

NOTE - It is not a hard and fast rule that the ACF plot will look like a sinusoidal wave as in the image above; it can take any shape.

Let's diagnose the plot and understand its basic key points.

1) The numbers 0 to 50 on the x-axis represent the number of lags specified in our code.

2) The y-axis indicates the value of the correlation (between -1 and 1).

3) The correlation can be either positive or negative.

4) The blue shaded band represents the non-significant area, i.e. if a point lies inside the blue band then that lag is considered non-significant and the time series has no meaningful correlation with itself at that lag.

5) Points outside the blue band are known as significant points.

6) The value at lag 0 will always be 1, because at lag 0 (i.e. "no lag") the series is 100% correlated with itself.

What can we conclude from the plot?

1) As the lag increases, the series' correlation with itself decreases.

2) There are five significant points in the plot (1, 11, 12, 13 and 25) which can be used as candidate parameters for the AR and MA terms in our ARIMA model.

3) In our case we got a diminishing sinusoidal wave, which suggests that our data will be better fit by an AR model than by an MA model.

Note - You will not always get a sinusoidal wave, so this conclusion applies only to this graph.

2) Partial Autocorrelation Function (PACF) - It measures the correlation between a time series and lagged versions of itself, after removing the effect of the shorter lags in between. It is best used to choose the AR model parameter; again, this is not a hard and fast rule.

Code for PACF plot

from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

pacf_plot = plot_pacf(df.hotel_avg_daily_rate)
plt.show()
Fig-5 PACF


You read this graph the same way as the ACF plot, and it suggests candidate parameters for the AR model, which here could be 1, 11, 12 or 13.

Refer to the Kaggle notebook in the references for more information on ACF and PACF.

5) Model Building

We are almost there; we now have everything needed to build our model. We will discuss the AR and ARIMA models because these are widely used and, if you are a beginner, they will fulfil your purpose, but there are lots of other models you can explore.

AR Model - An autoregressive model is a statistical model that uses past values of a variable to predict its own future values. It is similar to linear regression, but with the series regressed on its own lags.

The first step is to split the data set into training and test sets. We extract the values from the desired column and hold out the last 13 months for testing (roughly an 80:20 split here; you can choose any ratio, but be careful that the training set is not too small).

X = df['hotel_avg_daily_rate'].values
train, test = X[1:len(X)-13], X[len(X)-13:]        

Once the data is distributed into train and test we will create the model.

from statsmodels.tsa.ar_model import AutoReg

model = AutoReg(train, lags=12)

Note - You need to find the best lag for your model, for which you can perform various tests. One such test is the log-likelihood ratio (LLR) test, which gives you a statistic that helps determine at which lag the model works best.

Once the model is created we will fit the model.

model_fit = model.fit()        

Now we will create the prediction.

predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)        

Now that we have the train set, test set and predictions, we will print the values and plot the points, along with the mean absolute error to check the error.

from math import sqrt
from matplotlib import pyplot
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(test, predictions)
print("Mean Absolute Error", mae)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))  # RMSE is the root of the *squared* error
pyplot.figure(figsize=(14, 6))
pyplot.plot(range(len(train)), train, color='blue', label='Train')
pyplot.plot(range(len(train), len(train) + len(test)), test, color='green', label='Test')
pyplot.plot(range(len(train), len(train) + len(test)), predictions, color='red', label='Predictions')
pyplot.legend()
pyplot.show()

This will give us

Mean Absolute Error 9.070841766441994
predicted=181.596546, expected=179.400000
predicted=155.760302, expected=170.220000
predicted=164.633375, expected=175.290000
predicted=223.276962, expected=225.100000
predicted=283.494022, expected=287.720000
predicted=305.864424, expected=312.010000
predicted=302.325155, expected=314.970000
predicted=279.699210, expected=285.020000
predicted=283.492716, expected=270.540000
predicted=316.591831, expected=312.370000
predicted=332.389502, expected=313.170000
predicted=261.973556, expected=238.820000
predicted=182.299757, expected=183.200000        
Fig-6 Test, Train and Prediction plot

For future forecasting we have to define the horizon for which we need predictions, and generate the future dates along with the future prediction points. This can be achieved with the code below.

forecast_steps = 12  # number of future months to predict
future_dates = pd.date_range(start=df['Date'].iloc[-1], periods=forecast_steps + 1, freq='M')
future_predictions = model_fit.predict(start=len(train), end=len(train) + forecast_steps, dynamic=False)

Which can be represented as below

Fig-7 Future Plot

ARIMA Model - AutoRegressive Integrated Moving Average

  • AR: Autoregression - regresses the series on its own past values
  • I: Integrated - differencing applied to make the series stationary
  • MA: Moving average - regresses the series on past forecast errors

The flow for creating the model is the same as for the AR model, but here we need to identify the AR parameter (p), the degree of differencing (d) and the MA parameter (q):

  • p: Order of the autoregressive (AR) component.
  • d: Degree of differencing (integration).
  • q: Order of the moving average (MA) component.

Once we have identified the p, d and q at which our model works best, we prepare the model, fit it and get the prediction points.

from statsmodels.tsa.arima.model import ARIMA

X = df['stationary_data'].values
train, test = X[1:len(X)-13], X[len(X)-13:]
order = (12, 0, 12)
model = ARIMA(train, order=order)
model_fit = model.fit()
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1)

In the same way we can plot the graph.

6) Conclusion from the insights

Finally we are at the stage where we convey our findings to a non-technical audience, and this is the final step that everyone cares about as far as the business is concerned.

From the predicted points and test points above we can conclude that:

  • Our predictions are quite accurate, because the mean absolute error is low and the predicted points are very close to the test data.

From the graph we can conclude that:

  • There is a steady yearly seasonal pattern in hotel prices, with values increasing by a small amount year after year (the monthly rates running roughly $181, $155, $164, $223, $283, $305, $302). Future rates will follow this seasonal pattern unless something major happens in Boston.

I am sharing my GitHub for reference, where you will find similar steps followed for time series analysis, though the analysis there covers multiple factors and different findings.

GITHUB

REFERENCE

BOOK - Time Series Forecasting in Python by Marco Peixeiro

https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

https://www.kaggle.com/code/iamleonie/time-series-interpreting-acf-and-pacf



Time series prediction can also be achieved with neural networks, for example an RNN (recurrent neural network), which follows a completely different procedure. I hope I have conveyed the basic idea of how to start a time series analysis. Please feel free to discuss further.

