TIME SERIES FORECASTING APPROACH
Hi all!
I am Saurav Kumar, a data science student and ex-software engineer.
I am writing this article for everyone who is struggling to find a good approach to time series forecasting.
Everyone wants to know the future, and as data scientists we can get close. A time series is a sequence of observations ordered in time, where the interval can be months, days, hours or seconds. Familiar examples include weather forecasting models, stock prices and many more.
LET'S START!!!
What is Time Series Analysis?
It is the analysis of data collected over time. The data should be recorded at consistent intervals over a set period, rather than as points chosen intermittently or at random. The choice between time series analysis and conventional predictive modelling (linear regression, decision trees, random forests, neural networks) depends on the characteristics of the data: when the goal is to capture the structure of a temporal sequence, time series methods come out on top.
Before we dive into how to do time series analysis, I would like to bring your attention to a few key points about time series, which are as follows.
Components of Time Series
There are 4 components of a time series: trend, seasonality, cyclic variation and the residual (irregular) component.
Data Type of Time Series
There are two types: stationary and non-stationary.
We will discuss later how to check which type a given time series is.
How to Analyze?
From this point we will proceed step by step. I will be using the Economic Indicators dataset obtained from https://data.boston.gov/
This dataset contains columns that are considered important factors contributing to the economy of Boston.
Our aim is to understand a basic method for finding meaningful analysis. It does not matter which column you choose, because every column (or factor) contributes to the Boston economy.
The process we will follow is:
1) Data Collection and Cleaning
Import the data into a data frame:
import pandas as pd

df = pd.read_csv('Economic.csv')
Create a new Date column by combining the Year and Month columns with pandas' to_datetime function:
df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(DAY=1))
At this point you can create a separate dataframe that holds the datetime column and the desired column together.
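Putting step 1 together, here is a minimal, self-contained sketch. The inline data stands in for pd.read_csv('Economic.csv'), and the column names follow the Boston dataset:

```python
import pandas as pd

# Tiny frame standing in for the Boston economic-indicator CSV;
# in practice you would load it with pd.read_csv('Economic.csv')
df = pd.DataFrame({
    'Year':  [2013, 2013, 2013],
    'Month': [1, 2, 3],
    'hotel_avg_daily_rate': [179.4, 170.2, 175.3],
})

# Combine Year and Month into a proper datetime column (day fixed to 1)
df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(DAY=1))

# Keep the datetime column and the series of interest in one place
ts = df[['Date', 'hotel_avg_daily_rate']].set_index('Date')
print(ts.index.min())  # 2013-01-01 00:00:00
```

Setting the Date column as the index gives the frame a DatetimeIndex, which the plotting and decomposition steps below rely on.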
2) Visualization of data with respect to time
You can generate a simple plot of the series against time:
plot_time = df.plot(x='Date', y='hotel_avg_daily_rate', figsize=(12,5))
I have used the "hotel_avg_daily_rate" column; you can pick any column and check. This plot suggests that my data is seasonal and increasing. For more clarity we can plot a decomposition.
Decomposition - It breaks the series into the components we discussed above and gives a clearer picture.
from statsmodels.tsa.seasonal import seasonal_decompose

# seasonal_decompose needs a DatetimeIndex with an inferable frequency
# (or an explicit period argument, e.g. period=12 for monthly data)
result = df.set_index('Date')[['hotel_avg_daily_rate']].copy()
decompose_result_mult = seasonal_decompose(result, model="multiplicative")
trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
residual = decompose_result_mult.resid
decompose_result_mult.plot();
The plot above confirms that the data is seasonal with an increasing trend. The residuals show where the data deviates a little, but overall it looks fine.
3) Check whether the series is Stationary or Non-Stationary
There are multiple ways to check for stationarity; we will look at the most widely used method, i.e.
i) Augmented Dickey-Fuller (ADF) Test - This test checks the null hypothesis that a unit root is present in the time series.
Let me simplify. The test returns a test statistic and a p-value. Check whether the p-value is less than 0.05, and compare the test statistic with the critical values at 1%, 5% and 10%. If the test statistic is more negative than the critical value, you can reject the null hypothesis and conclude that the time series is stationary.
We can perform the ADF test with the code below:
from statsmodels.tsa.stattools import adfuller

def adf_test(timeseries):
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
For the hotel example we got the result below.
It shows that the test statistic is more negative than the 1%, 5% and 10% critical values, and the p-value is also very low. Hence we can conclude that the given time series is stationary.
But what if the data is non-stationary?
Remember that non-stationary data has a fluctuating mean and variance, so we try to make them constant; in technical terms, we transform the data. There are multiple ways to make a series stationary, and we will discuss the most widely used one, i.e.
Differencing of a time series in discrete time transforms the series into a new one whose values are the differences between consecutive values. The procedure can be applied more than once, giving rise to the "first differences", "second differences", "third differences" and so on.
In simple mathematical terms:
d(x) = x(t) - x(t-1)
We can use the code below, choosing the desired shift value (this is equivalent to pandas' df['x'].diff(1)):
df['x'] = df['x'] - df['x'].shift(1)
Note - Repeated differencing is not recommended, because each pass drops observations and loses information.
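A small sketch of first differencing in pandas (the numbers are made up for illustration); .diff(1) is equivalent to subtracting the shifted series, and the NaN it leaves at the start is exactly the data loss the note refers to:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])

# First difference: d(x) = x(t) - x(t-1)
diff1 = s.diff(1)          # same as s - s.shift(1)
print(diff1.tolist())      # first element is NaN

# Drop the NaN introduced by differencing before further analysis
diff1 = diff1.dropna()
print(diff1.tolist())      # [2.0, 3.0, -1.0, 4.0]
```

After differencing you would rerun the ADF test on the differenced series to confirm it is now stationary.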
So far we have prepared our dataset which we will use to perform time series analysis.
4) Plot the ACF and PACF charts to understand the nature of the series
Now that the data is prepared, we can plot the two main graphs: the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF).
We will discuss the main components of each plot, what we can conclude at a surface level, and how to use them in further analysis.
1) Autocorrelation Function (ACF) - A graphical representation of how a time series correlates with itself at different lags. In simple terms: as we increase the lag from 1 to n, how does the series shifted by that lag correlate with the original series? It helps identify the AR and MA parameters for the ARIMA model; in particular, the ACF plot usually gives an idea of the MA order. Let's see how to plot it and what it looks like.
We can obtain the ACF plot with the following piece of code:
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

fig = plot_acf(df['hotel_avg_daily_rate'], lags=50)
plt.show()
In the code above you can change lags based on how many observations you have and how many lags you want to display.
ACF plot will look like Fig-4
NOTE - It is not a hard and fast rule that the ACF plot will look like a sinusoidal wave as in the image above; it can take any shape.
Let's diagnose the plot and cover its basic key points.
1) The numbers from 0 to 50 on the x-axis are the lags we requested in the code.
2) The y-axis shows the correlation value (between −1 and 1).
3) Correlations can be both positive and negative.
4) The blue shaded band is the non-significance region: if a point lies inside it, that lag is considered non-significant, i.e. the series has no meaningful correlation with itself at that lag.
5) Points outside the blue band are the significant ones.
6) The value at lag 0 is always 1, because "no lag" means the series is 100% correlated with itself.
What can we conclude from the plot?
1) As the lag increases, the series' correlation with itself decreases.
2) There are 5 significant lags in the plot (1, 11, 12, 13 and 25), which are candidates for the AR and MA parameters of our ARIMA model.
3) In our case we got a decaying sinusoidal wave, which suggests our data is better fit by an AR model than an MA model.
Note - You will not always get a sinusoidal wave, so this conclusion applies only to this graph.
2) Partial Autocorrelation Function (PACF) - It measures the correlation between a time series and a lagged version of itself after removing the effect of all shorter lags. It is best used to choose the AR order; again, this is not a hard and fast rule.
Code for the PACF plot:
from statsmodels.graphics.tsaplots import plot_pacf

pacf_plot = plot_pacf(df.hotel_avg_daily_rate)
plt.show()
You read this graph the same way as the ACF plot; it suggests candidate AR parameters, which here could be 1, 11, 12 or 13.
5) Model Building
We are almost there; we now have everything needed to build our model. We will discuss the AR and ARIMA models because they are widely used and, if you are a beginner, they will serve your purpose, but there are many other models you can explore.
AR Model - An autoregressive model is a statistical model that uses past values of a series to predict its own future values. It is similar to linear regression, but not identical: the predictors are lagged values of the target itself.
The first step is to split the dataset into training and test sets. We extract the values from the desired column and split them roughly 80:20 (here, the last 13 points are held out as the test set); you can choose any ratio, but be careful that the training set is not too small.
X = df['hotel_avg_daily_rate'].values
train, test = X[1:len(X)-13], X[len(X)-13:]
Once the data is split into train and test, we create the model:
from statsmodels.tsa.ar_model import AutoReg

model = AutoReg(train, lags=12)
Note - You need to find the best lag for your model, for which you can run different tests. One such test is the log-likelihood ratio (LLR) test, which gives you a statistic that helps determine at which lag the model works best.
Once the model is created we will fit the model.
model_fit = model.fit()
Now we will create the prediction.
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
Now that we have the train set, test set and predictions, we will print the values and plot the points, along with the mean absolute error (MAE) to check the error.
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error
from matplotlib import pyplot

mae = mean_absolute_error(test, predictions)
print("Mean Absolute Error", mae)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
# RMSE is the square root of the mean squared error (not of the MAE)
rmse = sqrt(mean_squared_error(test, predictions))
pyplot.figure(figsize=(14, 6))
pyplot.plot(range(len(train)), train, color='blue', label='Train')
pyplot.plot(range(len(train), len(train) + len(test)), test, color='green', label='Test')
pyplot.plot(range(len(train), len(train) + len(test)), predictions, color='red', label='Predictions')
pyplot.legend()
pyplot.show()
This will give us
Mean Absolute Error 9.070841766441994
predicted=181.596546, expected=179.400000
predicted=155.760302, expected=170.220000
predicted=164.633375, expected=175.290000
predicted=223.276962, expected=225.100000
predicted=283.494022, expected=287.720000
predicted=305.864424, expected=312.010000
predicted=302.325155, expected=314.970000
predicted=279.699210, expected=285.020000
predicted=283.492716, expected=270.540000
predicted=316.591831, expected=312.370000
predicted=332.389502, expected=313.170000
predicted=261.973556, expected=238.820000
predicted=182.299757, expected=183.200000
For future forecasting we must decide how many steps ahead we want predictions, then generate the future dates together with the predicted points. This can be achieved with the code below (assuming forecast_steps has been set and the dataframe has a monthly DatetimeIndex):
future_dates = pd.date_range(start=df.index[-1], periods=forecast_steps + 1, freq='M')
future_predictions = model_fit.predict(start=len(train), end=len(train) + forecast_steps, dynamic=False)
These can then be plotted in the same way as before.
ARIMA Model - AutoRegressive Integrated Moving Average
The flow for creating the model is the same as for the AR model, but here we need to identify the AR order (p), the degree of differencing (d) and the MA order (q).
Once we have identified the p, d, q at which our model works best, we prepare the model, fit it, and get the prediction points.
from statsmodels.tsa.arima.model import ARIMA

X = df['stationary_data'].values
train, test = X[1:len(X)-13], X[len(X)-13:]
order = (12, 0, 12)
model = ARIMA(train, order=order)
model_fit = model.fit()
# the modern statsmodels ARIMA API returns predictions on the original
# scale, so the old typ='levels' argument is no longer needed
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1)
In the same way, we can plot the graph.
6) Conclusion from the insights
Finally we are at the stage where we convey our findings to a non-technical audience, and this is the final step everyone cares about as far as the business is concerned.
By comparing the predicted points with the test points, and by inspecting the plot, we can judge how closely the model tracks the actual series and communicate that accuracy to stakeholders.
I am sharing my GitHub for reference, where you will find similar steps followed for time series analysis, though that analysis covers multiple factors and reaches different findings.
REFERENCE
BOOK - Time Series Forecasting in Python by Marco Peixeiro
Google Search
Time series prediction can also be achieved with neural networks, e.g. an RNN (Recurrent Neural Network), which follows a completely different procedure. I hope I have conveyed the basic idea of how to start time series analysis. Please feel free to discuss further.