How Machine Learning is Transforming Supply Chain (Part 3): ARIMA (Fremont Bridge Seattle Cycle Count)
Ravi Prakash
Senior Manager, Planning and Business Systems, Johnson and Johnson, APAC, MedTech
What is in this article?
Data
The Fremont Bridge Bicycle Counter (pic above) began operation in October 2012 and records the number of bikes that cross the bridge using the pedestrian/bicycle pathways. I have never been to this bridge, but many experts in the ML world (Introduction to Machine Learning with Python: Andreas Müller and Sarah Guido; Python Data Science Handbook: Jake VanderPlas) have looked at this data to explain the complexity involved in ML. Obviously, what we will do here will be different! You can access this data from the link below –
I would like to thank the city authorities for making this data available to ML enthusiasts like me.
Approach and Pre-requisites
Unlike previous articles in this series, I will keep the Python code visible so that you can follow along. You must have Jupyter Notebook (or any similar notebook) installed on your PC; if not, browse the internet and you can get it for free! The last requirement is time commitment. ARIMA is not as simple as smoothing models, so this will be a long article, which I will try to keep concise. That is it! Are you ready to learn something new?
Getting into the MAZE !
When the downloaded file is read through pandas, it gives a few basic details.
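A minimal sketch of the loading step. The file name and column names below are assumptions based on the public Fremont Bridge dataset; adjust them to match your actual download. A tiny in-memory sample stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Three hourly rows in the same shape as the public dataset (assumed column names).
sample_csv = io.StringIO(
    "Date,Fremont Bridge East Sidewalk,Fremont Bridge West Sidewalk\n"
    "10/03/2012 12:00:00 AM,4,9\n"
    "10/03/2012 01:00:00 AM,4,6\n"
    "10/03/2012 02:00:00 AM,1,1\n"
)

# For the real download you would point read_csv at the file instead:
# df = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
df = pd.read_csv(sample_csv, index_col='Date', parse_dates=True)

# We are interested in the total across both sidewalks.
df['Total'] = df.sum(axis=1)
print(df.head())
```

With `parse_dates=True` the index becomes a DatetimeIndex, which is what makes the resampling and groupby tricks later in the article possible.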
We can clearly see that the data is captured in hourly buckets. There are two sidewalks; we will be interested in the total count. A simple plot of the data (you can do it yourself) confirms that we have records since 2012. A demand-planning mind would also observe seasonality (which we explore in detail later) and a declining trend beginning in 2020. The decline is on account of Covid, but there are more interesting details hidden. Let us move forward to discover them.
The chart above is based on hourly data. What if we plot weekly sums for the entire horizon? To do so, we have to resample the data and then plot it.
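The resampling step can be sketched like this, assuming an hourly-indexed frame with a 'Total' column like the one loaded earlier. Synthetic data (two weeks of ones) stands in for the real download so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

# Two weeks of hourly data as a stand-in for the real Fremont counts.
idx = pd.date_range('2012-10-03', periods=24 * 14, freq='h')
df = pd.DataFrame({'Total': np.ones(len(idx))}, index=idx)

# Roll the hourly counts up into weekly totals.
weekly = df['Total'].resample('W').sum()
print(weekly)

# For the chart: weekly.plot(title='Weekly bicycle count')  # needs matplotlib
```

`resample('W')` bins by week ending Sunday; `'ME'` (month end) would give the monthly view discussed next.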
As we roll the data up to lower frequencies, we can clearly extract many insights. Looking at monthly data, it appears that despite Covid cases stabilizing, volume is nowhere near where it used to be before Covid! We cannot compare 2022 with 2021.
Will it help if we look at weekday patterns? I mean, how many people on average pass over this bridge on a Monday, a Saturday, and so on? Let us look at two charts (pre-Covid and during Covid).
Very interesting charts. Pre-Covid, weekends saw low traffic compared to working days, suggesting that many office-goers opted for cycling to commute to work (this is my assumption, which we will prove or reject through data). During Covid there is no such stark difference, but three days (Tuesday, Wednesday, and Thursday) have a higher number of cyclists. In the US, many corporates have made it mandatory to work three days from the office, and it appears that salaried workers are opting to be in the office on these three weekdays!
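The weekday comparison can be sketched as below, again assuming the hourly frame with a 'Total' column. The March 2020 split date is my assumption for the Covid boundary, and random synthetic counts keep the snippet runnable on its own.

```python
import numpy as np
import pandas as pd

# Synthetic hourly counts spanning the pre-Covid / during-Covid boundary.
rng = np.random.default_rng(0)
idx = pd.date_range('2019-01-01', '2021-12-31 23:00', freq='h')
df = pd.DataFrame({'Total': rng.poisson(40, size=len(idx))}, index=idx)

daily = df['Total'].resample('D').sum()   # daily totals first
pre   = daily[:'2020-02-29']              # pre-Covid slice (assumed cut-off)
covid = daily['2020-03-01':]              # during-Covid slice

day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
pre_avg   = pre.groupby(pre.index.dayofweek).mean().rename(index=dict(enumerate(day_names)))
covid_avg = covid.groupby(covid.index.dayofweek).mean().rename(index=dict(enumerate(day_names)))
print(pre_avg)
print(covid_avg)
```

On the real data, `pre_avg` is where the weekday-over-weekend gap shows up, and `covid_avg` is where the Tue/Wed/Thu bump appears.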
How about the hourly split (again, pre- and during Covid)?
The hourly pattern has not changed significantly! Many people still pass over the bridge during the early rush hours and in the evening around 5 pm. It is fair to assume that because of WFH and the three mandated office days, we see a slightly different trend during Covid. As a city dweller, if I need to avoid traffic congestion, the hourly plot can be my guide.
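The hourly profile is one groupby on the hour of day, sketched here on synthetic stand-in data rather than the real download.

```python
import numpy as np
import pandas as pd

# One year of synthetic hourly counts.
rng = np.random.default_rng(1)
idx = pd.date_range('2019-01-01', periods=24 * 365, freq='h')
df = pd.DataFrame({'Total': rng.poisson(40, size=len(idx))}, index=idx)

# Average count for each hour of the day (0..23).
hourly_profile = df['Total'].groupby(df.index.hour).mean()
print(hourly_profile)
# On the real data this is where the morning and ~5 pm commute peaks appear.
```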
Is there any monthly pattern? We know from the charts above that there is, but it is not obvious.
People cycle more often in summer. You can try the pre- and during-Covid graphs yourself.
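The monthly seasonality check follows the same pattern, grouping by calendar month; synthetic daily data stands in for the real counts.

```python
import numpy as np
import pandas as pd

# Several years of synthetic daily totals.
rng = np.random.default_rng(2)
idx = pd.date_range('2013-01-01', '2019-12-31', freq='D')
df = pd.DataFrame({'Total': rng.poisson(900, size=len(idx))}, index=idx)

# Average daily count for each calendar month (1..12).
monthly_profile = df['Total'].groupby(df.index.month).mean()
print(monthly_profile)
# On the real data, the summer months stand out.
```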
Let us stop here. There is no end to data analytics, and we already have enough insights to embark on the second leg of our long journey. You may be scratching your head and complaining that the objective was to learn ARIMA, so where is it? Understanding the data is key to modeling. Trust me, if we jump straight to ARIMA, it becomes even more challenging to unlock.
Introducing ARIMA...
ARIMA stands for autoregressive integrated moving average. Each of these terms has a hidden meaning, and understanding them is key to understanding the ARIMA model itself.
AR (AutoRegressive) + I (Integrated) + MA (Moving Average) = ARIMA
Before we get into the details of AR and MA models, it is important to say at the very outset that ARIMA is a linear model. If you studied math in high school, a linear model should be familiar; if not, let us quickly recap. The ticket price of a flight depends on many factors, like route (R), season (S), weekday (D), flight operator (F), class (C), duration of journey (H), etc. We can safely assume that the price at a given time t is a linear function of these variables plus some error term (normally distributed).
Price(t) = β + α1*R + α2*S + α3*D + α4*F + ... + e
A linear model may not match the true function behind the distribution of price data, but these are very powerful models. By transforming features (scikit-learn's PolynomialFeatures), we can even create non-linear decision boundaries. Anyway, feature transformation is a topic that deserves its own discussion; let us get back to ARIMA.
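The PolynomialFeatures idea can be sketched in a few lines: the same linear model, fit once on raw features and once on polynomial-expanded features. The data here is synthetic (y = x² plus noise), purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data: a straight line cannot fit it, but a model that is
# linear in its parameters can, once the features are expanded.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

linear = LinearRegression().fit(x, y)                  # raw feature: poor fit
x_poly = PolynomialFeatures(degree=2).fit_transform(x)  # adds 1, x, x^2 columns
poly = LinearRegression().fit(x_poly, y)               # still linear in params
print(linear.score(x, y), poly.score(x_poly, y))       # R^2 before vs after
```

The second model is still a linear model in the sense that matters here, which is exactly the sense in which ARIMA is linear too.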
AR model: it anticipates the series' dependence on its own past values. What does that mean? AR models regress on actual past values. Allow me to put down a mathematical expression, and I promise it is a very simple one. For a first-order AR model (AR(p=1)), the formula looks like this:
y(t) = β0 + β1*y(t−1) + ε(t)
y(predicted) = E(y(t))   # E stands for expected value
y(predicted) = β0 + β1*y(t−1)
If the AR model considers just the previous observation (lag 1), it is of order 1 (denoted universally by p). Dependence on the last two previous values would mean the order is 2 (p=2).
y(t) = β0 + β1*y(t−1) + β2*y(t−2) + ε(t)
The formula above describes a second-order AR (p=2) model. Similarly, a third-order model considers the last three values, and so on. It is time to see this in action.
Step 1: Creating sample data (AR model with p=1)
Basically, the formula below is implemented with NumPy.
y(t) = β0 + β1*y(t−1) + ε(t)
We have generated 1,000 data points where β1 is 0.7. We have also introduced some random noise. Looking at the graph, can we say it is actually not a random walk but rather an AR model with p=1?
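The data-generating step can be sketched in NumPy as follows: 1,000 points from y(t) = β0 + β1*y(t−1) + ε, with β1 = 0.7 as in the article (β0 = 0 and unit-variance Gaussian noise are my choices for the sketch).

```python
import numpy as np

# Simulate an AR(1) process: each value is 0.7 times the previous one
# plus Gaussian noise.
rng = np.random.default_rng(42)
n, beta0, beta1 = 1000, 0.0, 0.7
y = np.zeros(n)
for t in range(1, n):
    y[t] = beta0 + beta1 * y[t - 1] + rng.normal(0, 1)

print(y[:5])
# Because beta1 < 1 the series keeps pulling back toward its mean, i.e. it
# is stationary; a random walk would be the boundary case beta1 = 1.
```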
Step 2: Given this distribution of data, and the fact that it was generated by an AR model with p=1, let us use the statsmodels ARIMA model to find the value of β1.
Wow! What is this? Let us break it down so that you understand the relevant details at this point; we revisit some of these terms later in the article. When we apply an ARIMA model, we have to provide the values of p (the order of the AR model, which you should understand by now), d (the number of times the series should be differenced to make it stationary), and q (the order of the MA model, to be explained later). This assumes there is no seasonality; if the time series is seasonal, we also need to fill in the seasonal orders P, D, Q, and s (also to be explained later). Since we generated the data, we know it is an AR model with p=1.
The summary says Model: ARIMA(1,0,0). Nothing surprising, as we input it. Next, look at the value of β1 (ar.L1 = 0.69). Eureka! The model has found almost the exact value of β1. Log-likelihood, AIC, and BIC are important figures which we will learn to read later. As of now, the message is: the model is able to find the parameter values of the distribution function that we created ourselves.
In the real world, we would not know the value of p in advance, so we need a way to find it. One of the tools available is the PACF (Partial Auto-Correlation Function). What exactly the PACF is, and further details, are beyond the scope of this article; please refer to online sources.
When plotted, it looks like the chart below. The area highlighted in blue tells us that values falling within this range are not statistically significant partial correlations. The x-axis shows lags, whereas the y-axis represents the PACF value. We can easily notice that only the lag-0 and lag-1 values are significant, which means this time series was generated by a p=1 AR model! Once we know that, we can go back to the ARIMA model and update the order values.
To summarize what we learned here: what AR models are, and, given time-series data, how to find the order p (by using the PACF). The PACF may not always work, but it definitely points us in the right direction. We will apply the PACF to find the order p for the Fremont cycling time-series data. If you want, you can generate a second-order AR model yourself and check whether ARIMA works, and whether the PACF helps find the order p. In the next update we talk about the MA model. It is not the moving average that you may be familiar with...