Modeling and Predicting Demand During Pandemics using Time Series Models
Srihari Jaganathan
1. Motivation:
A key question that practitioners in analytics and forecasting will face in the upcoming months is how to incorporate COVID-19 pandemic information that may have impacted their demand data. To generate accurate forecasts and explain the behavior of the impact, one has to accommodate the underlying dynamics of COVID-19 in the statistical modeling process.
The main motivation for writing this article is fourfold:
- Provide a real world example to demonstrate how one could account for COVID-19 pandemic data in the context of time series forecasting and regression analysis.
- Provide simple tools and techniques such as ARIMA and Exponential Smoothing that one could use to explain the impact of COVID-19.
- Provide demand planners/forecasters, data scientists, and statisticians another tool or approach to make better modeling decisions and predictions.
- Provide all the programs in SAS and R so that the reader can replicate the analysis and apply it in their own work.
I'm planning to write a three-part series. This is Part I; Part II will cover outlier detection and correction on the same data; Part III will deal with communication and storytelling (if time permits). All code for the analysis and for creating the figures has been posted in my GitHub account and is referenced throughout the article. The terms intervention and interruption are used interchangeably throughout the article to indicate events such as pandemics, SARS in our example.
2. Data:
This article uses real-world data derived from this paper. The data consists of monthly tourist demand from Japan to Taiwan between January 2001 and July 2005 (see right). The SARS pandemic hit Taiwan in March 2003 and impacted the country for the three months of April, May, and June 2003. Taiwan was removed from the list of SARS-affected countries in July 2003, and the recovery began.
3. Shape of the Impact:
It is crucial for an analyst to have a hypothesis (an idea) about the shape of the impact and/or recovery; analysis without a firm hypothesis leads into the territory of data dredging. As the figure on the right shows, we can have a permanent impact with a gradual or sudden drop to a new level. You could also have a temporary impact with a slow or sudden recovery back to baseline. This gives you an idea of the type of impact you would expect to see in your data.
You could also have a combination of the two, called a compound effect. See the right figure, where there is a gradual decline after an intervention such as a pandemic, then a plateauing effect, followed by an exponential increase. This type of modeling is called interrupted time series analysis and is covered at great length in the book Interrupted Time Series Analysis. A small sketch of these canonical shapes is shown below.
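To make the shapes concrete, here is a minimal R sketch (not from the original article) that generates permanent, temporary, and compound impact patterns; the series length, intervention period, magnitudes, and decay rate are all illustrative assumptions.

```r
# Canonical intervention shapes on a hypothetical 24-period series.
n  <- 24
t0 <- 9                                    # assumed intervention period
step  <- as.numeric(seq_len(n) >= t0)      # permanent effect: sudden drop to a new level
pulse <- as.numeric(seq_len(n) == t0)      # temporary effect: one-period drop
decay <- as.numeric(stats::filter(pulse, 0.7, method = "recursive"))  # gradual recovery

permanent <- -100 * step                   # sudden, permanent drop
temporary <- -100 * decay                  # sudden drop, gradual recovery
compound  <- -60 * step - 40 * decay       # mix of permanent and temporary effects

matplot(cbind(permanent, temporary, compound), type = "l", lty = 1,
        xlab = "Period", ylab = "Impact on demand",
        main = "Permanent, temporary, and compound intervention shapes")
legend("bottomright", c("permanent", "temporary", "compound"), col = 1:3, lty = 1)
```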
So what would the shape of the impact be for this case study? Looking at the Taiwan tourism chart above, one can observe that demand dropped precipitously for the months of April, May, and June 2003 after the first patient was diagnosed with SARS; then, beginning July 2003, once Taiwan was removed from the list of SARS-affected countries, you can see slow growth, and demand appears to be back at its previous level in late 2004. This can be modeled using a compound effect, where April through June 2003 is a level shift (permanent effect with sudden drop) and, starting July 2003, there is a gradual temporary effect.
4. Time Series Modeling of Demand with ARIMA Linear Transfer Function and Exponential Smoothing:
There are two schools of thought in data analysis:
- Fitting data to a model, commonly called "data fitting".
- Modeling the underlying time series or stochastic process.
In this article we will focus on #2 for many reasons. A time series process can be modeled with two kinds of components:
- Deterministic components, such as explanatory variables and outliers.
- A stochastic component, namely an ARMA structure and a normally distributed error term.
4a. ARIMA Linear Transfer Function
There is a class of time series statistical models called Linear Transfer Functions (LTF) within the ARIMA framework. Extensive treatment of this technique is provided in the book Time Series Analysis: Forecasting and Control. It is a math-heavy book for interested readers, but in this article we will focus on intuition and software implementation.
Before we get into the details: within the LTF framework there is a first-order transfer function, called the Koyck transformation, given by Y(t) = [omega / (1 − delta·B)] · I(t), where B is the backshift operator. The numerator (omega) gives the level of the drop one would observe, and delta in the denominator controls the recovery. This is the most important transfer function model for handling any intervention in time series analysis. A general, comprehensive transfer function model that can account for very complex impact and recovery patterns is given in the Appendix. However, in my view the Koyck transformation is sufficient for 80% of the problems that arise in time series analysis. I(t) can be a pulse or a step intervention.
The above function can be visualized with the following figure. The numerator (omega) here is −100; varying delta shows how soon recovery happens. For example, if delta is 0 there is only a one-period spike down and everything returns to normal immediately; on the other hand, when delta is 1 there is no recovery at all. If delta is 0.9 we see a slow recovery, whereas when delta is 0.25 the recovery is very quick. The R code to generate the figure on the right can be accessed here; a minimal sketch is also included below.
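As a companion to the article's linked code, here is a minimal R sketch of how such a figure can be produced. It assumes a pulse intervention at period 1 and omega = −100, and it uses R's recursive filter to stand in for the Koyck recursion (the same workaround the article itself uses later).

```r
# Koyck (first-order) transfer function response to a pulse: omega / (1 - delta*B).
koyck_response <- function(omega, delta, n = 24) {
  pulse <- as.numeric(seq_len(n) == 1)  # intervention hits at period 1
  omega * as.numeric(stats::filter(pulse, delta, method = "recursive"))
}

deltas <- c(0, 0.25, 0.9, 1)            # recovery speeds to compare
responses <- sapply(deltas, function(d) koyck_response(omega = -100, delta = d))
matplot(responses, type = "l", lty = 1,
        xlab = "Periods after intervention", ylab = "Impact",
        main = "Koyck responses, omega = -100")
legend("bottomright", legend = paste("delta =", deltas), col = 1:4, lty = 1)
```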
Coming back to our problem, we can model Taiwan tourism demand accounting for the compound intervention effect due to SARS as a level shift plus a temporary impact with a gradual recovery. It is a remarkable approach in that, with just three parameters, you can capture the lag effects of the SARS interruption parsimoniously. With regression you might need 15 or more dummy-coded variables to account for the recovery effect, which the Koyck transformation captures with just two parameters (omega and delta)!
The modeling approach for LTF/ARIMA is a three-step process:
- Identify the time series process before the intervention (SARS) impact, in this case the data before April 2003. The process that explained this series well is an AR(1) with a constant level and no trend or seasonality.
- Add the intervention effects at the appropriate time periods, in this case a level shift and a temporary change.
- Check the residuals to ensure they are IID, normally distributed, and free of outliers.
A natural question is how one would estimate the above transfer function model. In my personal experience, only commercial software such as #SAS has the flexibility and robustness to fit the above equation; see the output from the SAS program on the right. I have recently come across an R package that is very flexible with transfer functions; maybe in the future I'll update my GitHub repository with an R implementation (a hedged sketch is shown below). You can access the SAS code here in my GitHub. Plugging the estimated values into the above equation gives the fitted model.
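For readers who prefer R, here is a hedged sketch using the TSA package's arimax() function, which supports rational transfer functions. The variable name taiwan.ts is taken from the article's own filter() call; the period indices (28 = April 2003, 31 = July 2003, with January 2001 as period 1) and the exact intervention coding are my assumptions, not the article's SAS specification.

```r
library(TSA)  # provides arimax() with transfer-function (xtransf/transfer) support

# Assumed intervention indicators (Jan 2001 = period 1):
n  <- length(taiwan.ts)
ls <- as.numeric(seq_len(n) >= 28)   # level shift starting April 2003
tc <- as.numeric(seq_len(n) == 31)   # pulse at July 2003, start of recovery

# AR(1) noise model (identified on the pre-SARS data) plus two transfer terms:
# a pure shift on ls (orders c(0, 0)) and a Koyck term on tc (orders c(1, 0),
# i.e., omega / (1 - delta*B)).
fit <- arimax(taiwan.ts, order = c(1, 0, 0),
              xtransf = data.frame(ls = ls, tc = tc),
              transfer = list(c(0, 0), c(1, 0)))
fit  # prints the estimated AR coefficient, omega terms, and delta
```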
Any good modeling process should include a holdout dataset to test the accuracy of the methods employed. We decided to hold out the last 8 periods. Predictions were made for those 8 periods using the LTF/ARIMA approach presented. Visual inspection shows the predictions are reasonably accurate, with all data lying within the 95% prediction interval. Also included in the chart below is a visualization of the compound impact. What we can see is that Taiwan permanently lost ~450K tourists from Japan due to SARS between April 2003 and the end of Q4 2004. The R code used to generate the graphics above can be accessed here.
4b. Modeling using Exponential Smoothing:
Exponential smoothing is considered a workhorse of time series analysis. Until very recently there was no way to incorporate regressors in the exponential smoothing framework. Thanks to Ivan Svetunkov's smooth package in R, we can now incorporate regressors.
We can easily incorporate a level shift, but how do we estimate delta in the Koyck transformation?
# Koyck regressor: pulse at period 31 (July 2003), decayed recursively by delta[i]
tc <- stats::filter(as.numeric(seq_along(taiwan.ts) == 31), filter = delta[i], method = "recursive", sides = 1)
R has a native function, filter, that can be used to construct the Koyck transformation. This is a workaround: there is no implementation of an LTF/ARIMA-like approach for estimating transfer functions within the exponential smoothing framework. A regressor was created for a range of delta values (0.60 to 0.99) using the filter function and supplied to the exponential smoothing model, and the Akaike Information Criterion (AIC) was stored for each delta. See the chart above comparing AIC across delta values. Delta was chosen based on the lowest AIC (the best model is the one with the lowest AIC), in this case 0.93 (vs. 0.88 in LTF/ARIMA). This approach lets us use exponential smoothing for intervention analysis and gives the analyst an additional method in their toolkit. One caveat is that it becomes practically intractable and computationally expensive when there are multiple transfer functions and numerous delta values have to be estimated simultaneously. However, it worked really well here since we had a single delta value, and in practice this is often sufficient. A sketch of the grid search is shown below.
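Putting the pieces together, here is a minimal sketch of the grid search. It assumes taiwan.ts and the period indices used earlier, and it assumes a smooth version in which es() accepts an xreg matrix for ETSX models (newer releases expose regressors through adam() instead); treat it as illustrative, not as the article's exact script.

```r
library(smooth)  # Ivan Svetunkov's package; es() fits ETS/ETSX models

n      <- length(taiwan.ts)
ls     <- as.numeric(seq_len(n) >= 28)   # assumed level-shift regressor (April 2003)
deltas <- seq(0.60, 0.99, by = 0.01)     # candidate Koyck decay values
aics   <- numeric(length(deltas))

for (i in seq_along(deltas)) {
  # Koyck regressor: pulse at July 2003 decayed recursively by the candidate delta
  tc <- as.numeric(stats::filter(as.numeric(seq_len(n) == 31),
                                 filter = deltas[i], method = "recursive"))
  fit <- es(taiwan.ts, model = "ANN", xreg = cbind(ls = ls, tc = tc))
  aics[i] <- AIC(fit)
}

best_delta <- deltas[which.min(aics)]  # the article reports 0.93
```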
The figure on the right shows fitted vs. actual data using exponential smoothing from the R smooth package. The model chosen here is ETSX(ANN) with a level shift of −48,810.46, a temporary effect of −37,762.17, and a delta of 0.93. It also shows the predicted values against the holdout dataset, similar to LTF. Visual inspection shows very good accuracy of the predictions compared to the holdout data. In addition, the chart below shows the impact of SARS on tourist demand: Taiwan permanently lost ~590K tourists from Japan due to SARS. The R code to generate the models and charts can be accessed in my GitHub account here.
4c. Model Selection:
So which model should we choose or believe? Model selection is a tough subject: you could select a model based on holdout performance, or the model that can be most easily implemented. Alternatively, you could combine (ensemble) forecasts from diverse methods, which has been shown empirically to be very effective. The table above shows the mean absolute error on the holdout for LTF/ARIMA, exponential smoothing, and the combination of the two. As you can see, for prediction purposes combining (ensembling) has a clear advantage, with significantly lower error than the individual models. However, when explaining the results in a business context, you could pick either model, since both have similar performance. A minimal sketch of the combination is shown below.
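As a small illustration of the combination step, here is a sketch with hypothetical objects: fc_ltf and fc_es stand for the two models' 8-period holdout forecasts and holdout_actuals for the held-out observations. None of these names come from the article's code.

```r
# Simple equal-weight ensemble of the two holdout forecasts, scored by MAE.
fc_combined <- (fc_ltf + fc_es) / 2
mae <- function(pred, actual) mean(abs(actual - pred))
sapply(list(LTF = fc_ltf, ETSX = fc_es, Combined = fc_combined),
       mae, actual = holdout_actuals)
```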
5. Conclusions
In this article, I shared techniques and methods for modeling pandemic data in the context of demand forecasting, both to explain the impact and to build accurate predictive models. The Linear Transfer Function/ARIMA framework provides the best way to model an interruption in the data due to pandemics, accounting for complex relationships in a parsimonious way. Exponential smoothing is another method for modeling pandemics, provided the impact being modeled is simple. When possible, we can combine models for better accuracy, although this diminishes the explainability of the impact. I hope this article motivates the reader to apply these methods in their day-to-day work to model the impact of pandemics in a time series context. As noted earlier, all code for the analysis has been posted in my GitHub account and is referenced throughout the article. Let me know in the comments below if you have questions or feedback.
6. References
Modeling and learning linear transfer functions within the ARIMA context is not trivial. However, once mastered, it provides the greatest flexibility of any time series approach for modeling events in time series data. I learned most of what I know about linear transfer functions from three books: (1) is great, with very good references; (2) is dedicated to interrupted time series modeling; (3) goes in depth on modeling with #SAS software. In addition, (4), the seminal work by Box and Tiao that started intervention modeling, is important reading for everyone, and (5) explains the difference between ARIMA, ARIMAX, and transfer function models.
- Forecasting Methods and Applications by Makridakis, Wheelwright, and Hyndman
- Interrupted Time Series Analysis by McDowall, McCleary and Bartos
- Forecasting Using SAS® Software: A Programming Approach
- Intervention Analysis with Applications to Economic and Environmental Problems by G. E. P. Box and G. C. Tiao
- The ARIMAX Model Muddle by Rob Hyndman
7. What's Next?
In the next article I'll cover automatic outlier detection to diagnose and model outlying observations, building even better models that explain the underlying time series process. We expect such outliers to become more prevalent due to the pandemic.
All views expressed here are mine and do not necessarily reflect the views of my employer.