3 Things Investors Need to Know About Building Forecast Models
Scott Lee, CFA
Investment Strategist @ Central Bank of Malaysia | University of Cambridge | UCL
Introduction
Investors make asset allocation decisions based on a constantly evolving outlook of the future. For many large institutions, this process often starts with formulating projections of key macroeconomic indicators, such as GDP growth, unemployment, and inflation for the market of interest which, in turn, drive expectations for asset class returns. To the extent that an investor believes their own forecasts can consistently beat the market consensus (big if, fair enough), it makes sense to invest resources into building better forecast models.
Since accelerating inflation has been a dominant investment theme this year, I thought it would be interesting to build a model to forecast month-on-month (MoM) US CPI inflation using only high-level macroeconomic and financial indicators. Using over 40 years of monthly data spanning January 1981 to July 2022, I compare the forecast accuracy of traditional econometric models (AR, SARIMA & Linear Regression with SARIMA Errors) against more recently popularized regularization & machine learning models (Ridge, LASSO, Elastic Net & Random Forests).
For anyone interested in getting started with time series forecasting with a focus on investment-related applications, I use this article to document three reflections I've come to appreciate through working on this inflation model. TLDR; when building forecast models, I learned that it's important to: (1) adjust your dataset for lags in data releases to avoid look-ahead bias, (2) use an expanding window approach to validate the model and prevent overfitting, and (3) not assume more complex models always perform better.
TLDR2; If you're in a hurry - using the change in MoM Growth in US CPI as the target variable, the best performing model was a Linear Regression with SARIMA errors (2, 0, 1)(0, 0, 1). Here's a graph visualizing the out of sample forecast performance of the model using an expanding window approach from 2001-2022. The dark blue line represents the predictions, and the dotted light blue line represents the actual observed values.
Some implementation details
The target variable is the MoM % change in US CPI gathered from the U.S. Bureau of Labor Statistics. I build 3 "traditional" econometric models - AR(1), SARIMA, and Linear Regression with SARIMA errors - as well as 4 regularization/machine learning models - Ridge, LASSO, Elastic Net and Random Forests regressions.
We start with a full list of 15 exogenous feature variables summarized in Table 1. For Linear Regression, I manually select variables that are statistically significant at the 10% level. For all regularization/machine learning models, I use all 15 variables. All results shown are post hyperparameter tuning.
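As a rough illustration of that significance screen, here's a minimal sketch using statsmodels. The stand-in data, feature names, and exact selection procedure are my assumptions, not the article's actual code or dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative stand-in data for the 15 candidate features (the real
# feature names and values are in the article's Table 1, not shown here).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 15)),
                 columns=[f"x{i}" for i in range(15)])
y = 0.5 * X["x0"] + rng.normal(size=500)

# Fit OLS on all candidates, then keep those significant at the 10% level.
ols = sm.OLS(y, sm.add_constant(X)).fit()
pvals = ols.pvalues.drop("const")
print(pvals[pvals < 0.10])
```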
Make adjustments to your dataset to account for lags in data releases to avoid look-ahead bias
Look-ahead bias is the error that occurs when a model is trained on data that, in reality, would not have been available at that point in time. To crystallize this, let's use our inflation forecasting setup to illustrate the problem.
Suppose I wanted to predict US Consumer Price Index (CPI) MoM inflation using US Producer Price Index (PPI) MoM inflation because I believe there is a strong correlation between the two. When I extract the data from an external source (in this case Bloomberg), I receive a nice table with the CPI and PPI readings by observation month. The row for June 2022 will contain both the CPI and PPI readings for the month of June. Easy, right?
Not quite. Digging deeper, we find that the June CPI figure is released with a 10-day lag on 10th July, while the June PPI figure is released with a longer 14-day lag on 14th July. Practically speaking, the latest you'd send the forecast to your CIO or your clients is 9th July, so you don't actually have access to the June PPI figure when forecasting June's CPI. The correct approach, therefore, is to train a model that explicitly uses the May PPI figure to predict the June CPI figure, i.e., with a one-month lag.
In our case, CPI inflation is more correlated with same-month PPI inflation (0.7) than with one-month-lagged PPI inflation (0.5). In modelling terms, if we fail to adjust for the release dates, we overstate the testing accuracy of the model because we wrongly allowed the model to "peep" into the future. The implication is twofold: first, our model doesn't work as well when we deploy it into production because it's been trained on the wrong data; and second, come 9th July we don't actually have the data we need to generate the forecast anyway.
Thus, the takeaway here is that we need to inspect the data release dates and manually make adjustments to account for any lags before building the model. While tedious, this is a critical fix we need to apply to achieve a robust model that retains its forecast performance out of sample.
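For concreteness, here's a minimal pandas sketch of that adjustment. The file and column names are placeholders of my own, not the article's actual dataset:

```python
import pandas as pd

# Placeholder dataset: monthly rows indexed by observation month, with
# columns for CPI and PPI MoM readings (names are illustrative).
df = pd.read_csv("macro_data.csv", index_col="date", parse_dates=True)

# PPI for month t is published after the CPI forecast deadline, so shift
# it down one row: the June row now carries May's PPI reading.
df["ppi_mom_lag1"] = df["ppi_mom"].shift(1)

# Train only on features that would have been available on the forecast date.
df = df.dropna(subset=["ppi_mom_lag1"])
```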
For time series data, use an expanding window approach to validate the model and prevent overfitting
Overfitting (sometimes called data mining in finance) occurs when a model is trained to fit the training data too closely, noise and all, so that it performs poorly on data it has never seen. This defeats the original purpose of building a forecast model.
The common fix to this problem is to separate the dataset into two buckets - a training set and a testing set - the former to exclusively train the model, the latter to evaluate forecast accuracy. In common cross-sectional machine learning models, this is extended with K-fold cross validation: slice your data into K blocks, train the model on K-1 blocks, evaluate it by testing on the remaining block, record the forecast error, and repeat K times so that each block serves as the test set exactly once. The average error across the K folds is your model's accuracy score.
The problem is that, as investors, we are often most interested in time series data, where the sequence of observations and time dependencies matter, and that requires a different approach to validating models. Applying K-fold validation here is incorrect in the sense that you wouldn't use a model trained on February-to-September 2022 data to predict something that happened earlier, in January 2022.
What we want instead is an expanding window over our training/testing sets. As we go forward in time, we use the cumulative past values to train the model and predict the next observation(s). In each iteration, we expand the training set to include the next observation(s), retrain the model, forecast the following observation(s), calculate the forecast error, and repeat until we exhaust our dataset. Fortunately for us, in Python this is easily implemented using scikit-learn's TimeSeriesSplit and is compatible with GridSearchCV.
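Here's a minimal sketch of that setup; the Ridge model, parameter grid, and stand-in data are just my illustrations to show the mechanics:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Stand-in data, ordered oldest to newest (time order is what matters here).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))
y = rng.normal(size=500)

# TimeSeriesSplit produces expanding training windows by default: each fold
# trains on all observations up to a cut-off and tests on the block after it.
tscv = TimeSeriesSplit(n_splits=5)

# The same splitter plugs straight into GridSearchCV via the `cv` argument.
search = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.1, 1.0, 10.0]},
                      cv=tscv,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```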
Using an expanding window to validate our models mimics what actually happens in real life, where we only ever have access to past values, know that sequence matters in the data, and need to predict unknown future values. This puts us in a better position to choose the model most likely to retain its forecast accuracy when deployed out of sample.
Don't assume more complex models always perform better
Going into this exercise I was excited to explore recently popularized regularization techniques and machine learning algorithms such as Ridge/LASSO/Elastic Net and Random Forest regressions, fully expecting them to far surpass more traditional time series methods (SARIMA & Linear Regression with SARIMA errors).
Starting from December 2001, stepping forward one month per iteration, I retrain each model a total of 247 times, each time forecasting one-month-ahead US CPI inflation and measuring its accuracy. Using root mean squared error (RMSE) as the forecast accuracy metric, I rank the models by their average RMSE across all testing sets.
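A stripped-down version of that backtest loop might look like the sketch below (one-step-ahead, refitting at every step). The function and its interface are my own illustration; `model` is any scikit-learn-style regressor:

```python
import numpy as np

def expanding_window_rmse(model, X, y, start):
    """One-step-ahead expanding-window backtest (illustrative sketch).
    `start` is the index of the first out-of-sample observation."""
    sq_errors = []
    for t in range(start, len(y)):
        model.fit(X[:t], y[:t])               # retrain on everything up to t-1
        pred = model.predict(X[t:t + 1])[0]   # forecast observation t
        sq_errors.append((pred - y[t]) ** 2)
    return np.sqrt(np.mean(sq_errors))

# Usage, e.g.: expanding_window_rmse(Ridge(), X, y, start=250)
```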
Here's where the results surprised me. The best performing model turned out to be a Linear Regression (OLS) model augmented with SARIMA errors. Since 1981, MoM US CPI growth has averaged 30 basis points (bps, i.e., 0.30%) in absolute terms, ranging from -180 to +140 bps. The Linear Regression model had the lowest average forecast error of 27.1 bps. The regularization models (Elastic Net, Ridge, LASSO) performed worse, with average errors ranging from 28.7 to 29.6 bps. The Random Forest regressor - a very popular machine learning algorithm - produced a higher forecast error of 30.3 bps, only a marginal improvement over a simple AR(1) model.
Full disclosure; I'm not fully certain why these regularization/machine learning techniques didn't perform better, but here are my running hypotheses:
My takeaway is this - when developing forecast models, don't ignore the simpler and more traditional ones. For cross-sectional data, always start with a simple Linear Regression. For time series data, start with AR(1) and build up to SARIMA/Linear Regression with SARIMA errors if needed. Worst case, you've spent a little more time but now have a reasonable baseline against which to judge how well your complex machine learning algorithm performs. Best case, you end up with a simpler model that outperforms the rest.
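In Python, that progression is straightforward with statsmodels: passing exogenous regressors to SARIMAX is one way to fit a linear regression with SARIMA errors. A minimal sketch, where the data is synthetic and the seasonal period of 12 is my assumption for monthly data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Stand-in monthly series; y is the target, X the exogenous regressors.
rng = np.random.default_rng(0)
idx = pd.date_range("1981-01-31", periods=400, freq="M")
y = pd.Series(rng.normal(size=400), index=idx)
X = pd.DataFrame(rng.normal(size=(400, 3)), index=idx,
                 columns=["f1", "f2", "f3"])

# Baseline: AR(1).
ar1 = SARIMAX(y, order=(1, 0, 0)).fit(disp=False)

# Linear regression with SARIMA(2,0,1)(0,0,1,12) errors: the exog regressors
# enter as a linear term while the residuals follow the SARIMA process.
reg = SARIMAX(y, exog=X, order=(2, 0, 1),
              seasonal_order=(0, 0, 1, 12)).fit(disp=False)
print(reg.summary())
```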
This is important because simpler models offer much better interpretability. It's easier to explain the findings from your Linear Regression to your CIO or your clients, pointing out time-dependencies in AR/MA terms, explaining seasonality effects using trend decomposition, and explaining why some predictor variables work and why others don't by discussing the coefficient estimates. That's a whole lot easier than explaining bagging and random feature selection in your Random Forest model.
Speaking of interpretability, what are the variables that help forecast US CPI inflation?
Now, let's try to draw some conclusions from our Linear Regression with SARIMA errors on which variables help forecast US CPI inflation. Note that in the regression output the variables have been differenced to achieve stationarity.
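As an illustration of that preprocessing step, here's one common way to difference a series and check stationarity with an Augmented Dickey-Fuller test. The series here is synthetic, not the article's data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic random-walk level series (non-stationary by construction).
rng = np.random.default_rng(0)
level = pd.Series(np.cumsum(rng.normal(size=400)))

# First-difference and test: a small p-value rejects the unit root,
# i.e. the differenced series looks stationary.
diffed = level.diff().dropna()
adf_stat, p_value = adfuller(diffed)[:2]
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.3f}")
```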
Factors that predict lower CPI inflation
Based on the model, higher EU PPI inflation (D_L_EUPPI) and higher US industrial production index growth (L_PPI) are correlated with lower US CPI inflation. The interpretation here is less clear, so I refrain from drawing any conclusions. Nonetheless, the coefficient estimates are highly statistically significant with a p-value ≈ 0.00. You can think of the p-value as the false-positive rate, so I include these variables in the model for their information content in forecasting.
In conclusion
As investors, we may at times need to rely on economic forecasts to guide our asset allocation decisions. Time series data can be difficult to work with, which makes building an accurate and reliable forecast model more challenging than when working with cross-sectional data. To maximize the odds of building a model that retains its forecast performance in production, it's important to (1) adjust for any lags in data releases to eliminate look-ahead bias, (2) use an expanding window approach to model validation to accurately measure model performance and reduce overfitting, and (3) start with simpler models, which can sometimes offer both stronger forecast accuracy (depending on the problem and the data) and better interpretability to communicate easily with stakeholders.
Disclaimers: The opinions expressed in this article are for general informational purposes only and are not intended to provide specific advice or recommendations for any individual or on any specific security or investment product. It is intended only to provide education about the financial industry. The views reflected in the commentary are subject to change at any time without notice, are the author's own, and do not represent those of the author's affiliated organization.