3 Things Investors Need to Know About Building Forecast Models

Introduction

Investors make asset allocation decisions based on a constantly evolving outlook of the future. For many large institutions, this process often starts with formulating projections of key macroeconomic indicators, such as GDP growth, unemployment, and inflation, for the market of interest, which in turn drive expectations for asset class returns. To the extent that an investor believes their own forecasts can consistently beat the market consensus (a big if, admittedly), it makes sense to invest resources into building better forecast models.

Since accelerating inflation has been a dominant investment theme this year, I thought it would be interesting to build a model to forecast month-on-month (MoM) US CPI inflation using only high-level macroeconomic and financial indicators. Using over 40 years of monthly data spanning January 1981 - July 2022, I compare the forecast accuracy of traditional econometric models (AR, SARIMA & Linear Regression with SARIMA Errors) against more recently popularized regularization & machine learning models (Ridge, LASSO, Elastic Net & Random Forests).

For anyone interested in getting started with time series forecasting, with a focus on investment-related applications, I use this article to document three reflections I've come to appreciate through working on this inflation model. TLDR; when building forecast models, I learned that it's important to:

  • Remove look-ahead bias - with time series data, make proper adjustments to the dataset to reflect any lags in data releases. Otherwise, you are wrongly letting your model "peep" into the future, which overstates its accuracy.
  • Minimize overfitting (correctly) - it's important to separate your dataset into training/testing sets when building/evaluating your model to minimize overfitting. However, the standard cross validation methods common in many machine learning problems don't work well on time series data, because the sequence of observations and time-dependencies matter. An expanding window approach to validation addresses this.
  • Don't overlook simpler models - depending on the problem and the data available, simpler models can sometimes outperform complex models while offering better interpretability, which can be important for communicating with your stakeholders.

TLDR2; if you're in a hurry - using the change in MoM growth in US CPI as the target variable, the best-performing model was a Linear Regression with SARIMA errors (2, 0, 1)(0, 0, 1). Here's a graph visualizing the out-of-sample forecast performance of the model using an expanding window approach from 2001-2022. The dark blue line represents the predictions, and the dotted light blue line represents the actual observed values.

[Figure: Out-of-sample predictions (dark blue) vs. actual observed values (dotted light blue), 2001-2022]

Some implementation details

The target variable is the MoM % change in US CPI, gathered from the U.S. Bureau of Labor Statistics. I build 3 "traditional" econometric models - AR(1), SARIMA, and Linear Regression with SARIMA errors - as well as 4 regularization/machine learning models - Ridge, LASSO, Elastic Net and Random Forest regressions.

We start with a full list of 15 exogenous feature variables, summarized in Table 1. For the Linear Regression, I manually select variables that are statistically significant at the 10% level. For all regularization/machine learning models, I use all 15 variables. All results shown are post hyperparameter tuning.

[Table 1: The 15 exogenous feature variables]
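
To make the model setup concrete, here is a minimal sketch of how a Linear Regression with SARIMA errors can be fit in Python with statsmodels' SARIMAX, which supports exogenous regressors combined with ARMA error dynamics. The names y, X, and X_next, and the seasonal period s=12, are my assumptions for illustration, not the article's exact code.

```python
# Sketch: Linear Regression with SARIMA errors via statsmodels' SARIMAX.
# y: pandas Series of MoM % change in US CPI (assumed)
# X: DataFrame of the manually selected exogenous features (assumed)
import statsmodels.api as sm

model = sm.tsa.SARIMAX(
    y,
    exog=X,                       # exogenous regressors (the "Linear Regression" part)
    order=(2, 0, 1),              # non-seasonal (p, d, q) reported in the article
    seasonal_order=(0, 0, 1, 12), # seasonal (P, D, Q, s); s=12 assumed for monthly data
)
fit = model.fit(disp=False)
print(fit.summary())              # coefficient estimates, p-values, information criteria

# A one-step-ahead forecast needs next month's exogenous values (X_next, hypothetical):
forecast = fit.forecast(steps=1, exog=X_next)
```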

Make adjustments to your dataset to account for lags in data releases to avoid look-ahead bias

Look-ahead bias is the error that arises when a model is trained on data that would not actually have been available at that point in time. To make this concrete, let's use our inflation forecast setup to illustrate the problem.

Suppose I wanted to predict US Consumer Price Index (CPI) MoM inflation using US Producer Price Index (PPI) MoM inflation, because I believe there is a strong correlation between the two. When I extract the data from an external source (in this case Bloomberg), I receive a nice table with the CPI and PPI readings by observation month. The row for June 2022 contains both the CPI and PPI readings for the month of June. Easy, right?

Not quite. Digging deeper, we find that the June CPI figure is released with a 10-day lag on 10th July, while the June PPI figure is released with a longer 14-day lag on 14th July. Practically speaking, if the forecast is due to your CIO or your clients by 9th July, you don't actually have access to the June PPI figure when forecasting June's CPI. The correct approach, therefore, is to train a model that explicitly uses the May PPI figure to predict the June CPI figure, i.e., with a one-month lag.

In our case, CPI inflation is more correlated with same-month PPI inflation (0.7) than with one-month-lagged PPI inflation (0.5). In modelling terms, if we fail to adjust for the release dates, we overstate the testing accuracy of the model because we wrongly allowed it to "peep" into the future. The implication is twofold: first, our model doesn't work as well when we deploy it into production because it's been trained on the wrong data; and second, come 9th July we don't actually have the data we need to generate the forecast anyway.

Thus, the takeaway here is that we need to inspect the data release dates and manually make adjustments to account for any lags before building the model. While tedious, this is a critical fix we need to apply to achieve a robust model that retains its forecast performance out of sample. A minimal sketch of the adjustment is shown below.
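
Here is what the lag adjustment can look like in pandas. The DataFrame df and the column names are my assumptions for illustration; the idea is simply to shift each feature so that every row pairs the target with only the data available at forecast time.

```python
# Sketch: shifting PPI by one month to remove look-ahead bias.
# df: monthly DataFrame indexed by observation month, with "CPI" and "PPI"
# columns holding the MoM readings as extracted (assumed).
import pandas as pd

# As extracted, the June row holds both June CPI and June PPI, but June PPI
# isn't published until after the CPI forecast is due. Shift PPI down one
# month so the June row instead holds the May PPI reading.
df["PPI_lag1"] = df["PPI"].shift(1)

# Drop the first row, which has no lagged value, then model CPI on PPI_lag1.
df = df.dropna(subset=["PPI_lag1"])
```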


For time series data, use an expanding window approach to validate the model and prevent overfitting

Overfitting (a.k.a. data mining) occurs when a model is fit so closely to its training data that it captures noise rather than signal, and consequently performs poorly on data it has never seen. This defeats the original purpose of building a forecast model.

The common fix to this problem is to separate the dataset into two buckets - a training set and a testing set - using the former exclusively to train the model and the latter to evaluate forecast accuracy. In typical cross-sectional machine learning problems, this is extended with K-fold cross validation: slice your data into K blocks, use K - 1 blocks to train the model, evaluate it on the remaining block, record the forecast error, rotate the held-out block, and repeat the process K times. Once done, average the K error scores - that's your model's accuracy score.

The problem is that, as investors, we are most often interested in time series data, where the sequence of observations and time-dependencies matter, and that requires a different approach to validating models. Applying K-fold validation is incorrect in the sense that you wouldn't use a model trained on February-to-September 2022 data to predict something that happened earlier, in January 2022.

What we want instead is an expanding window over our training/testing sets. As we go forward in time, we use the cumulative past values to train the model and predict the next observation(s). In each iteration, we expand the training set to include the next observation(s), retrain the model, forecast, calculate the forecast error, and repeat until we exhaust the dataset. Fortunately for us, in Python this is easily implemented using scikit-learn's TimeSeriesSplit, which is compatible with GridSearchCV. A short sketch follows.
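
Here is a minimal sketch of expanding-window validation with scikit-learn. X and y are placeholders for the feature matrix and target, and Ridge is used as a stand-in estimator; the article's actual models and tuning grids aren't reproduced here.

```python
# Sketch: expanding-window validation with TimeSeriesSplit.
# X: 2-D numpy array of features; y: 1-D numpy array of targets (assumed).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

tscv = TimeSeriesSplit(n_splits=5)  # each split trains on a longer prefix of history

for train_idx, test_idx in tscv.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    rmse = np.sqrt(mean_squared_error(y[test_idx], preds))
    print(f"train size = {len(train_idx)}, RMSE = {rmse:.4f}")

# The same splitter plugs straight into GridSearchCV for hyperparameter tuning:
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=tscv,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)
```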

[Figure: Expanding window train/test splits moving forward through time]

By using an expanding window to validate our models, we mimic what actually happens in real life: we only ever have access to past values, the sequence of the data matters, and we need to predict unknown future values. This puts us in a better position to choose the model most likely to retain its forecast accuracy when deployed out of sample.

Don't assume more complex models always perform better

Going into this exercise I was excited to explore recently popularized regularization techniques and machine learning algorithms such as Ridge/LASSO/Elastic Net and Random Forest regressions, fully expecting them to far surpass more traditional time series methods (SARIMA & Linear Regression with SARIMA errors).

Starting from December 2001, stepping forward one month per iteration, I retrain each model a total of 247 times, each time forecasting one-month-ahead US CPI inflation and measuring its accuracy. Using root mean squared error (RMSE) as the forecast accuracy metric, I rank the models by their average RMSE across all testing sets. A sketch of this evaluation loop is shown below.
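
Here is what that expanding-window evaluation loop can look like. The Series y and the fit_and_forecast helper are hypothetical stand-ins; the helper would refit whichever model is being evaluated on the training window and return a one-step-ahead forecast.

```python
# Sketch: expanding-window, one-step-ahead evaluation with RMSE.
# y: pandas Series of MoM CPI inflation indexed by month (assumed).
# fit_and_forecast(train): hypothetical helper that refits the model on the
# training window and returns a single one-month-ahead forecast.
import numpy as np
import pandas as pd

start = y.index.get_loc(pd.Timestamp("2001-12-01"))  # first out-of-sample month

errors = []
for t in range(start, len(y)):
    train = y.iloc[:t]                  # expanding window: all history to date
    forecast = fit_and_forecast(train)  # refit, then predict month t
    errors.append(y.iloc[t] - forecast) # one-step-ahead forecast error

rmse = np.sqrt(np.mean(np.square(errors)))
print(f"Average RMSE over {len(errors)} forecasts: {rmse:.4f}")
```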

Here's where the results surprised me. The best performing model turned out to be the Linear Regression (OLS) model augmented with SARIMA errors. Since 1981, the actual values of MoM US CPI growth in absolute terms averaged 30 basis points (bps, i.e., 0.30%), ranging from -180 to +140 bps. The Linear Regression model had the lowest average forecast error, at 27.1 bps. The regularization models (Elastic Net, Ridge, LASSO) performed worse, with average errors ranging from 28.7 to 29.6 bps. The Random Forest regressor - a very popular machine learning algorithm - produced a higher forecast error of 30.3 bps, only a marginal improvement over a simple AR(1) model.

[Figure: Models ranked by average RMSE across all testing sets]

Full disclosure: I'm not fully certain why these regularization/machine learning techniques didn't perform better, but here are my running hypotheses:

  • In Linear Regression with SARIMA errors, and in the plain SARIMA model, we model the moving average (MA) terms. MA terms take recent prediction errors into account and autocorrect for them in the next prediction. Effectively, this captures any positive or negative momentum effects in the data and learns from them to improve forecast accuracy.
  • Regularization improves model accuracy by penalizing complexity and learning the most important features in large, high-dimensional datasets. In our example, the initial list of only 15 exogenous feature variables was very small to begin with, which made it easy to manually select features that are significant for our Linear Regression model. However, if we had started with a list of 10,000 candidate predictors, perhaps we would have gotten better performance from the LASSO & Elastic Net models, because they automatically select the features that contribute the most to forecast accuracy (see the sketch after this list).
  • Random Forest regression seeks to learn complex interactions and non-linearities in the data by building a large ensemble of regression trees. Again, because our initial list of 15 variables is so small, there may not be enough complexity in the data for the Random Forest algorithm to exploit. The other possibility is that linear approximations may already be sufficient to capture the true underlying inflation dynamics (though I am in no position to confirm this).
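
To illustrate the second hypothesis, here is a minimal sketch of LASSO's automatic feature selection. The DataFrame X and Series y are assumed placeholders for the feature matrix and target.

```python
# Sketch: automatic feature selection with LASSO.
# X: DataFrame of candidate exogenous features; y: target Series (assumed).
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),                         # LASSO penalties assume scaled inputs
    LassoCV(cv=TimeSeriesSplit(n_splits=5)),  # tune alpha on expanding windows
)
pipe.fit(X, y)

lasso = pipe.named_steps["lassocv"]
selected = X.columns[lasso.coef_ != 0]        # features kept (non-zero coefficients)
print(f"LASSO kept {len(selected)} of {X.shape[1]} features:", list(selected))
```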

My takeaway is this - when developing forecast models, don't ignore the simpler and more traditional ones. For cross-sectional data, always start with a simple Linear Regression. For time series data, start with AR(1) and build up to SARIMA/Linear Regression with SARIMA errors if needed. Worst case scenario, you have spent a little more time but now have a reasonable baseline against which to judge how well your complex machine learning algorithm performs. Best case scenario, you end up with a simpler model that outperforms the rest.

This is important because simpler models offer much better interpretability. It's easier to explain the findings from your Linear Regression to your CIO or your clients, pointing out time-dependencies in AR/MA terms, explaining seasonality effects using trend decomposition, and explaining why some predictor variables work and why others don't by discussing the coefficient estimates. That's a whole lot easier than explaining bagging and random feature selection in your Random Forest model.

Speaking of interpretability, what are the variables that help forecast US CPI inflation?

Let's try to draw some conclusions from our Linear Regression with SARIMA errors about which variables help forecast US CPI inflation. Note that in the regression output, the variables have been differenced to achieve stationarity (hence the D_ prefixes). A sketch of that check follows.
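
The stationarity check behind the differencing is typically an Augmented Dickey-Fuller test. Here is a minimal sketch using statsmodels; series is an assumed placeholder for one raw feature.

```python
# Sketch: test a series for stationarity and difference it if needed.
# series: pandas Series holding one raw feature, e.g. inflation expectations (assumed).
from statsmodels.tsa.stattools import adfuller

def is_stationary(s, alpha=0.05):
    """ADF test: a small p-value rejects the unit-root null, i.e. stationary."""
    pvalue = adfuller(s.dropna())[1]
    return pvalue < alpha

if not is_stationary(series):
    series = series.diff().dropna()  # first-difference, as in the D_ variables
```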

[Figure: Regression output - coefficient estimates and p-values]

Factors that predict higher CPI inflation

  • The University of Michigan monthly survey of twelve-month-ahead inflation expectations (D_INF_EXP): when survey participants expect higher inflation in the coming year, it tends to correlate with higher CPI inflation in the month of the survey. Why? It's possible that consumers extrapolate recent purchasing experiences into their expectations of the future. For instance, if my grocery bill went up significantly this month, I'm more likely to extrapolate this experience forward and expect higher inflation when surveyed. For the modeler, this is neat because we can use expectations of the future to forecast near-term inflation.
  • The S&P/Goldman Sachs commodity price index (D_SPGSCI): energy carries a weight of ~10% in the US CPI basket (based on 2019-2020 data). Thus, changes in commodity prices, especially crude oil, directly drive near-term CPI inflation.
  • US non-farm payroll additions (NFP): the monthly non-farm payrolls release is one of the most important data releases for investors because it is perceived as a leading indicator of economic and labor market conditions in the US. Unsurprisingly, months with stronger job growth tend to correlate with higher CPI inflation.
  • The US Treasury 2Y-10Y yield curve spread (D_US2Y10Y): the slope of the US Treasury yield curve is an aggregate reflection of the investment community's expectations for near-to-medium-term inflation, growth and monetary policy. The most famous of these measures is the 2-year/10-year curve spread, which many consider a US recession indicator. Here, we observe that higher spreads (a steepening of the curve) tend to correlate with higher near-term inflation.

Based on the model, higher EU PPI inflation (D_L_EUPPI) and higher US industrial production index growth (L_PPI) are correlated with lower US CPI inflation. The interpretation here is less clear, so I refrain from drawing any conclusions. Nonetheless, the coefficient estimates are statistically very significant, with p-values ≈ 0.00. You can loosely think of the p-value as the probability of a false positive, so I include these variables in the model for their information content in forecasting.

In conclusion

As investors, we may at times need to rely on economic forecasts to guide our asset allocation decisions. Time series data can be difficult to work with, which makes building an accurate and reliable forecast model more challenging than working with cross-sectional data. To maximize the odds of building a model that retains its forecast performance in production, it's important to (1) adjust for any lags in data releases to eliminate look-ahead bias, (2) use an expanding window approach to model validation to accurately measure model performance and reduce overfitting, and (3) start with simpler models, which can sometimes offer both stronger forecast accuracy (depending on the problem and the data) and better interpretability for communicating with stakeholders.


Disclaimers: The opinions expressed in this article are for general informational purposes only and are not intended to provide specific advice or recommendations for any individual or on any specific security or investment product. It is only intended to provide education about the financial industry. The views reflected in the commentary are subject to change at any time without notice, and are the author's own, not those of the author's affiliated organization.
