Limits of linear models for forecasting

Blaine Bateman, President, EAF LLC, October 20, 2017

In this post, I will demonstrate the use of nonlinear models for time series analysis and contrast them with linear models. I will use a (simulated) noisy and nonlinear time series of sales data, fit both a multiple linear regression and a small neural network to training data, and then predict 90 days forward. I implemented all of this in R, although it could be done in a number of coding environments. (Specifically, I used R 3.4.2 in RStudio 1.1.183 on Windows 10.)

It is worth noting that much of what is presented in the literature and trade media regarding neural networks concerns classification problems. Classification means there is a finite number of correct answers given a set of inputs. In image recognition, an application well served by neural networks, classification might mean dog/not dog. That is a simplistic example; such methods can predict a very large number of classes, such as reading addresses on mail with machine vision and automatically sorting it for delivery. In this post, I am exploring models that produce continuous outputs instead of a finite number of discrete outputs. Neural networks and other methods are very applicable to continuous prediction as well.

Another point to note is that there are many empirical methods available for time series analysis. For example, ARIMA (autoregressive integrated moving average) and related methods use a combination of time-lagged data to predict the future; often these approaches are used for relatively short-term prediction. In this post, I want to use business knowledge and data in a model to predict future sales. My view is that such models are more likely to behave well over time, and can be adapted for business changes by adding or removing factors deemed newly important or found unimportant.

The linear regression method is available in base R via the lm() function. For the neural network, I used the RPROP algorithm, published by Martin Riedmiller and Heinrich Braun of the University of Karlsruhe. RPROP is a useful variation of neural network training that, in some forms, automatically adapts the learning rate. For details, you can read the original paper at https://www.inf.fu-berlin.de/lehre/WS06/Musterererkennung/Paper/rprop.pdf

For my purposes, I mainly use the rprop+ version of the algorithm, very nicely implemented by Stefan Fritsch and Frauke Guenther, with contributors Marc Suling and Sebastian M. Mueller. It is available as an R package on CRAN at https://CRAN.R-project.org/package=neuralnet, as both a library and source code. rprop+ appears to be quite resilient, in that it converges easily without a lot of hyperparameter tuning. This is important to my point here, which is that implementing a nonlinear model isn't necessarily more difficult than implementing a linear one.
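For readers who want to follow along, a minimal setup sketch (the package name is as given above; everything else is standard R):

install.packages("neuralnet")   # one-time install from CRAN
library(neuralnet)
# neuralnet() uses algorithm = "rprop+" by default, so nothing extra is
# needed to select the rprop+ variant discussed here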

The data are shown here:

Figure 1. Data used in this analysis. Data are synthesized time series data representing sales. Periodic spikes are noted along with long term nonlinear behavior.

The data are a simulated time series of sales, with spikes at quarterly and shorter periods as well as longer-term variations. There are about 3⅓ years of data at daily granularity, and I want to test the potential to use the first 3 years as training data, then predict another 90 days into the future. The business case is that various factors are believed to influence sales, some internal to our business and some external. We have a set of 8 factors, one of which is past sales, the rest being market factors (such as GDP, economic activity, etc.) and internal data (such as sales pipeline, sales incentive programs, new product introductions (NPI), etc.). The past sales are used with a phasing of one year, arrived at by noting that there are annual business cycles. (Note: there are many more rigorous ways to determine phasing; I'll address that in another post.) These factors are labeled a, c, f, g, h, i, j, and k in what follows, and the sales values are labeled Y. For each model, then, the 1210 daily values of the 8 factors are provided, plus the actual sales results, and the task is to build a model that fits the historical data as well as possible.

Linear regression

Using the lm() function in R, I fit a linear model to the data. Linear means that each factor is multiplied by a coefficient (determined by the fit process) and these are simply added together to estimate the resulting sales. The equation looks as follows:

Y = C1*a + C2*c + C3*f + C4*g + C5*h + C6*i + C7*j + C8*k + C0

where, as noted, Y is the sales. Note that C0 is a constant value that is also determined by the regression modeling. Once I have such a model, sales are predicted by simply multiplying an instance of the factors by the coefficients and summing to get a prediction. To predict future sales, values of the factors are needed. If the factors are not time-lagged from the sales then, for example, a forecast for GDP or the future NPI plan would be needed. Depending on the specific case, all the factors might be time-lagged values, so that future sales can be predicted from known data; in other cases, a forecast is needed for some of the factors. These details are not important for the evaluation here.
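As a sketch of how this looks in R (assuming a training data frame named train with columns Y, a, c, f, g, h, i, j, and k, and a test data frame named test with the same columns; these names are illustrative, not from the original code):

# Fit the multiple linear regression; lm() estimates C0..C8 by least squares
lin_fit <- lm(Y ~ a + c + f + g + h + i + j + k, data = train)
summary(lin_fit)                              # inspect the fitted coefficients
lin_pred <- predict(lin_fit, newdata = test)  # multiply factors by coefficients and sum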

Neural network tests

As a first step, I will use a simple neural network that has 8 input nodes (one for each factor) plus a "bias" node. (The bias node is motivated by the behavior of a single unit, also known as a perceptron: including a bias allows a single perceptron to mimic logical operators such as AND and OR, so a bias node is usually included in the network architecture.) These 9 nodes feed into a single hidden layer of 3 nodes, which, along with another bias node, feed into the output node. The network is shown here:

Figure 2. Simple neural network with one hidden layer comprising three nodes. Values for the eight predictors are presented at the left; the outputs of those nodes are multiplied by weights (determined by the modeling process) and sent to the hidden layer. The values shown are the weights determined using the training data (see below).
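A minimal sketch of fitting this network with the neuralnet package follows. The scaling step is my assumption rather than part of the original description; neural networks generally train more reliably on inputs scaled to a common range.

library(neuralnet)

# Scale every column to [0, 1] using its own range
train_s <- as.data.frame(lapply(train, function(x) (x - min(x)) / (max(x) - min(x))))

# One hidden layer of 3 nodes; rprop+ is the default training algorithm;
# linear.output = TRUE because this is regression, not classification
nn <- neuralnet(Y ~ a + c + f + g + h + i + j + k,
                data = train_s, hidden = 3, linear.output = TRUE)
plot(nn)  # draws the network with its fitted weights, as in Figure 2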

There are (at least) two ways that a neural network can model nonlinear behavior. First, every node in a given layer is connected to every node of the next layer. These connections are multiplied by weights, determined in the modeling process, before feeding into the next node. These cross connections can model interactions between factors that the linear model cannot. In addition, a typical neural network node uses a nonlinear function, applied to the node inputs, to determine the node output. These functions are often called activation functions, another reference to organic neuron behavior. A common activation function is the sigmoid, or logistic, function.
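For concreteness, the logistic function maps any input smoothly into the interval (0, 1):

sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid(c(-5, 0, 5))  # approximately 0.007, 0.500, 0.993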

Baseline

The two cases were run as indicated, with the results summarized in the following charts. In each case, the model was trained using the training data, excluding the last 90 days to be used as test data. Predictions were then made out to 90 days in the future from the end of the training data.
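In outline, the split and the two predictions look like this, assuming the full daily history sits in a data frame sales_data ordered by date (an illustrative name) and reusing the objects from the earlier sketches. Note the test factors are scaled with the training ranges before being fed to the network, and the network output is un-scaled back to sales units:

n     <- nrow(sales_data)
train <- sales_data[1:(n - 90), ]   # all but the last 90 days
test  <- sales_data[(n - 89):n, ]   # the 90-day test period

lin_pred <- predict(lin_fit, newdata = test)

mins   <- sapply(train, min); maxs <- sapply(train, max)
test_s <- as.data.frame(scale(test, center = mins, scale = maxs - mins))
nn_out  <- compute(nn, test_s[, setdiff(names(test_s), "Y")])$net.result
nn_pred <- nn_out * (maxs["Y"] - mins["Y"]) + mins["Y"]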

Figure 3. Training and test data overlaid with model predictions. The linear model exhibits short term features not representative of the data. The nonlinear model appears to perform better.

Figure 4. The same data only over the test/prediction range.

The linear model produces a stair-stepped output, and its spikes are exaggerated relative to the original data. The nonlinear model appears to do a better job. In Figure 4, the chart is zoomed in on only the prediction period, and the differences in model performance are clearer.

Figure 5. Residual errors for the linear model over the training range. The distribution is bimodal indicating some issues fitting the underlying data.

Figure 6. Residual errors for the nonlinear model. The distribution more closely approximates a normal distribution, and is narrower than for the linear model.

The distributions of the residual errors make the model performance easy to see and quantify. These charts represent the behavior of the models over the training data range. The linear model has a bimodal distribution, indicating some issues with the fit. The residual errors from the nonlinear model are closer to a normal distribution, and the range of the errors is nearly 10x smaller than for the linear model.
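Charts like Figures 5 and 6 can be produced with base R graphics; a sketch, reusing the illustrative objects from the earlier snippets (nn$net.result holds the network's fitted values on the scaled training data, so they are un-scaled before differencing):

par(mfrow = c(1, 2))
hist(residuals(lin_fit), breaks = 40, main = "Linear model", xlab = "Residual")
nn_fitted <- nn$net.result[[1]] * (maxs["Y"] - mins["Y"]) + mins["Y"]
hist(train$Y - nn_fitted, breaks = 40, main = "Neural network", xlab = "Residual")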

Both models show bias in the test data period, with the linear regression being very strongly biased toward low values. This results from the linear model's inability to learn the long-term nonlinear behavior; because of that, the linear model is biased high at the earliest times and low at the latest times. Of note is that both methods capture the large spikes, which are periodic (seasonal/annual) due to business factors.

Nonlinear model architecture

An advantage of the neural network is that more nodes or layers can be added in an attempt to achieve better performance, which cannot be done in the linear regression model unless additional factors are created. To expand on the latter comment: it is possible, if there is justification to do so, to add factors that are nonlinear combinations of the given factors. For example, a new column of data calculated as a*c could be added, or g^2, etc. The risk in doing so is that the resulting model may not extrapolate well, exhibiting a form of overfitting. Overfitting refers to the case where the model complexity is increased arbitrarily to fit the training data, but the model exhibits unwanted behavior and produces inaccurate predictions outside the training range. The classic example is using higher-order models, such as quadratic or higher powers, to fit noisy data that isn't really a higher-power function of the predictors.
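In R, such hand-crafted nonlinear factors can be added directly in the model formula, for example:

# a:c adds an interaction term; I(g^2) adds a quadratic term
lin_fit2 <- lm(Y ~ a + c + f + g + h + i + j + k + a:c + I(g^2), data = train)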

Returning to the nonlinear neural network, I tested another case using a more complex network having 7, 5, and 3 nodes in three hidden layers. This may be overkill, given the good performance of the simple network; nonetheless, it is interesting to explore the behavior of more complex networks. Note that neural networks for things like autonomous vehicle sensing or search engines have thousands or more nodes and take a great amount of compute power and time to train, so both examples here are "simple" in that context. Specifying the deeper network is sketched below; its performance is shown in the figures that follow.
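The deeper architecture only changes the hidden argument of neuralnet() (stepmax is raised here as a precaution, my assumption, since deeper networks can need more iterations to converge):

nn_deep <- neuralnet(Y ~ a + c + f + g + h + i + j + k,
                     data = train_s, hidden = c(7, 5, 3),
                     linear.output = TRUE, stepmax = 1e6)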

Figure 7. Distribution of residual errors over the training range using a more complex (7, 5, 3 nodes) neural network. The distribution is narrower than with the simple model.

Figure 8. Predictions of more complex neural network over the entire data range.

What is clear is that the residual errors over the training range have narrowed further and are well centered, meaning there is no bias over the training range. It isn't obvious from a plot of the entire data range that there is any improvement, so as before we zoom into the test range:

Figure 9. Performance of the more complex neural network and the same linear model over the test range. The neural network has good short-term predictive results but degrades at longer times.

Here we see that the performance isn't really better: the model misses the peak and drifts low toward later dates. However, if we wanted to predict over a shorter time period, this model might be preferred.

Conclusion

I have shown an example where, using historical time series data, models can be trained and used to attempt to predict the future. The data show nonlinear trends over time, as well as noise and periodic features. A simple neural network is shown to fit the data better than a multiple linear regression model. Further, the goal is to predict 90 days into the future, and the neural network model does a better job of that as well. Finally, a somewhat more complex network is trained; it shows good short-term predictive capability but degrades over the desired prediction period.

In many time series forecasting problems, empirical methods such as ARIMA (autoregressive integrated moving average) and similar techniques are used. The motivation for my work is the idea that, using business knowledge along with internal and external data, models can be constructed that reflect the business, versus purely empirical predictions. In my work I have used this method in cases involving over 40 predictors across 100,000+ instances of data.

Grzegorz Sterkowski

Data Analyst / Power BI/ Pharmacokinetics

4y

Blaine Bateman, EAF LLC is there any R code available for the following research? Cheers!

Gursel Karacor

Data Science Leader - Machine Learning, Prescriptive Modeling, ANN and Boosted Trees Expert, PhD

4y

Good work. It would be a good idea to add a traditional model like ARIMA as a third method to compare.

Adarsh Kawadi

Assistant Manager - Data Engineering | KPMG

7y

Comprehensive explanation. Great work!

Blaine Bateman, EAF

Chief Data Scientist at EAF LLC

7y

Thank you for your thoughts, Ulderico. My take is that you can have traditional models where parameters are included for specific phenomena, you can have purely empirical approaches like ARIMA, which tend to be used more for short term, and you can have learning models which tend to perform better as you provide more data (assuming the data are relevant). All models vary in short or long term accuracy, and performance can be very case dependent. I am trying to solve customer problems where they want to forecast farther in the future, such as 12 months sales. Multiple linear regression can be misled by periodic data, or if there are nonlinear long-term trends. I have used a method in such cases to find hidden periodic behavior using autocorrelation and Discrete Fourier Transform, then provide the model with a sine and cosine input of the frequencies. (A topic for another article). The regression model can then find coefficients to generate the correct phase and amplitude. Neural networks can "find" relationships in data that we may not have considered or even understand. An area of research for me is how to combine multiple methods and generate robust models.

Ulderico Santarelli

Applied Statistician and Author

7y

In my opinion, statistical models try to find an inner rule that dictates the time series evolution; therefore they will be inaccurate in the short term. Neural networks, on the opposite, are meant to replicate observed data as closely as possible, ignoring why they were generated. It would be interesting to find a combination of the two that could combine long-term and short-term accuracy.
