The Emergence of Machine Learning in Forecasting: A Field Where Statistical Models Dominate
Christa Ruiz and Alkis Vazacopoulos, Stevens Institute of Technology.
Artificial Intelligence (AI) has grown especially prominent over the last decade and has consequently found applications in the field of forecasting. A substantial amount of research has been conducted on Machine Learning (ML) methods and Neural Networks (NNs) and on how they can be utilized to improve time-series forecasts. However, statistical models are already known to be reliable and accurate at forecasting.
ML and statistical methods share the same objective: both aim to improve forecasting accuracy by minimizing a loss function, typically the sum of squared errors. ML methods are computationally more demanding than statistical ones, and their implementation depends far more heavily on computer science.
To date, studies of ML methods have been characterized by three major limitations. First, their conclusions are based on a few, or even a single, time series, which calls the statistical significance of the results into question. Second, the methods are evaluated on short-term forecasting horizons, excluding medium- and long-term ones. Finally, no benchmarks are used to compare the accuracy of ML methods against alternatives.
The research paper [1] studied the accuracy of ten ML models and methods, using a subset of 1,045 monthly series from the 3,003 series of the M3 Competition. Each forecasting model was developed using the first n − 18 observations, where n is the length of the series. Then, 18 forecasts were produced, and their accuracy was evaluated against the actual values that were not used to develop the forecasting model. Information on each ML method is as follows:
1. Multi-Layer Perceptron (MLP). MLP is arguably the most popular of the ML methods. First, a single-hidden-layer NN is constructed. Then, the best number of input nodes N = [1, 2, ..., 5] is defined using a 10-fold validation process, with the inputs being the observations Yt−5, Yt−4, Yt−3, Yt−2, and Yt−1 for predicting the time series at point t, and doing so for all n − 18 data points. Third, the number of hidden nodes is set to 2N + 1, aimed at decreasing the computational time needed for constructing the NN model (the number of hidden layers used is typically of secondary importance). The Scaled Conjugate Gradient method is then used instead of Standard Backpropagation for estimating the optimal weights; it is an alternative to the Levenberg-Marquardt algorithm and is considered more appropriate for weight optimization. The learning rate is selected between 0.1 and 1, using random initial weights to start the training process, with a maximum of 500 iterations. To maximize the flexibility of the method, a linear function is used for the output nodes. This is crucial because a logistic output activation function is bounded and therefore unsuitable for extrapolating trended time series. Due to the nonlinear activation functions, the data are scaled between 0 and 1 to avoid computational problems, meet algorithm requirements, and facilitate faster network learning. Once the predictions are made, the forecasts are rescaled back to the original scale. (A rough sketch of this shared lagged-input and cross-validation setup appears after this list.)
2. Bayesian Neural Network (BNN). The BNN is similar to the MLP method but optimizes the network parameters according to the Bayesian concept, meaning that the weights are estimated assuming some a priori distributions of errors. The Nguyen and Widrow algorithm is used to assign initial weights and the Gauss-Newton algorithm to perform the optimization. Similar to the MLP method, the best number of input nodes N = [1, 2, ..., 5] is defined using a 10-fold validation process and the number of hidden nodes is set to 2N + 1. A total of 500 iterations is considered and the data are linearly scaled.
3. Radial Basis Functions (RBF). RBF is a feed-forward network with one hidden layer and is similar to the MLP method. This method is more interpretable and faster to compute since it performs a linear combination of n basis functions that are radially symmetric around a center, which means the information is represented locally in the network. Again, the best number of input nodes N = [1, 2, ..., 5] is defined using a 10-fold validation process and the number of hidden nodes is automatically set to 2N + 1. A total of 500 iterations is considered and the data are linearly scaled. The output activation function is the linear one.
4. Generalized Regression Neural Networks (GRNN). Also called the Nadaraya-Watson estimator or the kernel regression estimator, the GRNN method contrasts with the previous methods in that it is nonparametric: predictions are found by averaging the target outputs of the training data points according to their distance from the observation provided each time. The sigma parameter, which determines the smoothness of fit, is selected together with the number of inputs N using the 10-fold validation process. The inputs, linearly scaled, vary from 1 to 5 and the sigma from 0.05 to 1, with a step of 0.05. (A bare-bones kernel-regression sketch appears after this list.)
5. K-Nearest Neighbor regression (KNN). KNN is a nonparametric regression method basing its forecasts on the Euclidean distance between the points used for training and testing the method. Thus, given the N inputs, the method picks the closest K training data points and sets the prediction as the average of the target output values for these points. The K parameter, which determines the smoothness of fit, is once again optimized together with the number of inputs using the 10-fold validation process. The inputs, which are linearly scaled, may vary from 1 to 5 and the K from 2 to 10.
6. CART regression trees (CART). CART is a regression method based on tree-like recursive partitioning of the input space. The space specified by the training sample is divided into regions, called the terminal leaves. Then, a sequence of tests is applied at the decision nodes in order to determine the leaf node to which an object should be assigned based on the input provided. The tests are applied serially from the root node to the leaves, until a final decision is made. Like the previous approaches, the number of input nodes N = [1, 2, ..., 5] is defined using a 10-fold validation process, and the inputs are then linearly scaled.
7. Support Vector Regression (SVR). SVR is the regression variant of the SVM that tries to identify the hyperplane maximizing the margin while keeping the total error within a tolerance. Forecasts were produced using a ν-regression SVM, which maximizes the borders of the margin under suitable conditions to avoid outlier inclusion, allowing the SVM to decide the number of support vectors needed. The kernel used in training and predicting is the radial basis one, mainly due to its good general performance and the few parameters it requires. ν is set equal to the noise level of the training sample, while the cost of constraint violation C is fixed to the maximum of the target output values, which is 1. Then, the γ parameter is optimized together with the number of inputs N, using a 10-fold validation process. The inputs are linearly scaled as in the previously described methods. (A scikit-learn sketch of this setup appears after this list.)
8. Gaussian Processes (GP). With GP, every target variable can be associated with one or more normally distributed random variables which form a multivariate normal distribution, emerging by combining the individual distributions of the independent ones. Thus, Gaussian processes can serve as a nonparametric regression method that assumes an a priori distribution for the input variables provided during training and then combines them appropriately using a measure of similarity between points (the kernel function) to predict the future value of the variable of interest. The input variables are the past observations of the time series, linearly scaled, while their total number N = [1, 2, ..., 5] is defined using a 10-fold validation process. The kernel function used is the radial basis one, while the initial noise variance and the tolerance of termination were set to 0.001, given that a three-dimensional 10-fold validation approach would be computationally prohibitive.
9. Recurrent Neural Network (RNN). The simple RNN, also known as the Elman network, has a structure similar to the MLP but contains feedback connections that take previous states into account, along with the current input, before producing the final output(s). This is done by saving a copy of the previous values of the layer containing the recurrent nodes and using them as an additional input for the next step. For this study, the model used to implement the RNN is the sequential one, composed of two layers: a hidden one containing recurrent nodes and an output one containing one or more linear nodes. Due to high computational requirements, k-fold validation was not used to choose the optimal network architecture per series; instead, three input nodes and six recurrent units forming the hidden layer were used for all the time series of the dataset. This selection was based on the results of a random sample of series for which this parameterization displayed the best performance. Regarding the rest of the hyper-parameters, 500 epochs were used and the learning rate was set to 0.001, with the linear activation function being used in all nodes.
10. Long Short Term Memory neural network (LSTM). The LSTM network is similar to the RNN and was proposed to avoid the long-term dependency problem present in the latter. The advantage LSTM units have over regular RNN units is their ability to keep information over longer periods of time, thanks to a more complex architecture consisting of several gates that can remove or add information to the unit's state. As with the RNN, the model used to implement the LSTM network is the sequential one, consisting of a hidden and an output layer. Again, due to high computational time, the architecture consists of three input nodes, six LSTM units forming the hidden layer, and a single linear node in the output layer. The linear activation function is used before the output of all units and the hard sigmoid one for the recurrent step. Regarding the rest of the hyper-parameters, the rmsprop optimizer was used, 500 epochs were chosen, and the learning rate was set to 0.001. (A Keras-style sketch of this recurrent setup follows this list.)
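The sketches below are rough, hedged reconstructions of some of these setups, not the authors' code. The first illustrates the pattern shared by items 1 through 3: lagged input windows, the number of inputs N chosen from 1 to 5 by 10-fold cross-validation, and 2N + 1 hidden nodes. scikit-learn's MLPRegressor is used here as a stand-in; it does not offer the Scaled Conjugate Gradient optimizer described in item 1.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

def make_windows(y, n_lags):
    """Build (inputs, target) pairs: the last n_lags observations predict the next one."""
    X = np.array([y[i:i + n_lags] for i in range(len(y) - n_lags)])
    return X, y[n_lags:]

def fit_mlp(train):
    # Scale the series to [0, 1], as the nonlinear activations require.
    y = MinMaxScaler().fit_transform(np.asarray(train, float).reshape(-1, 1)).ravel()
    best_n, best_score = 1, -np.inf
    for n in range(1, 6):                                   # N = 1, ..., 5 input nodes
        X, t = make_windows(y, n)
        net = MLPRegressor(hidden_layer_sizes=(2 * n + 1,), max_iter=500)
        score = cross_val_score(net, X, t, cv=10).mean()     # 10-fold validation
        if score > best_score:
            best_n, best_score = n, score
    X, t = make_windows(y, best_n)
    return MLPRegressor(hidden_layer_sizes=(2 * best_n + 1,), max_iter=500).fit(X, t)
```

Forecasts from the fitted network would then be rescaled back to the original units, as noted in item 1.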
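Next, a bare-bones Nadaraya-Watson (GRNN) estimator as outlined in item 4, written in plain NumPy: each prediction is a kernel-weighted average of the training targets, with sigma controlling the smoothness of fit.

```python
import numpy as np

def grnn_predict(X_train, y_train, x_new, sigma=0.5):
    d2 = np.sum((X_train - x_new) ** 2, axis=1)          # squared Euclidean distances
    w = np.exp(-d2 / (2.0 * sigma ** 2))                  # Gaussian kernel weights
    return np.dot(w, y_train) / (np.sum(w) + 1e-12)       # weighted average of targets

# X_train holds lagged windows, y_train the next observations, and sigma
# (0.05 to 1 in the study) is tuned by cross-validation together with N.
```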
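Item 7, sketched with scikit-learn's NuSVR (the actual SVM implementation used in [1] is not specified here): ν plays the role of the noise-level parameter, C is fixed to 1 because the targets are scaled to [0, 1], and γ is tuned by 10-fold cross-validation.

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.model_selection import GridSearchCV

def fit_svr(X, t, noise_level=0.1):
    grid = {"gamma": np.logspace(-3, 1, 10)}               # candidate gamma values
    search = GridSearchCV(NuSVR(nu=noise_level, C=1.0, kernel="rbf"), grid, cv=10)
    return search.fit(X, t).best_estimator_
```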
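Finally, items 9 and 10 sketched in Keras as a plausible reconstruction: a Sequential model with three lagged inputs, six recurrent units in a single hidden layer, one linear output node, the rmsprop optimizer, and 500 epochs.

```python
from tensorflow import keras

def build_lstm(n_lags=3, n_units=6):
    model = keras.Sequential([
        keras.layers.Input(shape=(n_lags, 1)),              # three lagged input steps
        keras.layers.LSTM(n_units, activation="linear",
                          recurrent_activation="hard_sigmoid"),
        keras.layers.Dense(1, activation="linear"),          # single linear output node
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001), loss="mse")
    return model

# X must be shaped (samples, 3, 1); replace LSTM with SimpleRNN for item 9.
# model = build_lstm(); model.fit(X, y, epochs=500, verbose=0)
```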
The two accuracy measures used in the evaluations were the symmetric Mean Absolute Percentage Error (sMAPE) and the Mean Absolute Scaled Error (MASE).
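A minimal sketch of how these measures could be computed, combined with the hold-out scheme described earlier (fit on the first n − 18 observations, score the 18 held-out forecasts). The seasonal period used to scale MASE is an assumption; m = 1 gives the plain naive scaling.

```python
import numpy as np

HORIZON = 18  # the last 18 observations of each series are held out

def smape(actual, forecast):
    """Symmetric MAPE, in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(2.0 * np.abs(actual - forecast) /
                           (np.abs(actual) + np.abs(forecast)))

def mase(actual, forecast, insample, m=1):
    """MAE of the forecasts scaled by the in-sample MAE of the naive forecast
    with lag m (m = 1 for the plain naive, m = 12 for monthly seasonality)."""
    actual, forecast, insample = (np.asarray(a, float) for a in (actual, forecast, insample))
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(actual - forecast)) / scale

def evaluate_series(y, fit_and_forecast):
    """Fit on the first n - 18 points, forecast 18 ahead, and score both measures."""
    y = np.asarray(y, float)
    train, test = y[:-HORIZON], y[-HORIZON:]
    f = fit_and_forecast(train, HORIZON)
    return smape(test, f), mase(test, f, train)

# Example with a naive "last value" forecaster as a stand-in for any ML method:
print(evaluate_series(np.random.rand(120) + 1.0,
                      lambda train, h: np.repeat(train[-1], h)))
```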
To reduce computation time, the best preprocessing alternative for improving the post-sample one-step-ahead forecasting performance of the MLP method was determined first, and that choice was then applied to the other ML models.
Table 4 of [1] shows that the best combination according to sMAPE is Box-Cox transformation plus deseasonalization, while the best one according to MASE is Box-Cox transformation, deseasonalization, and detrending.
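A sketch of that winning preprocessing combination, using scipy and a recent statsmodels as stand-ins for whatever tooling the authors used; detrending would be added analogously for the MASE-optimal variant.

```python
import numpy as np
from scipy.stats import boxcox
from statsmodels.tsa.seasonal import seasonal_decompose

def preprocess(y, period=12):
    y = np.asarray(y, dtype=float)
    transformed, lam = boxcox(y)                           # Box-Cox (y must be positive)
    seasonal = seasonal_decompose(transformed, model="additive",
                                  period=period).seasonal  # monthly seasonal component
    return transformed - seasonal, lam, seasonal           # deseasonalized series

# The model is fit on the adjusted series; forecasts are re-seasonalized and
# passed through the inverse Box-Cox transform before evaluation.
```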
Then, the researchers [1] evaluated the forecasting performance of ML methods compared to statistical ones.
As Tables 8 and 9 of [1] demonstrate, statistical models generally outperform ML methods across all forecasting horizons. Theta, Comb, and ARIMA prove to be the most dominant methods among the competitors according to both MASE and sMAPE. It is also important to note that Computational Complexity (CC) is not necessarily significantly higher for the best-performing methods. More complex ML methods such as the Direct and Multi MLP display less accurate results, demonstrating that greater complexity does not translate into a more accurate model and suggesting that, in general, the ML methods fail to learn how to best predict across forecasting horizons.
The goal of ML models is to learn by solving an optimization problem in order to choose a set of parameters that minimize an error function. However, the same optimization is already being done in ARIMA models. Therefore, there is no justification as to why MLP, one of the best ML models, has a 1.24% higher sMAPE than ARIMA. Likewise, it would be expected that the more advanced types of NNs, the RNN and LSTM, would be far more accurate than ARIMA and the rest of the statistical methods utilized. Essentially, if any real learning were taking place, the ML methods should have outperformed ARIMA and greatly exceeded the Naive 2 benchmark.
A more serious question ML methods face is whether, and how, they can be made to learn about the unknown future rather than merely fitting past data well. For this to be possible, ML methods would need access to information about the future, and their objective would have to be minimizing future errors rather than the errors of fitting a model to the available data. However, ML is not yet this advanced, and until it is, the researchers suggest deseasonalizing the data before an ML model is applied, since research has shown little to no difference between the post-sample accuracy of models applied to original versus seasonally adjusted data.
In summary, the results of the study [1] indicate that ML methods must become more accurate, require less computation time, and be better understood. Traditional statistical methods are clearly more accurate than ML ones, and the reasons why are not yet fully understood. It is also important to note that the comparisons of statistical and ML methods reported here could depend on the specific data set used. Either way, the results demonstrate the need for objective, unbiased ways to test the performance of forecasting methods, which can be achieved through sizable competitions that allow for significant comparisons and conclusions.
References
1. Makridakis S, Spiliotis E, Assimakopoulos V (2018) Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS ONE 13(3): e0194889. https://doi.org/10.1371/journal.pone.0194889