Parameter tuning in neural networks for regression
Most of what you see in the news and in other media regarding neural networks is (a) labeled as "Artificial Intelligence", and (b) deals with classification problems. However, neural networks can also be applied to time series problems, in place of traditional regression methods. The main potential advantage of a neural network over a linear regression method is that neural networks can easily model non-linear behavior. Not that there aren't regression methods for non-linear systems, but with neural networks the non-linearity arises naturally from their structure and definition. The downside is that training a neural network on a large, multi-variate time series data set is computationally intensive, especially compared to linear regression. The time required to train a neural network is "tunable": you can set a learning rate, among other parameters, to speed things up. The challenge is figuring out what values to use for those parameters. My goal here is to show, in a very simple case, how results can change dramatically with small variations in the parameters.
There are plenty of accessible libraries for training neural networks in your development environment of choice. In this article, I'll show some results using Weka, a free, open-source machine learning environment developed by the University of Waikato in New Zealand. Weka is a little different from, say, using a library in R or Python. It offers a console user interface where you can set up a wide range of algorithms and train them on your data. It is self-contained in that regard, which makes it great for beginners; I think you could teach interested middle school nerds how to use it. For power users and those who want to deploy trained models, Weka offers integration via Java, among other approaches. Everything in this article was done running in console mode in Windows 10 on a reasonably powerful laptop.
One challenge in training a neural network is knowing when it is "finished". In Weka, you can set the number of epochs (or training time, if you prefer), which directly controls how long the system will try to learn. You can also set two other parameters that have an impact on training time and accuracy: the learning rate and the momentum. Basically, the learning rate is how big the steps are that the algorithm takes on each iteration. Momentum is an enhancement to the standard gradient descent algorithm that carries some of the last update forward into the current one (hence, "momentum"). These parameters (and there are many others you can tweak) are interrelated. If you have a finite amount of time, a faster learning rate can get you closer to the optimum in the allotted time. However, if it is too large, you can send the system into an endless back and forth as it overshoots and undershoots the optimum. Momentum can be viewed as a way to get the most out of a learning rate: too small and you gain nothing.
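To make the interplay concrete, here is a rough sketch of a single gradient-descent weight update with momentum. This is not Weka's internal code; the numbers and variable names are purely illustrative.

```java
// Illustrative sketch of one gradient-descent step with momentum.
// Not Weka's internal implementation; all values are placeholders.
public class MomentumStep {
    public static void main(String[] args) {
        double learningRate = 0.3;   // how big each step is
        double momentum = 0.2;       // fraction of the previous step carried forward
        double weight = 0.5;         // current value of one network weight
        double previousDelta = 0.0;  // change applied on the previous iteration

        // pretend gradient of the error with respect to this weight
        double gradient = 1.2;

        // momentum update: new step = -rate * gradient + momentum * last step
        double delta = -learningRate * gradient + momentum * previousDelta;
        weight += delta;
        previousDelta = delta;       // the next iteration would start from this

        System.out.println("updated weight = " + weight);
    }
}
```

With momentum at zero this reduces to plain gradient descent; with a non-zero momentum, successive steps in the same direction reinforce each other, which is why it can help squeeze more progress out of a small learning rate.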
Consider the very simple system shown here:
This system represents a simplified measurement of something vs. time, where noise in the system affects the measurements. We happen to have actual (known) values up to about 2 1/2 minutes. Perhaps we are trying to replicate in a laboratory a system studied previously, for which expected values are published; we take those as actual. We want to be able to predict the behavior at various times, so we take measurements on our experiment and then try to model them. We chart the data and see that it curves upward in time, but we don't know a priori the equation shown. Let's model it with regression using a neural network.
We collect data as indicated by the table. We then import this into Weka as a csv file and select the "MultilayerPerceptron" function to train on our data. We use the default parameters, other than defining a two-hidden-layer network with two nodes in each layer, and after training on our data, we get a model as follows:
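(If you would rather script this run than use the console, a minimal sketch via Weka's Java API might look like the following. The csv file name and column layout are my assumptions here: time in the first column and the measured value in the last column, which is used as the target.)

```java
// Minimal sketch of training Weka's MultilayerPerceptron on the measured data.
// File name and column layout are assumptions, not taken from the article.
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainDefault {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // predict the last column

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("2,2");  // two hidden layers, two nodes each
        // learning rate (0.3) and training time (500 epochs) stay at the defaults

        mlp.buildClassifier(data);
        System.out.println(mlp);     // prints the learned weights
    }
}
```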
Weka actually just gives you all the values; we've put them into a diagram to clarify what is going on. On the left are our measured values vs. time; the data point at 10 seconds is used as an example. All the other numbers are the parameters the system learned from our data. Each node other than the output node takes a weighted sum of its inputs (the weights are shown as the wij values) and applies the sigmoid function, also known as the logistic function, to calculate its output (shown as the sij values). The data go through layer 1, those outputs go through layer 2, and those outputs go through layer 3, which is just a single linear node. The data are scaled before use, so the output is scaled back to our units. This result isn't impressive: for a really simple system, we predict -186 versus the actual value of 54. What happened?
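To make the forward pass in the diagram explicit, here is a sketch of the same structure: two sigmoid hidden layers of two nodes each, followed by a single linear output node. The weights, biases, and scaling factors below are placeholders, not the values Weka actually learned.

```java
// Sketch of the forward pass described above. All numbers are placeholders.
public class ForwardPass {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // one layer: each node outputs sigmoid(bias + sum of weight * input)
    static double[] layer(double[] inputs, double[][] weights, double[] biases) {
        double[] out = new double[biases.length];
        for (int j = 0; j < biases.length; j++) {
            double sum = biases[j];
            for (int i = 0; i < inputs.length; i++) {
                sum += weights[j][i] * inputs[i];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        double time = 10.0;            // input: time in seconds
        double scaled = time / 150.0;  // placeholder input scaling

        double[] h1 = layer(new double[]{scaled},
                new double[][]{{1.0}, {-0.5}}, new double[]{0.1, 0.2});
        double[] h2 = layer(h1,
                new double[][]{{0.8, -0.3}, {0.4, 0.6}}, new double[]{0.0, -0.1});

        // output node: a plain weighted sum (linear), then scaled back to our units
        double outScaled = 0.5 + 0.7 * h2[0] - 0.2 * h2[1];
        double prediction = outScaled * 250.0;  // placeholder output scaling
        System.out.println("predicted value at t=10s: " + prediction);
    }
}
```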
On the left is a set of samples taken from the time series, with the actual and predicted values charted together. In that view, things look pretty good. But if we zoom in on the values for short times, we see a different picture: there are pretty large errors up to at least 20 seconds. Is the neural network approach incapable of learning this system? No, the problem is our choice of parameters. In particular, the default learning rate is too high and the training time is too short. The defaults are a learning rate of 0.3 and 500 epochs of training. Let's change the learning rate to 0.1 and the momentum to 0.1 while we are at it, and let's give the system 4096 epochs to learn. We get the following:
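(In API terms, the re-tuned run corresponds to something like this sketch, with the same assumptions about the data file as before.)

```java
// Sketch of the re-tuned run: learning rate and momentum lowered to 0.1,
// training extended to 4096 epochs. File name/column layout are assumptions.
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTuned {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("2,2");
        mlp.setLearningRate(0.1);
        mlp.setMomentum(0.1);
        mlp.setTrainingTime(4096);  // number of epochs

        mlp.buildClassifier(data);
        System.out.println(mlp);
    }
}
```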
In this case, for the same input values, we are much closer to the correct value. Looking at all the results shows this model works well over the entire range of time:
So, is the answer just to set a really low learning rate, modest momentum, and throw gobs of compute time at the problem? Unfortunately, not quite. If we take this system and carefully explore the results across the parameter space of learning rate, momentum, and training time, we get a 3D map of the performance:
Here, the vertical axis is the root-mean-square error, ranging from 0 to 250 in this case. We vary the training time and the rate parameters as shown on the other axes. You can see that for any given training time (epochs), if the learning rate/momentum are too high, the results are degraded; too high a rate can step right past the optimum. Likewise, results are worse if the training time is shorter. However, we see something a little strange: at low learning rates, there is a region where we don't get the optimum result. Why? We have not provided enough time to converge to a good solution given the small steps we are taking. If we had this chart ahead of time, we could see that the time to get a good solution can be as low as 512 epochs if the learning rate is just right.
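The sweep behind a map like this is just a set of nested loops: train a fresh network for each combination of parameters and record the error. Here is a sketch of that idea; the grid values and file name are illustrative, the momentum is varied along with the learning rate, and for simplicity the error is measured on the training data itself.

```java
// Sketch of a parameter sweep over learning rate/momentum and training time.
// Grid values and file name are illustrative only.
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ParameterSweep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        double[] rates = {0.05, 0.1, 0.2, 0.3, 0.5};
        int[] epochs = {128, 256, 512, 1024, 2048, 4096};

        for (double rate : rates) {
            for (int n : epochs) {
                MultilayerPerceptron mlp = new MultilayerPerceptron();
                mlp.setHiddenLayers("2,2");
                mlp.setLearningRate(rate);
                mlp.setMomentum(rate);   // vary momentum along with the rate
                mlp.setTrainingTime(n);
                mlp.buildClassifier(data);

                // evaluate the trained model and report the RMS error
                Evaluation eval = new Evaluation(data);
                eval.evaluateModel(mlp, data);
                System.out.printf("rate=%.2f epochs=%d RMSE=%.2f%n",
                        rate, n, eval.rootMeanSquaredError());
            }
        }
    }
}
```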
As more and more people gain access to machine learning tools, awareness of their limitations and of the need to test solutions is very important. If you get what seems like a good result, consider testing it by running a few more learning sessions with different parameters. If you get very similar results, you will have more confidence that you are at (or near) an optimal solution.