Parameter tuning in neural networks for regression
Source: EAF analysis

Most of what you see in the news and other media regarding neural networks is (a) labeled as "Artificial Intelligence", and (b) deals with classification problems. However, neural networks can also be applied to time series problems, in place of traditional regression methods. The main potential advantage of a neural network over linear regression is that neural networks can easily model non-linear behavior. Not that there aren't regression methods for non-linear systems, but the non-linearity arises naturally from the structure and definition of neural networks. The downside is that training a neural network on a large, multi-variate time series data set is computationally intensive, especially compared to linear regression. The time required to train a neural network is "tunable": you can set a learning rate, among other parameters, to speed things up. The challenge is figuring out what values to use for those parameters. My goal here is to show, in a very simple case, how results can change dramatically with small variations in parameters.

There are plenty of accessible libraries for training neural networks in your development environment of choice. In this article, I'll show some results using Weka, an open-source, free machine learning environment developed by the University of Waikato in New Zealand. Weka is a little different from, say, using a library in R or Python. Weka offers a console user interface where you can set up a wide range of algorithms and train them on your data. It is self-contained in that regard, which makes it great for beginners; I think you could teach interested middle school nerds how to use it. For power users and those who want to deploy trained models, Weka offers integration via Java, among other approaches. Everything in this article was done running in console mode on Windows 10 on a reasonably powerful laptop.

One challenge in training a neural network is knowing when it is "finished". In Weka, you can set the number of epochs (or training time, if you prefer), which directly controls how long the system will try to learn. You can also set two other parameters that affect training time and accuracy: the learning rate and the momentum. Basically, the learning rate controls how big a step the algorithm takes on each iteration. Momentum is an enhancement to the standard gradient descent algorithm that carries some of the previous update into the current one (hence, "momentum"). These parameters (and there are many others you can tweak) are inter-related. If you have a finite amount of time, a faster learning rate can get you closer to the optimum in the allotted time. However, if it is too large, you can send the system into an endless back and forth as it repeatedly overshoots and undershoots the optimum. Momentum can be viewed as a way to get the most out of a learning rate: too small and you gain nothing.
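
To make those two knobs concrete, here is a minimal sketch of a gradient descent update with momentum, applied to a toy one-parameter error function. The function and the values are illustrative only, not Weka's internals:

```java
public class MomentumDemo {
    public static void main(String[] args) {
        // Minimize a toy error function E(w) = (w - 3)^2 with gradient descent.
        double w = 0.0;             // starting weight
        double learningRate = 0.1;  // step size applied to the gradient
        double momentum = 0.5;      // fraction of the previous update carried forward
        double previousDelta = 0.0;

        for (int epoch = 0; epoch < 50; epoch++) {
            double gradient = 2.0 * (w - 3.0);  // dE/dw
            double delta = -learningRate * gradient + momentum * previousDelta;
            w += delta;
            previousDelta = delta;
        }
        System.out.println("w after training: " + w);  // approaches 3.0
    }
}
```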

Consider the very simple system shown here:

This system represents a simplified measurement of something vs. time, where noise in the system affects the measurements. We happen to have actual (known) values up to about 2 1/2 minutes. Perhaps we are trying to replicate in a laboratory a system studied previously for which expected values are published, which we take as actual. We want to be able to predict the behavior at various times, so we take measurements on our experiment and then try to model them. We chart the data and see that it curves upward in time, but we don't know a priori the equation shown. Let's model it with regression using a neural network.

We collect data as indicated by the table. We then import this into Weka as a CSV file and select the "MultilayerPerceptron" function to train on our data. We use the default parameters other than defining a network with two hidden layers of two nodes each, and after training on our data, we get a model as follows:
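
(As an aside, the same setup can also be scripted through Weka's Java API rather than the console. This is a minimal sketch, assuming a hypothetical measurements.csv with the time column first and the measured value as the last, target attribute:)

```java
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainMlp {
    public static void main(String[] args) throws Exception {
        // Load the measurements (hypothetical file name) and mark the
        // last attribute (the measured value) as the target to predict.
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Two hidden layers with two nodes each; all other parameters
        // are left at their defaults, as in this first run.
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("2,2");
        mlp.buildClassifier(data);

        System.out.println(mlp);  // prints the learned weights
    }
}
```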

Weka actually just gives you all the values; we've put them into a diagram to clarify what is going on. On the left are our measured values vs. time; the data point at 10 seconds is used as an example. All the other numbers are the parameters the system learned from our data. Each node other than the output node takes a weighted sum of its inputs (shown as the wij values) and applies the sigmoid function, also known as the logistic function, to calculate its output (shown as the sij values). The data go through layer 1, those outputs go through layer 2, and those outputs go through layer 3, which is just a single linear node. The data are scaled before use, so the output is scaled back to our units. This result isn't impressive: we have a really simple system, and we predict -186 versus the actual value of 54. What happened?
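
(As a sketch of what a single hidden node computes, here is a weighted sum of the inputs plus a bias weight, pushed through the sigmoid. The weights and input are made up for illustration, not the learned values from the diagram:)

```java
public class NodeOutput {
    // Logistic (sigmoid) activation used by the hidden nodes.
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        double scaledTime = 0.06;                  // e.g., the 10 s sample after scaling
        double[] weights = {0.8, -0.4};            // {input weight, bias weight}, illustrative
        double weightedSum = weights[0] * scaledTime + weights[1] * 1.0;
        double nodeOutput = sigmoid(weightedSum);  // this node's "s" value
        System.out.println(nodeOutput);
    }
}
```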

On the left is a set of samples taken from the time series, with the actual value and the predicted value charted together. In that view, things look pretty good. But if we zoom in to the values for short times, we see a different picture--there are pretty large errors up to at least 20 seconds. Is the neural network approach incapable of learning this system? No, the problem is our choice of parameters. In particular, the default learning rate is too high, and the training time is too short. The default values are a learning rate of 0.3 and 500 epochs to train. Let's change the learning rate to 0.1, and the momentum to 0.1 while we are at it, and let's give the system 4096 epochs to learn. We get the following:
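
(In terms of the Java API, that change amounts to the following sketch, using the same hypothetical measurements.csv as before:)

```java
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainMlpTuned {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("2,2");
        mlp.setLearningRate(0.1);   // default is 0.3
        mlp.setMomentum(0.1);       // default is 0.2
        mlp.setTrainingTime(4096);  // epochs; default is 500
        mlp.buildClassifier(data);

        System.out.println(mlp);
    }
}
```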

In this case, for the same input values, we are much closer to the correct value. Looking at all the results shows this model works well over the entire range of time:

So, is the answer just to set a really low learning rate, modest momentum, and throw gobs of compute time at the problem? Unfortunately, not quite. If we take this system, and carefully explore the results across the parameter space of learning rate, momentum, and training time, we get a 3D map of the performance:

Here, the vertical axis is the root-mean-square error, ranging from 0 to 250 in this case. We vary the training time and the rate parameters as shown on the other axes. You can see that for any given training time (epochs), if the learning rate and momentum are too high, the results are degraded. The issue is that too high a rate can miss the optimum. Likewise, results are worse if the training time is shorter. However, we see something a little strange: at low learning rates, there is a region where we don't get the optimum result. Why? We have not provided enough time to converge to a good solution given the small steps we are taking. If we had this chart ahead of time, we could see that the time needed to get a good solution can be as low as 512 epochs if the learning rate is just right.
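
A map like this comes from training a separate model for each combination of parameters. Here is a simplified sketch of such a sweep using Weka's Evaluation class; to keep it two-dimensional it ties momentum to the learning rate, and the grid values and file name are illustrative:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ParameterSweep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        double[] rates = {0.05, 0.1, 0.2, 0.3};
        int[] epochs = {512, 1024, 2048, 4096};

        for (double rate : rates) {
            for (int n : epochs) {
                MultilayerPerceptron mlp = new MultilayerPerceptron();
                mlp.setHiddenLayers("2,2");
                mlp.setLearningRate(rate);
                mlp.setMomentum(rate);   // tie momentum to the rate for this simple sweep
                mlp.setTrainingTime(n);

                // 10-fold cross-validation gives an RMSE estimate for each setting.
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(mlp, data, 10, new Random(1));
                System.out.printf("rate=%.2f epochs=%d RMSE=%.2f%n",
                        rate, n, eval.rootMeanSquaredError());
            }
        }
    }
}
```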

As more and more people gain access to machine learning tools, awareness of their limitations and of the need to test solutions is very important. If you get what seems like a good result, consider testing it by running a few more learning sessions with different parameters. If you get very similar results, you can have more confidence that you are at (or near) an optimal solution.
