Parameter tuning in neural networks for regression
Most of what you see in the news and in other media regarding neural networks is (a) labeled as "Artificial Intelligence", and (b) deals with classification problems. However, neural networks can also be applied to time series problems, in place of traditional regression methods. The main potential advantage of a neural network over a linear regression method is that neural networks can easily model non-linear behavior. Not that there aren't regression methods for non-linear systems, but with neural networks the non-linearity arises naturally from their structure and definition. The downside is that training a neural network on a large, multi-variate time series data set is computationally intensive, especially compared to linear regression. The time required to train a neural network is "tunable": you can set a learning rate, among other parameters, to speed things up. The challenge is figuring out what values to use for those parameters. My goal here is to show, in a very simple case, how results can change dramatically with small variations in the parameters.
There are plenty of accessible libraries for training neural networks in your development environment of choice. In this article, I'll show some results using Weka, a free, open-source machine learning environment developed by the University of Waikato in New Zealand. Weka is a little different from, say, using a library in R or Python. It offers a console user interface where you can set up a wide range of algorithms and train them on your data. It is self-contained in that regard, which makes it great for beginners; I think you could teach interested middle school nerds how to use it. For power users and those who want to deploy trained models, Weka offers integration via Java, among other approaches. Everything in this article was done running in console mode in Windows 10 on a reasonably powerful laptop.
One challenge in training a neural network is knowing when it is "finished". In Weka, you can set the number of epochs (or training time, if you prefer), which directly controls how long the system will try to learn. You can also set two other parameters that have an impact on training time and accuracy: the learning rate and the momentum. Basically, the learning rate is how big the steps are that the algorithm takes on each iteration. Momentum is an enhancement to the standard gradient descent algorithm that carries some of the last update forward into the current one (hence, "momentum"). These parameters (and there are many others you can tweak) are interrelated. If you have a finite amount of time, a faster learning rate can get you closer to the optimum in the allotted time. However, if it is too large, you can send the system into an endless back and forth as it overshoots and undershoots the optimum. Momentum can be viewed as a way to get the most out of a learning rate: too small and you gain nothing.
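To make the interplay concrete, here is a rough sketch of a single gradient-descent weight update with momentum. This is not Weka's internal code; the numbers and variable names are purely illustrative.

```java
// Illustrative sketch of one gradient-descent step with momentum.
// Not Weka's internal implementation; all values are placeholders.
public class MomentumStep {
    public static void main(String[] args) {
        double learningRate = 0.3;   // how big each step is
        double momentum = 0.2;       // fraction of the previous step carried forward
        double weight = 0.5;         // current value of one network weight
        double previousDelta = 0.0;  // change applied on the previous iteration

        // pretend gradient of the error with respect to this weight
        double gradient = 1.2;

        // momentum update: new step = -rate * gradient + momentum * last step
        double delta = -learningRate * gradient + momentum * previousDelta;
        weight += delta;
        previousDelta = delta;       // the next iteration would start from this

        System.out.println("updated weight = " + weight);
    }
}
```

With momentum at zero this reduces to plain gradient descent; with a non-zero momentum, successive steps in the same direction reinforce each other, which is why it can help squeeze more progress out of a small learning rate.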
Consider the very simple system shown here:
This system represents a simplified measurement of something vs. time, where noise in the system affects the measurements. We happen to have actual (known) values up to about 2 1/2 minutes. Perhaps we are trying to replicate in a laboratory a system studied previously, for which expected values are published; we take those as actual. We want to be able to predict the behavior at various times, so we take measurements on our experiment and then try to model them. We chart the data and see that it curves upward in time, but we don't know a priori the equation shown. Let's model it with regression using a neural network.
We collect data as indicated by the table. We then import this into Weka as a csv file and select the "MultilayerPerceptron" function to train on our data. We use the default parameters, other than defining a two-hidden-layer network with two nodes in each layer, and after training on our data, we get a model as follows:
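(If you would rather script this run than use the console, a minimal sketch via Weka's Java API might look like the following. The csv file name and column layout are my assumptions here: time in the first column and the measured value in the last column, which is used as the target.)

```java
// Minimal sketch of training Weka's MultilayerPerceptron on the measured data.
// File name and column layout are assumptions, not taken from the article.
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainDefault {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // predict the last column

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("2,2");  // two hidden layers, two nodes each
        // learning rate (0.3) and training time (500 epochs) stay at the defaults

        mlp.buildClassifier(data);
        System.out.println(mlp);     // prints the learned weights
    }
}
```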
Weka actually just gives you all the values; we've put them into a diagram to clarify what is going on. On the left are our measured values vs. time; the data point at 10 seconds is used as an example. All the other numbers are the parameters the system learned from our data. Each node other than the output node takes a weighted sum of its inputs (the weights are shown as the wij values) and applies the sigmoid function, also known as the logistic function, to calculate its output (shown as the sij values). The data go through layer 1, those outputs go through layer 2, and those outputs go through layer 3, which is just a single linear node. The data are scaled before use, so the output is scaled back to our units. This result isn't impressive: for a really simple system, we predict -186 versus the actual value of 54. What happened?
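To make the forward pass in the diagram explicit, here is a sketch of the same structure: two sigmoid hidden layers of two nodes each, followed by a single linear output node. The weights, biases, and scaling factors below are placeholders, not the values Weka actually learned.

```java
// Sketch of the forward pass described above. All numbers are placeholders.
public class ForwardPass {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // one layer: each node outputs sigmoid(bias + sum of weight * input)
    static double[] layer(double[] inputs, double[][] weights, double[] biases) {
        double[] out = new double[biases.length];
        for (int j = 0; j < biases.length; j++) {
            double sum = biases[j];
            for (int i = 0; i < inputs.length; i++) {
                sum += weights[j][i] * inputs[i];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        double time = 10.0;            // input: time in seconds
        double scaled = time / 150.0;  // placeholder input scaling

        double[] h1 = layer(new double[]{scaled},
                new double[][]{{1.0}, {-0.5}}, new double[]{0.1, 0.2});
        double[] h2 = layer(h1,
                new double[][]{{0.8, -0.3}, {0.4, 0.6}}, new double[]{0.0, -0.1});

        // output node: a plain weighted sum (linear), then scaled back to our units
        double outScaled = 0.5 + 0.7 * h2[0] - 0.2 * h2[1];
        double prediction = outScaled * 250.0;  // placeholder output scaling
        System.out.println("predicted value at t=10s: " + prediction);
    }
}
```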
On the left is a set of samples taken from the time series, with the actual and predicted values charted together. In that view, things look pretty good. But if we zoom in on the values for short times, we see a different picture: there are pretty large errors up to at least 20 seconds. Is the neural network approach incapable of learning this system? No, the problem is our choice of parameters. In particular, the default learning rate is too high and the training time is too short. The defaults are a learning rate of 0.3 and 500 epochs of training. Let's change the learning rate to 0.1 and the momentum to 0.1 while we are at it, and let's give the system 4096 epochs to learn. We get the following:
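(In API terms, the re-tuned run corresponds to something like this sketch, with the same assumptions about the data file as before.)

```java
// Sketch of the re-tuned run: learning rate and momentum lowered to 0.1,
// training extended to 4096 epochs. File name/column layout are assumptions.
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTuned {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("2,2");
        mlp.setLearningRate(0.1);
        mlp.setMomentum(0.1);
        mlp.setTrainingTime(4096);  // number of epochs

        mlp.buildClassifier(data);
        System.out.println(mlp);
    }
}
```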
In this case, for the same input values, we are much closer to the correct value. Looking at all the results shows this model works well over the entire range of time:
So, is the answer just to set a really low learning rate, modest momentum, and throw gobs of compute time at the problem? Unfortunately, not quite. If we take this system and carefully explore the results across the parameter space of learning rate, momentum, and training time, we get a 3D map of the performance:
Here, the vertical axis is the root-mean-square error, ranging from 0 to 250 in this case. We vary the training time and the rate parameters as shown on the other axes. You can see that for any given training time (epochs), if the learning rate/momentum are too high, the results are degraded; too high a rate can step right past the optimum. Likewise, results are worse if the training time is shorter. However, we see something a little strange: at low learning rates, there is a region where we don't get the optimum result. Why? We have not provided enough time to converge to a good solution given the small steps we are taking. If we had this chart ahead of time, we could see that the time to get a good solution can be as low as 512 epochs if the learning rate is just right.
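The sweep behind a map like this is just a set of nested loops: train a fresh network for each combination of parameters and record the error. Here is a sketch of that idea; the grid values and file name are illustrative, the momentum is varied along with the learning rate, and for simplicity the error is measured on the training data itself.

```java
// Sketch of a parameter sweep over learning rate/momentum and training time.
// Grid values and file name are illustrative only.
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ParameterSweep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("measurements.csv").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        double[] rates = {0.05, 0.1, 0.2, 0.3, 0.5};
        int[] epochs = {128, 256, 512, 1024, 2048, 4096};

        for (double rate : rates) {
            for (int n : epochs) {
                MultilayerPerceptron mlp = new MultilayerPerceptron();
                mlp.setHiddenLayers("2,2");
                mlp.setLearningRate(rate);
                mlp.setMomentum(rate);   // vary momentum along with the rate
                mlp.setTrainingTime(n);
                mlp.buildClassifier(data);

                // evaluate the trained model and report the RMS error
                Evaluation eval = new Evaluation(data);
                eval.evaluateModel(mlp, data);
                System.out.printf("rate=%.2f epochs=%d RMSE=%.2f%n",
                        rate, n, eval.rootMeanSquaredError());
            }
        }
    }
}
```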
As more and more people gain access to machine learning tools, awareness of their limitations and of the need to test solutions is very important. If you get what seems like a good result, consider testing it by running a few more learning sessions with different parameters. If you get very similar results, you will have more confidence that you are at (or near) an optimal solution.