Optimization operations in supervised learning and hyperparameter choices

In today's world it is hard to imagine an industry or company that is not interested in implementing machine learning and artificial intelligence in its processes. At first glance, building proper models and training them correctly seems to require really powerful machines. That was true at the beginning of the past decade, but recent developments in optimization techniques and learning-rate schedules for neural networks have created a reality where anyone with the proper knowledge of them can be competitive.

Feature scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. It is also known as data normalization and is generally performed during the data preprocessing step.

Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

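As an illustration (not part of the original article), here is a minimal NumPy sketch of the two most common scaling schemes, min-max normalization and standardization; the function names and example values are just placeholders:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature (column) to the [0, 1] range."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Rescale each feature to zero mean and unit variance (z-score)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example: two features with very different ranges
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
print(min_max_scale(X))
print(standardize(X))
```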

Batch normalization

Batch normalization (also known as batch norm) is a method used to make artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling.

While the effect of batch normalization is evident, the reasons behind its effectiveness remain under discussion. It was believed that it can mitigate the problem of internal covariate shift, where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network. Recently, some scholars have argued that batch normalization does not reduce internal covariate shift, but rather smooths the objective function, which in turn improves the performance.

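A minimal sketch of the batch-norm forward pass for one fully connected layer, assuming a NumPy mini-batch of shape (batch, features); the names gamma, beta and eps follow the standard formulation and are not taken from the article:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then re-scale and re-center."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # re-centered and re-scaled inputs
    return gamma * x_hat + beta            # learnable scale and shift

# Example usage
x = np.random.randn(32, 4) * 10 + 3        # a mini-batch with shifted, scaled features
gamma = np.ones(4)
beta = np.zeros(4)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))   # roughly 0 and 1 per feature
```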

Mini-batch gradient descent

In simple words, normal gradient descent uses the gradient computed by the neural network, which lets us relate the parameters to a cost function and move toward the minimum error of the system we are working with. But what happens when the training set gets too big? Traditionally we would compute one correction with gradient descent over the whole training set, which takes too much time and processing power, so instead we split the data into batches to reduce the work per iteration. As an example, say we have a data set of 5,000,000 examples and we create batches of 1,000 to optimize the training procedure.


Doing this, instead of passing through all 5,000,000 examples before making a single update, we perform five thousand mini-batch updates per epoch. Using the graph below we can analyze the behavior of mini-batch gradient descent and its effect on the computed cost.

[Figure: cost vs. iterations for batch gradient descent and mini-batch gradient descent]

The cost function descends toward zero in both graphs, but using mini-batches we start to see a lot of noise; this happens because the gradient computed on each batch can be different from one batch to the next. If you want to smooth this curve you may use the batch normalization presented before.
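A rough sketch of the batching idea, using a small synthetic linear-regression problem so the example stays self-contained; the data, model and learning rate are made up for illustration, and the article's 5,000,000-example set would be split the same way:

```python
import numpy as np

# Toy data: a 5,000,000-example set would follow the same pattern; we use fewer
# examples here so the sketch runs quickly. batch_size = 1000 matches the article.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50_000)

w = np.zeros(3)
alpha = 0.1           # learning rate
batch_size = 1000

for epoch in range(5):
    perm = rng.permutation(len(X))                   # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)    # gradient of MSE on this batch
        w -= alpha * grad                            # one update per mini-batch
    cost = np.mean((X @ w - y) ** 2)
    print(f"epoch {epoch}: cost {cost:.5f}")
```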

Gradient descent with momentum

First, we must talk about the EWMA. But what is that? The Exponentially Weighted Moving Average (EWMA) is a quantitative or statistical measure used to model or describe a time series. The EWMA is widely used in finance, its main applications being technical analysis and volatility modeling.

The moving average is designed so that older observations are given lower weights. The weights fall exponentially as the data point gets older, hence the name exponentially weighted.

The only decision a user of the EWMA must make is the parameter alpha. This parameter decides how important the current observation is in the calculation of the EWMA: the higher the value of alpha, the more closely the EWMA tracks the original time series. A small sketch of the plain EWMA follows, and the equations after it show how the same idea is applied to gradient descent.
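As an illustration (not from the original article), a minimal NumPy sketch of the plain EWMA on a noisy series; the signal and the two alpha values are arbitrary choices:

```python
import numpy as np

def ewma(series, alpha):
    """Exponentially weighted moving average: v_t = alpha * x_t + (1 - alpha) * v_{t-1}."""
    v = np.zeros_like(series, dtype=float)
    v[0] = series[0]
    for t in range(1, len(series)):
        v[t] = alpha * series[t] + (1 - alpha) * v[t - 1]
    return v

# Noisy signal: a higher alpha tracks the raw series closely, a lower alpha smooths it
x = np.sin(np.linspace(0, 6, 200)) + np.random.default_rng(1).normal(scale=0.3, size=200)
smooth = ewma(x, alpha=0.1)
close = ewma(x, alpha=0.8)
print(smooth[:5], close[:5])
```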

Vdw = β · Vdw + (1 − β) · dw
Vdb = β · Vdb + (1 − β) · db

Here beta is the weight decided by the user (0 < β < 1), dw and db represent the current gradient values at that point, and Vdw and Vdb are the exponentially weighted values. Now, to apply this to gradient descent, we follow the next formula.

W = W − α · Vdw
b = b − α · Vdb

where α is the learning rate.
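Putting the two equations together, a small self-contained sketch of gradient descent with momentum on a toy quadratic cost; the cost function, learning rate and number of steps are assumptions made for the example:

```python
# Toy cost: J(W, b) = (W - 3)^2 + (b + 1)^2, so the optimum is W = 3, b = -1
def compute_gradients(W, b):
    return 2 * (W - 3), 2 * (b + 1)   # dW, db

alpha = 0.1     # learning rate
beta = 0.9      # momentum weight (0 < beta < 1)
W, b = 0.0, 0.0
V_dW, V_db = 0.0, 0.0

for step in range(100):
    dW, db = compute_gradients(W, b)
    # Exponentially weighted averages of the gradients
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    # The update uses the smoothed gradients instead of the raw ones
    W -= alpha * V_dW
    b -= alpha * V_db

print(W, b)   # approaches 3 and -1
```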

RMSProp optimization

Root Mean Squared Propagation, or RMSProp, is an extension of gradient descent and of the AdaGrad version of gradient descent that uses a decaying average of partial gradients to adapt the step size for each parameter. The decaying moving average allows the algorithm to forget early gradients and focus on the most recently observed partial gradients during the progress of the search, overcoming the limitation of AdaGrad. Following the concept of gradient descent with momentum, we just change the equations a little bit:

Sdw = β · Sdw + (1 − β) · dw^2
Sdb = β · Sdb + (1 − β) · db^2
W = W − α · dw / (√Sdw + ε)
b = b − α · db / (√Sdb + ε)

where Sdw and Sdb are the exponentially weighted averages of the squared gradients and ε is a small constant that prevents division by zero.
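Under the same toy setup as before, a sketch of the RMSProp update; the cost function and hyperparameter values are illustrative choices, not values from the article:

```python
import numpy as np

# Toy cost: J(W, b) = (W - 3)^2 + (b + 1)^2
def compute_gradients(W, b):
    return 2 * (W - 3), 2 * (b + 1)

alpha, beta, eps = 0.01, 0.999, 1e-8
W, b = 0.0, 0.0
S_dW, S_db = 0.0, 0.0

for step in range(2000):
    dW, db = compute_gradients(W, b)
    # Decaying average of the *squared* gradients
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # Each parameter gets its own effective step size
    W -= alpha * dW / (np.sqrt(S_dW) + eps)
    b -= alpha * db / (np.sqrt(S_db) + eps)

print(W, b)   # approaches 3 and -1 (within roughly one step size)
```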

Adam optimization

The Adam optimization algorithm is an extension of momentum gradient descent and RMSProp that has recently seen broader adoption for deep learning applications in computer vision and natural language processing. First, we apply a bias correction to compensate for the zero initialization of the running averages during the early iterations; the two previous sets of equations stay the same.

Vdw_corrected = Vdw / (1 − β1^t)
Vdb_corrected = Vdb / (1 − β1^t)
Sdw_corrected = Sdw / (1 − β2^t)
Sdb_corrected = Sdb / (1 − β2^t)

where t is the current iteration, β1 is the momentum weight and β2 is the RMSProp weight.

Now that we have our parameters corrected by the bias approximation, we proceed to apply the gradient descent update using the following equations.

W = W − α · Vdw_corrected / (√Sdw_corrected + ε)
b = b − α · Vdb_corrected / (√Sdb_corrected + ε)

This procedure allows us to get much closer to the optimal point of our network, but we must define a larger number of parameters. These are called hyperparameters: values that cannot be obtained by the mathematical treatment and instead must be proposed by the coder. Common guidelines for their selection are: alpha must be tuned, as with the other two methods; beta one close to 0.9; beta two close to 0.999; and epsilon around 10^-8.
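A minimal sketch of Adam tying the two running averages, the bias correction and the update together, using the suggested defaults above and the same toy cost as in the earlier sketches:

```python
import numpy as np

# Toy cost: J(W, b) = (W - 3)^2 + (b + 1)^2
def compute_gradients(W, b):
    return 2 * (W - 3), 2 * (b + 1)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # suggested defaults
W, b = 0.0, 0.0
V_dW = V_db = S_dW = S_db = 0.0

for t in range(1, 501):
    dW, db = compute_gradients(W, b)
    # Momentum-style average of the gradients (first moment)
    V_dW = beta1 * V_dW + (1 - beta1) * dW
    V_db = beta1 * V_db + (1 - beta1) * db
    # RMSProp-style average of the squared gradients (second moment)
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    S_db = beta2 * S_db + (1 - beta2) * db ** 2
    # Bias correction for the zero initialization
    V_dW_c, V_db_c = V_dW / (1 - beta1 ** t), V_db / (1 - beta1 ** t)
    S_dW_c, S_db_c = S_dW / (1 - beta2 ** t), S_db / (1 - beta2 ** t)
    # Parameter update
    W -= alpha * V_dW_c / (np.sqrt(S_dW_c) + eps)
    b -= alpha * V_db_c / (np.sqrt(S_db_c) + eps)

print(W, b)   # converges toward 3 and -1
```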

Learning rate decay

Learning rate decay is a technique for training modern neural networks. It starts training the network with a large learning rate and then slowly reduces/decays it until a local minimum is reached. It is empirically observed to help both optimization and generalization.

[Figure: gradient descent iterations with a constant learning rate]

In the first image, where we have a constant learning rate, the steps taken by our algorithm while iterating toward the minimum are so noisy that after a certain number of iterations it seems to wander around the minimum and never actually converge.

[Figure: gradient descent iterations with a learning rate that decays over time (green line)]

But in the second image, where the learning rate is reduced over time (represented by the green line), learning is still relatively fast at the beginning because the learning rate is large; as we approach the minimum the learning rate gets smaller and smaller, so we end up oscillating in a tighter region around the minimum instead of wandering far away from it.

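As an illustration, a sketch of one common decay schedule, alpha = alpha0 / (1 + decay_rate * epoch), applied to plain gradient descent on a toy one-parameter problem; the particular schedule and hyperparameter values are assumptions, since the article does not specify them:

```python
# Toy cost: J(W) = (W - 3)^2
def gradient(W):
    return 2 * (W - 3)

alpha0 = 0.8        # large initial learning rate
decay_rate = 1.0    # hypothetical decay hyperparameter
W = 0.0

for epoch in range(20):
    alpha = alpha0 / (1 + decay_rate * epoch)   # learning rate shrinks each epoch
    W -= alpha * gradient(W)
    print(f"epoch {epoch}: alpha {alpha:.3f}, W {W:.4f}")   # W settles near 3
```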




