Optimization operations in supervised learning and hyperparameter choices

In today's world it is hard to imagine an industry or company that is not interested in implementing machine learning and artificial intelligence in its processes. At first glance, building proper models and training them correctly seems to require really powerful machines. That was true at the beginning of the past decade, but recent developments in optimization techniques and learning-rate schedules for neural networks have created a reality where anyone with the proper knowledge of them can be competitive.

Feature scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. It is also known as data normalization and is generally performed during the data preprocessing step.

Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

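As an illustration (not part of the original article), here is a minimal NumPy sketch of the two most common scaling schemes, min-max normalization and standardization; the function names and example values are just placeholders:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature (column) to the [0, 1] range."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Rescale each feature to zero mean and unit variance (z-score)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example: two features with very different ranges
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
print(min_max_scale(X))
print(standardize(X))
```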

Batch normalization

Batch normalization (also known as batch norm) is a method used to make artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling.

While the effect of batch normalization is evident, the reasons behind its effectiveness remain under discussion. It was believed that it can mitigate the problem of internal covariate shift, where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network. Recently, some scholars have argued that batch normalization does not reduce internal covariate shift, but rather smooths the objective function, which in turn improves the performance.

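A minimal sketch of the batch-norm forward pass for one fully connected layer, assuming a NumPy mini-batch of shape (batch, features); the names gamma, beta and eps follow the standard formulation and are not taken from the article:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then re-scale and re-center."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # re-centered and re-scaled inputs
    return gamma * x_hat + beta            # learnable scale and shift

# Example usage
x = np.random.randn(32, 4) * 10 + 3        # a mini-batch with shifted, scaled features
gamma = np.ones(4)
beta = np.zeros(4)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))   # roughly 0 and 1 per feature
```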

Mini-batch gradient descent

In simple words, normal gradient descent uses the gradient computed by the neural network, which lets us relate the parameters to a cost function and move toward the minimum error of the system we are working with. But what happens when the training set gets too big? Traditionally we would compute one correction with gradient descent over the whole training set, which takes too much time and processing power, so instead we split the data into batches to reduce the work per iteration. As an example, say we have a data set of 5,000,000 examples and we create batches of 1,000 to optimize the training procedure.


Doing this, instead of passing through all 5,000,000 examples before making a single update, we perform five thousand mini-batch updates per epoch. Using the graph below we can analyze the behavior of mini-batch gradient descent and its effect on the computed cost.

[Figure: cost vs. iterations for batch gradient descent and mini-batch gradient descent]

The cost function descends toward zero in both graphs, but using mini-batches we start to see a lot of noise; this happens because the gradient computed on each batch can be different from one batch to the next. If you want to smooth this curve you may use the batch normalization presented before.
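A rough sketch of the batching idea, using a small synthetic linear-regression problem so the example stays self-contained; the data, model and learning rate are made up for illustration, and the article's 5,000,000-example set would be split the same way:

```python
import numpy as np

# Toy data: a 5,000,000-example set would follow the same pattern; we use fewer
# examples here so the sketch runs quickly. batch_size = 1000 matches the article.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50_000)

w = np.zeros(3)
alpha = 0.1           # learning rate
batch_size = 1000

for epoch in range(5):
    perm = rng.permutation(len(X))                   # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)    # gradient of MSE on this batch
        w -= alpha * grad                            # one update per mini-batch
    cost = np.mean((X @ w - y) ** 2)
    print(f"epoch {epoch}: cost {cost:.5f}")
```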

Gradient descent with momentum

First, we must talk about the EWMA. But what is that? The Exponentially Weighted Moving Average (EWMA) is a quantitative or statistical measure used to model or describe a time series. The EWMA is widely used in finance, its main applications being technical analysis and volatility modeling.

The moving average is designed so that older observations are given lower weights. The weights fall exponentially as the data point gets older, hence the name exponentially weighted.

The only decision a user of the EWMA must make is the parameter alpha. This parameter decides how important the current observation is in the calculation of the EWMA: the higher the value of alpha, the more closely the EWMA tracks the original time series. A small sketch of the plain EWMA follows, and the equations after it show how the same idea is applied to gradient descent.
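As an illustration (not from the original article), a minimal NumPy sketch of the plain EWMA on a noisy series; the signal and the two alpha values are arbitrary choices:

```python
import numpy as np

def ewma(series, alpha):
    """Exponentially weighted moving average: v_t = alpha * x_t + (1 - alpha) * v_{t-1}."""
    v = np.zeros_like(series, dtype=float)
    v[0] = series[0]
    for t in range(1, len(series)):
        v[t] = alpha * series[t] + (1 - alpha) * v[t - 1]
    return v

# Noisy signal: a higher alpha tracks the raw series closely, a lower alpha smooths it
x = np.sin(np.linspace(0, 6, 200)) + np.random.default_rng(1).normal(scale=0.3, size=200)
smooth = ewma(x, alpha=0.1)
close = ewma(x, alpha=0.8)
print(smooth[:5], close[:5])
```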

Vdw = β · Vdw + (1 − β) · dw
Vdb = β · Vdb + (1 − β) · db

Here beta is the weight decided by the user (0 < β < 1), dw and db represent the current gradient values at that point, and Vdw and Vdb are the exponentially weighted values. Now, to apply this to gradient descent, we follow the next formula.

W = W − α · Vdw
b = b − α · Vdb

where α is the learning rate.
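Putting the two equations together, a small self-contained sketch of gradient descent with momentum on a toy quadratic cost; the cost function, learning rate and number of steps are assumptions made for the example:

```python
# Toy cost: J(W, b) = (W - 3)^2 + (b + 1)^2, so the optimum is W = 3, b = -1
def compute_gradients(W, b):
    return 2 * (W - 3), 2 * (b + 1)   # dW, db

alpha = 0.1     # learning rate
beta = 0.9      # momentum weight (0 < beta < 1)
W, b = 0.0, 0.0
V_dW, V_db = 0.0, 0.0

for step in range(100):
    dW, db = compute_gradients(W, b)
    # Exponentially weighted averages of the gradients
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    # The update uses the smoothed gradients instead of the raw ones
    W -= alpha * V_dW
    b -= alpha * V_db

print(W, b)   # approaches 3 and -1
```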

RMSProp optimization

Root Mean Squared Propagation, or RMSProp, is an extension of gradient descent and of the AdaGrad version of gradient descent that uses a decaying average of partial gradients to adapt the step size for each parameter. The decaying moving average allows the algorithm to forget early gradients and focus on the most recently observed partial gradients during the progress of the search, overcoming the limitation of AdaGrad. Following the concept of gradient descent with momentum, we just change the equations a little bit:

Sdw = β · Sdw + (1 − β) · dw^2
Sdb = β · Sdb + (1 − β) · db^2
W = W − α · dw / (√Sdw + ε)
b = b − α · db / (√Sdb + ε)

where Sdw and Sdb are the exponentially weighted averages of the squared gradients and ε is a small constant that prevents division by zero.
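Under the same toy setup as before, a sketch of the RMSProp update; the cost function and hyperparameter values are illustrative choices, not values from the article:

```python
import numpy as np

# Toy cost: J(W, b) = (W - 3)^2 + (b + 1)^2
def compute_gradients(W, b):
    return 2 * (W - 3), 2 * (b + 1)

alpha, beta, eps = 0.01, 0.999, 1e-8
W, b = 0.0, 0.0
S_dW, S_db = 0.0, 0.0

for step in range(2000):
    dW, db = compute_gradients(W, b)
    # Decaying average of the *squared* gradients
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # Each parameter gets its own effective step size
    W -= alpha * dW / (np.sqrt(S_dW) + eps)
    b -= alpha * db / (np.sqrt(S_db) + eps)

print(W, b)   # approaches 3 and -1 (within roughly one step size)
```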

Adam optimization

The Adam optimization algorithm is an extension of momentum gradient descent and RMSProp that has recently seen broader adoption for deep learning applications in computer vision and natural language processing. First, we apply a bias correction to compensate for the zero initialization of the running averages during the early iterations; the two previous sets of equations stay the same.

Vdw_corrected = Vdw / (1 − β1^t)
Vdb_corrected = Vdb / (1 − β1^t)
Sdw_corrected = Sdw / (1 − β2^t)
Sdb_corrected = Sdb / (1 − β2^t)

where t is the current iteration, β1 is the momentum weight and β2 is the RMSProp weight.

Now that we have our parameters corrected by the bias approximation, we proceed to apply the gradient descent update using the following equations.

W = W − α · Vdw_corrected / (√Sdw_corrected + ε)
b = b − α · Vdb_corrected / (√Sdb_corrected + ε)

This procedure allows us to get much closer to the optimal point of our network, but we must define a larger number of parameters. These are called hyperparameters: values that cannot be obtained by the mathematical treatment and instead must be proposed by the coder. Common guidelines for their selection are: alpha must be tuned, as with the other two methods; beta one close to 0.9; beta two close to 0.999; and epsilon around 10^-8.
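A minimal sketch of Adam tying the two running averages, the bias correction and the update together, using the suggested defaults above and the same toy cost as in the earlier sketches:

```python
import numpy as np

# Toy cost: J(W, b) = (W - 3)^2 + (b + 1)^2
def compute_gradients(W, b):
    return 2 * (W - 3), 2 * (b + 1)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # suggested defaults
W, b = 0.0, 0.0
V_dW = V_db = S_dW = S_db = 0.0

for t in range(1, 501):
    dW, db = compute_gradients(W, b)
    # Momentum-style average of the gradients (first moment)
    V_dW = beta1 * V_dW + (1 - beta1) * dW
    V_db = beta1 * V_db + (1 - beta1) * db
    # RMSProp-style average of the squared gradients (second moment)
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    S_db = beta2 * S_db + (1 - beta2) * db ** 2
    # Bias correction for the zero initialization
    V_dW_c, V_db_c = V_dW / (1 - beta1 ** t), V_db / (1 - beta1 ** t)
    S_dW_c, S_db_c = S_dW / (1 - beta2 ** t), S_db / (1 - beta2 ** t)
    # Parameter update
    W -= alpha * V_dW_c / (np.sqrt(S_dW_c) + eps)
    b -= alpha * V_db_c / (np.sqrt(S_db_c) + eps)

print(W, b)   # converges toward 3 and -1
```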

Learning rate decay

Learning rate decay is a technique for training modern neural networks. It starts training the network with a large learning rate and then slowly reduces/decays it until a local minimum is reached. It is empirically observed to help both optimization and generalization.

[Figure: gradient descent iterations with a constant learning rate]

In the first image, where we have a constant learning rate, the steps taken by our algorithm while iterating toward the minimum are so noisy that after a certain number of iterations it seems to wander around the minimum and never actually converge.

[Figure: gradient descent iterations with a learning rate that decays over time (green line)]

But in the second image, where the learning rate is reduced over time (represented by the green line), learning is still relatively fast at the beginning because the learning rate is large; as we approach the minimum the learning rate gets smaller and smaller, so we end up oscillating in a tighter region around the minimum instead of wandering far away from it.

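As an illustration, a sketch of one common decay schedule, alpha = alpha0 / (1 + decay_rate * epoch), applied to plain gradient descent on a toy one-parameter problem; the particular schedule and hyperparameter values are assumptions, since the article does not specify them:

```python
# Toy cost: J(W) = (W - 3)^2
def gradient(W):
    return 2 * (W - 3)

alpha0 = 0.8        # large initial learning rate
decay_rate = 1.0    # hypothetical decay hyperparameter
W = 0.0

for epoch in range(20):
    alpha = alpha0 / (1 + decay_rate * epoch)   # learning rate shrinks each epoch
    W -= alpha * gradient(W)
    print(f"epoch {epoch}: alpha {alpha:.3f}, W {W:.4f}")   # W settles near 3
```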




