Nesterov Accelerated Gradient Descent
Dr. A. Sumithra Gavaskar
Associate Professor at SNS College of Technology, Research Co-ordinator, Dept. of CSE
Gradient descent
It is essential to understand gradient descent before we look at the Nesterov Accelerated Gradient algorithm. Gradient descent is an optimization algorithm that is used to train our model. The performance of a machine learning model is measured by its cost function: the lower the cost, the better our ML model is performing. Optimization algorithms are used to reach the minimum point of the cost function, and gradient descent is the most common of them. It starts from some initial parameter values and then changes them iteratively to reach the minimum point of the cost function.
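As a rough illustration of that update rule, here is a minimal sketch in Python; the one-parameter cost function and the starting value are hypothetical choices for demonstration, not taken from the article.

```python
# Toy cost function with its minimum at w = 3 (illustrative choice only).
def cost(w):
    return (w - 3.0) ** 2

# Gradient (derivative) of the cost with respect to w.
def gradient(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # step-size hyperparameter
w = 0.0               # some initial weight

for step in range(100):
    w = w - learning_rate * gradient(w)   # move against the gradient

print(w, cost(w))     # w ends up very close to 3, the minimum of the cost
```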
We start with some initial weight, which places us at some point on our cost function. Gradient descent then tweaks the weight in each iteration, and we move towards the minimum of our cost function accordingly.
The size of our steps depends on the learning rate of our model. The higher the learning rate, the larger the step size. Choosing the correct learning rate for our model is very important, as a poor choice can cause problems while training.
A low learning rate ensures we reach the minimum point, but it takes many iterations to train, while a very high learning rate can cause us to jump past the minimum point, a problem commonly known as overshooting.
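A quick way to see this trade-off is to run the same loop with different learning rates on an assumed quadratic cost, cost(w) = w², whose gradient is 2w and whose minimum is at w = 0; the numbers here are purely illustrative.

```python
# Gradient descent on cost(w) = w**2; only the learning rate changes per run.
def run_gd(learning_rate, steps=20, w=5.0):
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

print(run_gd(0.01))  # low rate: still far from 0 after 20 steps (slow training)
print(run_gd(0.4))   # suitable rate: very close to the minimum at 0
print(run_gd(1.1))   # too high: every step overshoots and |w| keeps growing
```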
Drawbacks of gradient descent
The main drawback of gradient descent is that the update depends only on the learning rate and the gradient at that particular step. On a plateau, or at the saddle points of our function, the gradient is close to zero, so the step size becomes very small or even zero. Thus, the update of our parameters is very slow on a gentle slope.
Let us look at an example. The starting point of our model is ‘A’. The loss function decreases rapidly along the path from A to B because of the higher gradient, but as the gradient shrinks from B to C, the learning becomes negligible. The gradient at point ‘C’ is zero; it is a saddle point of our function. Even after many iterations, we will be stuck at ‘C’ and will not reach the desired minimum ‘D’.
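To make this concrete, here is a small sketch under an assumed cost function cost(w) = w³, which has a saddle point at w = 0 where the gradient 3w² vanishes; the function, starting point, and hyperparameters are hypothetical, chosen only to mimic the A-to-D story above.

```python
# Gradient of the illustrative cost(w) = w**3.
def gradient(w):
    return 3.0 * w ** 2

learning_rate = 0.01
w = 1.0   # playing the role of the starting point 'A'

for step in range(10_000):
    w = w - learning_rate * gradient(w)

# Even after 10,000 iterations, w has only crept down to a small positive
# value: plain gradient descent stalls near the saddle point ('C') and never
# reaches the lower-cost region beyond it ('D').
print(w)
```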
Gradient descent with momentum
The issue discussed above can be solved by including the previous gradients in our calculation. The intuition behind this is that if we are repeatedly pushed in the same direction, we can take bigger steps in that direction.
A weighted average of all the previous gradients is added to our update equation, and it acts as momentum for our step.
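A minimal sketch of that idea, reusing the hypothetical w³ saddle-point example from above; the decay factor beta = 0.9 is a commonly used default, not a value given in the article.

```python
# Gradient of the illustrative cost(w) = w**3 (saddle point at w = 0).
def gradient(w):
    return 3.0 * w ** 2

learning_rate = 0.01
beta = 0.9        # how strongly previous gradients are remembered
w = 1.0           # starting point 'A' again
velocity = 0.0    # exponentially weighted sum of past gradients

for step in range(30):
    velocity = beta * velocity + gradient(w)   # accumulate past gradients
    w = w - learning_rate * velocity           # step using the accumulated momentum

# The accumulated momentum carries w past the saddle point at w = 0 into the
# lower-cost region, whereas plain gradient descent with the same settings
# would still be stuck at a positive value.
print(w)
```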