Role of optimiser in machine learning
Deepak Kumar
Why needed?
In machine learning, the learning rate determines whether and how quickly training converges. If your learning rate is set too low, training will progress very slowly because you are making only tiny updates to the weights in your network. However, if your learning rate is set too high, it can cause undesirable divergent behaviour in your loss function.
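To see this concretely, here is a toy sketch (not from the article; the function, starting point and learning rates are arbitrary choices for illustration) that minimises f(w) = w² with plain gradient descent:

```python
# Minimise f(w) = w**2 with plain gradient descent; the gradient is 2*w.
def gradient_descent(lr, steps=20, w=5.0):
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(gradient_descent(lr=0.001))  # too low: w barely moves from 5 towards the minimum at 0
print(gradient_descent(lr=0.1))    # reasonable: w ends up close to 0
print(gradient_descent(lr=1.1))    # too high: updates overshoot and |w| grows without bound
```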
So how do we find a good learning rate and keep it appropriate as training proceeds? An optimiser is the answer.
Technical explanation
One of the key hyperparameters to set when training a neural network is the learning rate for gradient descent. A common remedy is a learning rate schedule that changes the rate over the course of training, but the issue with learning rate schedules is that they depend on hyperparameters that must be chosen manually for each training run and may vary greatly depending on the problem at hand or the model used. To combat this, there are many adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop and Adam, which are built into deep learning libraries such as Keras; a detailed list of optimisers is available in the Keras documentation (see References).
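As a minimal sketch of what this looks like in practice (the model architecture and hyperparameter values below are placeholder assumptions, not from the article), switching between these optimisers in Keras is a one-line change when compiling the model:

```python
from tensorflow import keras

# A small placeholder model; the architecture is only for illustration.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])

# Any of the adaptive optimisers can be plugged in here, e.g.
# keras.optimizers.Adagrad, Adadelta, RMSprop or Adam, each with its own hyperparameters.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")
```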
Adam optimiser
Adam (Adaptive Moment Estimation) is the most popular adaptive optimiser today.
Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients v_t, like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum.
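For reference, a single Adam update can be sketched in NumPy roughly as follows (using the standard defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 from the original paper; this is an illustrative sketch, not library code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta, given gradient grad at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of past gradients (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of past squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v
```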
Point to remember
The optimiser addresses the gradient descent problem of choosing and adapting the learning rate α in the update θ ← θ − α∇J(θ); it does not solve the vanishing gradient problem.
References
Time to thank these helping hands
https://www.jeremyjordan.me/nn-learning-rate/
https://en.wikipedia.org/wiki/Learning_rate
https://towardsdatascience.com/gradient-descent-algorithms-and-adaptive-learning-rate-adjustment-methods-79c701b086be
https://keras.io/api/optimizers/