Knowing what makes ML training converge
Deepak Kumar
Why read this?
An ML algorithm is said to converge (learn) when, as the iterations proceed, its output gets closer and closer to a specific value. In some circumstances an algorithm will instead diverge: its output undergoes larger and larger oscillations, never approaching a useful result.
If you are interested in knowing the conditions under which convergence (or divergence) happens, this article will help.
Technical explanation
Consider an example of a regression model. Say we have N data points, and we want our model to predict continuous values. With each prediction the model makes, there is a loss (error) associated with it.
The loss (error) is the amount by which the real value and the predicted value differ, so the basic aim of any model is to minimise this loss. In the picture below, the algorithm is reducing the loss (note that the error values keep going down).
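As a concrete illustration, here is a minimal sketch (using NumPy, with made-up numbers) of how a regression loss such as mean squared error is computed from real and predicted values:

```python
import numpy as np

# Hypothetical example: true target values and the model's predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.4, 6.5, 9.6])

# Mean squared error: average of the squared differences
# between real and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE loss: {mse:.4f}")  # training tries to drive this value down
```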
Convergence approach
When formulating a problem in machine learning, we need to come up with a loss function, which takes the model weights as parameters. Back-propagation starts at an arbitrary point on the error surface defined by the loss function and, with every iteration, moves closer to a point that minimises the error by updating the weights. Essentially, for every possible set of weights the model can have there is an associated loss value, and our goal is to find the minimum point on this surface.
This approach leads to a gradual reduction of the loss, as shown in the picture below (notice the descending slope).
Convergence methods
Gradient descent approach
Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function. The loss function describes how well the model will perform given the current set of parameters (weights and biases), and gradient descent is used to find the best set of parameters.
Gradient descent is slow for large training sets because it uses all the training data to calculate the loss (and its gradient) at every single update step.
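A minimal sketch of full-batch gradient descent, assuming a toy linear-regression setup (the data and constants here are made up purely for illustration). Notice that every update touches all N data points:

```python
import numpy as np

# Toy data: y = 2x + 1 plus a little noise (synthetic, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2 * X + 1 + rng.normal(0, 0.5, size=100)

w, b = 0.0, 0.0          # start from an arbitrary point on the loss surface
lr = 0.01                # learning rate
for step in range(2000):
    y_pred = w * X + b
    # Gradients of the mean-squared-error loss w.r.t. w and b,
    # computed over the WHOLE training set (this is what makes it slow for large N)
    grad_w = -2 * np.mean((y - y_pred) * X)
    grad_b = -2 * np.mean(y - y_pred)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should end up close to 2 and 1 as the loss converges
```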
Stochastic gradient descent
Stochastic gradient descent is a very popular and common algorithm used across machine learning and, most importantly, it forms the basis of training neural networks. A neural network model is trained using the stochastic gradient descent optimisation algorithm, and the weights are updated using the back-propagation-of-error algorithm.
It is similar to gradient descent above, with the change that the loss is calculated on a random subset of the data (a mini-batch). Because each update touches only a small fraction of the data, this approach is relatively faster, as the sketch below shows.
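A minimal sketch of the mini-batch variant, using the same kind of toy data as the full-batch example above (batch size and other numbers are assumptions for illustration). The only change is that each update uses a small random subset of the data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2 * X + 1 + rng.normal(0, 0.5, size=100)

w, b = 0.0, 0.0
lr, batch_size = 0.01, 16
for step in range(2000):
    # Pick a random mini-batch instead of the whole training set
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    y_pred = w * Xb + b
    grad_w = -2 * np.mean((yb - y_pred) * Xb)
    grad_b = -2 * np.mean(yb - y_pred)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # noisier path than full-batch GD, but each step is much cheaper
```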
Conjugate gradient approach
In the conjugate gradient method we move like the rook in chess: either horizontally or vertically, never diagonally. Each new search direction is kept conjugate to the previous ones, so this approach is often faster than plain gradient descent.
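In practice one rarely hand-codes conjugate gradient; a quick way to try it is through SciPy's general-purpose optimiser. A small sketch, assuming a simple convex quadratic loss chosen only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# A simple convex quadratic loss in two weights (illustrative only)
def loss(w):
    return (w[0] - 3) ** 2 + 10 * (w[1] + 1) ** 2

# Nonlinear conjugate gradient: successive search directions are conjugate,
# so it typically needs far fewer iterations than plain gradient descent
result = minimize(loss, x0=np.zeros(2), method="CG")
print(result.x)  # should be close to [3, -1]
```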
Convergence criteria
Learning rate
The learning rate should be neither very low nor very high: too low and training crawls, too high and the loss oscillates or diverges. There are optimisers like Adam which adjust the learning rate adaptively to handle this requirement. Refer here for more detail on this.
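A small sketch of why the learning rate matters, on a toy quadratic loss (the numbers are chosen only for illustration): a rate that is too high makes the updates overshoot and diverge, while a very low rate converges but crawls.

```python
def run_gd(lr, steps=50):
    """Gradient descent on loss(w) = w**2, starting from w = 10."""
    w = 10.0
    for _ in range(steps):
        w -= lr * 2 * w      # the gradient of w**2 is 2w
    return w

print(run_gd(lr=0.001))  # too low: still far from the minimum at 0 after 50 steps
print(run_gd(lr=0.1))    # reasonable: very close to 0
print(run_gd(lr=1.1))    # too high: |w| blows up -- the training diverges
```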
Convergence minima
For good accuracy, ML training should converge to the global minimum. Whether the global minimum can be attained depends on the loss function of the ML model at hand.
Attaining global minima
If the loss function is convex, then the algorithm will converge to the global minimum, because a convex function has only one minimum. So algorithms that use a convex loss function naturally guarantee the global minimum. For example, the least-squares loss (used in linear regression) is convex, and so it guarantees the global minimum. Refer here for a discussion. The same is the case with the logistic loss function (refer here).
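For linear regression this can be checked directly: the least-squares loss is a convex bowl in the weights, so gradient descent and the closed-form (normal-equation) solution land on the same unique minimum. A rough numerical sketch, assuming the same kind of toy data as in the earlier examples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, size=100)

# Closed-form least-squares solution (normal equations): the unique global minimum
A = np.hstack([X, np.ones((len(X), 1))])        # design matrix with a bias column
w_star, b_star = np.linalg.lstsq(A, y, rcond=None)[0]
print(w_star, b_star)  # gradient descent on this convex loss converges to the same point
```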
What about the global minimum for non-convex loss functions?
Neural network loss functions are not necessarily convex. This paper talks about conditions for attaining the global minimum for neural networks. Another paper talks about conditions under which a neural network reaches the global minimum.
If the learning rate is reduced slowly during training, it helps the model settle into the global minimum rather than bouncing around it. Optimisers like Adam provide an adaptive learning rate. Refer to this article for detail.
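A minimal sketch of one common way to reduce the learning rate slowly: an inverse-time decay schedule (the constants here are arbitrary, chosen only for illustration).

```python
def decayed_lr(initial_lr, step, decay_rate=0.01):
    """Inverse-time decay: the learning rate shrinks gradually as training proceeds."""
    return initial_lr / (1.0 + decay_rate * step)

for step in (0, 100, 1000, 10000):
    print(step, decayed_lr(0.1, step))
# Large steps early on, small careful steps later -- this helps training
# settle into a good minimum instead of bouncing around it.
```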
Point to remember
Loss function, cost function and objective function are closely related concepts with different names, and in practice the terms are often used interchangeably.
References
Thanks to these helping hands
https://images.app.goo.gl/ysYRoS1zjQFq7Y8W8
https://www.kdnuggets.com/2020/05/5-concepts-gradient-descent-cost-function.html
https://www.dhirubhai.net/pulse/optimization-convergence-machine-learning-algorithms-amlesh-kanekar/
https://ai.stackexchange.com/questions/16348/what-is-convergence-in-machine-learning
https://www.quora.com/What-is-convergence-in-the-context-of-Machine-Learning
https://images.app.goo.gl/VPH6gSXBiwvPknr48
https://images.app.goo.gl/y35fp6Ti4AYpPA6C9
https://images.app.goo.gl/yfz1dRWEp14bX5dq6
https://images.app.goo.gl/DxBZbZjRp4cS2sfU7
https://images.app.goo.gl/QbdRvEHHikGkcxPL6
https://medium.com/swlh/a-short-introduction-on-conjugate-gradien-d7faec192c4b
https://youtu.be/toruVYsc2mU?t=184
https://www.dhirubhai.net/posts/dpkumar_optimization-machinelearning-datascience-activity-6751430811356147712-oYo6
https://www.quora.com/Does-Gradient-Descent-Algo-always-converge-to-the-global-minimum
https://images.app.goo.gl/kmdt7nSaGm2EA4iF6
https://math.stackexchange.com/questions/2774106/why-is-the-least-square-cost-function-for-linear-regression-convex
https://www.cs.cmu.edu/~yandongl/loss.html
https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31
https://youtu.be/vMh0zPT0tLI
https://datascience.stackexchange.com/questions/36450/what-is-the-difference-between-gradient-descent-and-stochastic-gradient-descent
https://images.app.goo.gl/TA9C9kdhehdztt9x8
https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/
https://arxiv.org/pdf/1811.03804.pdf
https://openreview.net/pdf?id=BJk7Gf-CZ
https://stats.stackexchange.com/questions/90874/how-can-stochastic-gradient-descent-avoid-the-problem-of-a-local-minimum
https://youtu.be/ugOgmoeUAVs
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291