Knowing what makes ML training converge

[Cover image: a typical energy landscape with several local minima. Source: https://www.researchgate.net/figure/A-typical-energy-landscape-depicting-position-of-several-local-minima-which-indicate-the_fig3_45900660]

Why read this?

An ML algorithm is said to converge (learn) when, as the iterations proceed, its output gets closer and closer to a specific value. In some circumstances, an algorithm will instead diverge: its output undergoes larger and larger oscillations and never approaches a useful result.

If you are interested in the conditions under which training converges, this document will help.


Technical explanation


Consider a regression model. Say we are given N data points and we want the model to predict continuous values. With each prediction the model makes, there is a loss (error) associated with it.

The loss (error) is the amount by which the real value and the predicted value differ, so the basic aim of any model is to minimise this loss. In the picture below, the algorithm is reducing the loss: note that the error values are going down.

[Image: training error values decreasing as iterations proceed]
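
As a concrete illustration of the loss described above, here is a minimal Python sketch (using numpy; the data values are made up for illustration) that computes the mean squared error between real and predicted values:

import numpy as np

# Hypothetical ground-truth values and model predictions for N = 5 data points
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.5, 9.4, 10.6])

# Loss/error: mean squared difference between real and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(f"Mean squared error: {mse:.4f}")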


Convergence approach


When formulating a problem in machine learning, we need to come up with a loss function that takes the model weights as parameters. Back-propagation starts at an arbitrary point on the error manifold defined by this loss function and, with every iteration, moves closer to a point that minimises the error by updating the weights. Essentially, for every possible set of weights the model can have there is an associated loss value, and our goal is to find the minimum point on this manifold.

This approach leads to a gradual reduction of the loss, as the descending slope in the picture below shows.

[Image: loss curve descending gradually over training iterations]
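
To make the idea of an error manifold concrete, the sketch below (an illustrative assumption, not taken from the article) evaluates the mean-squared-error loss of a one-feature linear model y = w*x + b over a grid of weight values; every (w, b) pair has an associated loss, and training searches this surface for its lowest point:

import numpy as np

# Toy data generated from y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(50)

# Evaluate the MSE loss for every (w, b) pair on a grid: this is the error manifold
ws = np.linspace(0, 4, 101)
bs = np.linspace(-1, 3, 101)
W, B = np.meshgrid(ws, bs)
loss = ((W[..., None] * x + B[..., None] - y) ** 2).mean(axis=-1)

# The grid point with the smallest loss approximates the minimum of the manifold
i, j = np.unravel_index(loss.argmin(), loss.shape)
print(f"Lowest loss on grid at w={W[i, j]:.2f}, b={B[i, j]:.2f}")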


Convergence methods

Gradient descent approach

Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function. The loss function describes how well the model will perform given the current set of parameters (weights and biases), and gradient descent is used to find the best set of parameters.


Gradient descent is slow for large training sets because it uses all of the training data to compute the loss and its gradient at every iteration.

[Image: illustration of gradient descent moving down the loss surface]
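
A minimal sketch of (batch) gradient descent for a linear-regression problem, written in plain Python/numpy; note how every update uses all N training points, which is what makes the method slow for large training sets (the data and hyperparameters here are illustrative assumptions):

import numpy as np

# Toy training data: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0          # arbitrary starting point on the error manifold
lr = 0.1                 # learning rate
for step in range(500):
    y_pred = w * x + b
    # Gradients of the MSE loss, computed over ALL training points
    grad_w = 2 * np.mean((y_pred - y) * x)
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}")   # should approach w=2, b=1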


  

Stochastic gradient descent

Stochastic gradient descent is a very popular algorithm used across machine learning and, most importantly, forms the basis of training neural networks: a neural network is trained using the stochastic gradient descent optimisation algorithm, with the weights updated by the back-propagation of error.

It is similar to gradient descent above, with the change that the loss is calculated on a random subset of the data (a mini-batch). Because of this, each iteration is much faster.

[Image: illustration of stochastic gradient descent taking noisy steps towards the minimum]
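
A minimal sketch of the stochastic (mini-batch) variant of the loop above; the only change is that each update uses a small random subset of the data rather than all of it (batch size and other settings are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32
for step in range(2000):
    # Loss/gradient computed on a random mini-batch, not the full data set
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    y_pred = w * xb + b
    grad_w = 2 * np.mean((y_pred - yb) * xb)
    grad_b = 2 * np.mean(y_pred - yb)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}")   # noisy, but still approaches w=2, b=1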


Conjugate gradient approach


In conjugate gradient we move like the rook in chess: along one search direction at a time, never diagonally. Successive search directions are chosen to be conjugate to each other, so a new step does not undo the progress made along previous directions. This approach is often faster than plain gradient descent.

[Image: illustration of the conjugate gradient path on a loss surface]
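
In practice you rarely implement conjugate gradient by hand. As a sketch, scipy exposes a nonlinear conjugate-gradient optimiser that can minimise the same least-squares loss; using scipy.optimize.minimize with method="CG" is an assumption about your environment, not something taken from the article:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

def loss(params):
    w, b = params
    return np.mean((w * x + b - y) ** 2)

# Nonlinear conjugate-gradient minimisation starting from an arbitrary point
result = minimize(loss, x0=np.array([0.0, 0.0]), method="CG")
print(result.x)   # should be close to [2, 1]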



Convergence criteria

Learning rate

The learning rate should be neither very low nor very high: too low and training crawls, too high and the loss oscillates or diverges. Optimisers like Adam adjust the learning rate adaptively to handle this requirement; see the references at the end for more detail.

[Image: effect of the learning rate on convergence]
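
A small sketch of why the learning rate matters: the same gradient-descent loop applied to a simple quadratic loss f(w) = w**2 with three different learning rates, too small (barely moves), reasonable (converges), and too large (diverges). The specific rates are illustrative assumptions:

def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in (0.001, 0.1, 1.5):
    print(f"lr={lr:<6} final w = {gradient_descent(lr):.4f}")
# lr=0.001 barely moves, lr=0.1 converges towards 0, lr=1.5 oscillates and diverges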



Convergence minima


For good accuracy, ML training should converge to the global minimum. Whether the global minimum can be attained depends on the loss function of the ML model at hand.


[Image: loss landscape with several local minima and one global minimum]


Attaining the global minimum

If the loss function is convex, the algorithm will converge to the global minimum: a convex function has only one minimum. So algorithms that use a convex loss function naturally guarantee the global minimum. For example, the least-squares loss used in linear regression is convex, so it guarantees a global minimum (see the references for a discussion). The same holds for the logistic loss function.

[Image: a convex loss function with a single global minimum]
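
One way to see numerically that the least-squares loss is convex: its Hessian with respect to the weight vector is 2 * X^T X / N, which is positive semi-definite for any data matrix X. The small check below is an illustrative sketch (with a made-up design matrix), not a proof:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))        # arbitrary design matrix (N=100, 3 features)

# Hessian of the mean-squared-error loss with respect to the weight vector
H = 2 * X.T @ X / len(X)

# All eigenvalues are >= 0, so the loss surface is convex (a single global minimum)
print(np.linalg.eigvalsh(H))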



What about the global minimum for non-convex loss functions?

Neural-network loss functions are not necessarily convex. The papers linked in the references below discuss conditions under which neural-network training still reaches a global minimum.


Reducing the learning rate slowly over training helps the model settle into a good (ideally global) minimum. Optimisers like Adam provide an adaptive learning rate; see the references for details.
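
A sketch of a simple learning-rate decay schedule added to the mini-batch SGD loop from earlier; the exponential decay factor is an illustrative assumption, and libraries such as PyTorch or TensorFlow provide ready-made schedulers and the Adam optimiser:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0
lr0, decay = 0.1, 0.999          # initial learning rate and per-step decay factor
for step in range(2000):
    lr = lr0 * decay ** step     # learning rate is slowly reduced over training
    idx = rng.choice(len(x), size=32, replace=False)
    xb, yb = x[idx], y[idx]
    grad_w = 2 * np.mean((w * xb + b - yb) * xb)
    grad_b = 2 * np.mean(w * xb + b - yb)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}")   # less jitter near the minimum than with a fixed rate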

Point to remember

Loss function, cost function, and objective function are essentially the same concept under different names.

Reference
Thanks to these helping hands
https://images.app.goo.gl/ysYRoS1zjQFq7Y8W8

https://www.kdnuggets.com/2020/05/5-concepts-gradient-descent-cost-function.html

https://www.dhirubhai.net/pulse/optimization-convergence-machine-learning-algorithms-amlesh-kanekar/

https://ai.stackexchange.com/questions/16348/what-is-convergence-in-machine-learning

https://www.quora.com/What-is-convergence-in-the-context-of-Machine-Learning

https://images.app.goo.gl/VPH6gSXBiwvPknr48

https://images.app.goo.gl/y35fp6Ti4AYpPA6C9

https://images.app.goo.gl/yfz1dRWEp14bX5dq6

https://images.app.goo.gl/DxBZbZjRp4cS2sfU7

https://images.app.goo.gl/QbdRvEHHikGkcxPL6

https://medium.com/swlh/a-short-introduction-on-conjugate-gradien-d7faec192c4b

https://youtu.be/toruVYsc2mU?t=184

https://www.dhirubhai.net/posts/dpkumar_optimization-machinelearning-datascience-activity-6751430811356147712-oYo6

https://www.quora.com/Does-Gradient-Descent-Algo-always-converge-to-the-global-minimum

https://images.app.goo.gl/kmdt7nSaGm2EA4iF6

https://math.stackexchange.com/questions/2774106/why-is-the-least-square-cost-function-for-linear-regression-convex

https://www.cs.cmu.edu/~yandongl/loss.html

https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31

https://youtu.be/vMh0zPT0tLI

https://datascience.stackexchange.com/questions/36450/what-is-the-difference-between-gradient-descent-and-stochastic-gradient-descent

https://images.app.goo.gl/TA9C9kdhehdztt9x8

https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/

https://arxiv.org/pdf/1811.03804.pdf

https://openreview.net/pdf?id=BJk7Gf-CZ

https://stats.stackexchange.com/questions/90874/how-can-stochastic-gradient-descent-avoid-the-problem-of-a-local-minimum

https://youtu.be/ugOgmoeUAVs

https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

