Knowing what makes ML training converge

[Cover image: a typical energy landscape with several local minima. Source: https://www.researchgate.net/figure/A-typical-energy-landscape-depicting-position-of-several-local-minima-which-indicate-the_fig3_45900660]

Why read this?

An ML algorithm is said to converge (learn) when, as the iterations proceed, its output gets closer and closer to a specific value. In some circumstances, an algorithm will instead diverge: its output undergoes larger and larger oscillations and never approaches a useful result.

If you are interested in the conditions under which training converges, this document will help.


Technical explanation


Consider a regression model. Say we are given N data points and we want the model to predict continuous values. With each prediction the model makes, there is a loss (error) associated with it.

The loss (error) is the amount by which the real value and the predicted value differ, so the basic aim of any model is to minimise this loss. In the picture below, the algorithm is reducing the loss: note that the error values are going down.

[Image: training error values decreasing as iterations proceed]
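
As a concrete illustration of the loss described above, here is a minimal Python sketch (using numpy; the data values are made up for illustration) that computes the mean squared error between real and predicted values:

import numpy as np

# Hypothetical ground-truth values and model predictions for N = 5 data points
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.5, 9.4, 10.6])

# Loss/error: mean squared difference between real and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(f"Mean squared error: {mse:.4f}")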


Convergence approach


When formulating a problem in machine learning, we need to come up with a loss function that takes the model weights as parameters. Back-propagation starts at an arbitrary point on the error manifold defined by this loss function and, with every iteration, moves closer to a point that minimises the error by updating the weights. Essentially, for every possible set of weights the model can have there is an associated loss value, and our goal is to find the minimum point on this manifold.

This approach leads to a gradual reduction of the loss, as the descending slope in the picture below shows.

[Image: loss curve descending gradually over training iterations]
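
To make the idea of an error manifold concrete, the sketch below (an illustrative assumption, not taken from the article) evaluates the mean-squared-error loss of a one-feature linear model y = w*x + b over a grid of weight values; every (w, b) pair has an associated loss, and training searches this surface for its lowest point:

import numpy as np

# Toy data generated from y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(50)

# Evaluate the MSE loss for every (w, b) pair on a grid: this is the error manifold
ws = np.linspace(0, 4, 101)
bs = np.linspace(-1, 3, 101)
W, B = np.meshgrid(ws, bs)
loss = ((W[..., None] * x + B[..., None] - y) ** 2).mean(axis=-1)

# The grid point with the smallest loss approximates the minimum of the manifold
i, j = np.unravel_index(loss.argmin(), loss.shape)
print(f"Lowest loss on grid at w={W[i, j]:.2f}, b={B[i, j]:.2f}")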


Convergence methods

Gradient descent approach

Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function. The loss function describes how well the model will perform given the current set of parameters (weights and biases), and gradient descent is used to find the best set of parameters.


Gradient descent is slow for large training sets because it uses all of the training data to compute the loss and its gradient at every iteration.

[Image: illustration of gradient descent moving down the loss surface]
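
A minimal sketch of (batch) gradient descent for a linear-regression problem, written in plain Python/numpy; note how every update uses all N training points, which is what makes the method slow for large training sets (the data and hyperparameters here are illustrative assumptions):

import numpy as np

# Toy training data: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0          # arbitrary starting point on the error manifold
lr = 0.1                 # learning rate
for step in range(500):
    y_pred = w * x + b
    # Gradients of the MSE loss, computed over ALL training points
    grad_w = 2 * np.mean((y_pred - y) * x)
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}")   # should approach w=2, b=1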


  

Stochastic gradient descent

Stochastic gradient descent is a very popular algorithm used across machine learning and, most importantly, forms the basis of training neural networks: a neural network is trained using the stochastic gradient descent optimisation algorithm, with the weights updated by the back-propagation of error.

It is similar to gradient descent above, with the change that the loss is calculated on a random subset of the data (a mini-batch). Because of this, each iteration is much faster.

[Image: illustration of stochastic gradient descent taking noisy steps towards the minimum]
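
A minimal sketch of the stochastic (mini-batch) variant of the loop above; the only change is that each update uses a small random subset of the data rather than all of it (batch size and other settings are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32
for step in range(2000):
    # Loss/gradient computed on a random mini-batch, not the full data set
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    y_pred = w * xb + b
    grad_w = 2 * np.mean((y_pred - yb) * xb)
    grad_b = 2 * np.mean(y_pred - yb)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}")   # noisy, but still approaches w=2, b=1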


Conjugate gradient approach


In conjugate gradient we move like the rook in chess: along one search direction at a time, never diagonally. Successive search directions are chosen to be conjugate to each other, so a new step does not undo the progress made along previous directions. This approach is often faster than plain gradient descent.

[Image: illustration of the conjugate gradient path on a loss surface]
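
In practice you rarely implement conjugate gradient by hand. As a sketch, scipy exposes a nonlinear conjugate-gradient optimiser that can minimise the same least-squares loss; using scipy.optimize.minimize with method="CG" is an assumption about your environment, not something taken from the article:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

def loss(params):
    w, b = params
    return np.mean((w * x + b - y) ** 2)

# Nonlinear conjugate-gradient minimisation starting from an arbitrary point
result = minimize(loss, x0=np.array([0.0, 0.0]), method="CG")
print(result.x)   # should be close to [2, 1]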



Convergence criteria

Learning rate

The learning rate should be neither very low nor very high: too low and training crawls, too high and the loss oscillates or diverges. Optimisers like Adam adjust the learning rate adaptively to handle this requirement; see the references at the end for more detail.

[Image: effect of the learning rate on convergence]
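
A small sketch of why the learning rate matters: the same gradient-descent loop applied to a simple quadratic loss f(w) = w**2 with three different learning rates, too small (barely moves), reasonable (converges), and too large (diverges). The specific rates are illustrative assumptions:

def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in (0.001, 0.1, 1.5):
    print(f"lr={lr:<6} final w = {gradient_descent(lr):.4f}")
# lr=0.001 barely moves, lr=0.1 converges towards 0, lr=1.5 oscillates and diverges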



Convergence minima


For good accuracy, ML training should converge to the global minimum. Whether the global minimum can be attained depends on the loss function of the ML model at hand.


[Image: loss landscape with several local minima and one global minimum]


Attaining the global minimum

If the loss function is convex, the algorithm will converge to the global minimum: a convex function has only one minimum. So algorithms that use a convex loss function naturally guarantee the global minimum. For example, the least-squares loss used in linear regression is convex, so it guarantees a global minimum (see the references for a discussion). The same holds for the logistic loss function.

[Image: a convex loss function with a single global minimum]
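
One way to see numerically that the least-squares loss is convex: its Hessian with respect to the weight vector is 2 * X^T X / N, which is positive semi-definite for any data matrix X. The small check below is an illustrative sketch (with a made-up design matrix), not a proof:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))        # arbitrary design matrix (N=100, 3 features)

# Hessian of the mean-squared-error loss with respect to the weight vector
H = 2 * X.T @ X / len(X)

# All eigenvalues are >= 0, so the loss surface is convex (a single global minimum)
print(np.linalg.eigvalsh(H))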



What about the global minimum for non-convex loss functions?

Neural-network loss functions are not necessarily convex. The papers linked in the references below discuss conditions under which neural-network training still reaches a global minimum.


Reducing the learning rate slowly over training helps the model settle into a good (ideally global) minimum. Optimisers like Adam provide an adaptive learning rate; see the references for details.
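
A sketch of a simple learning-rate decay schedule added to the mini-batch SGD loop from earlier; the exponential decay factor is an illustrative assumption, and libraries such as PyTorch or TensorFlow provide ready-made schedulers and the Adam optimiser:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2 * x + 1 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0
lr0, decay = 0.1, 0.999          # initial learning rate and per-step decay factor
for step in range(2000):
    lr = lr0 * decay ** step     # learning rate is slowly reduced over training
    idx = rng.choice(len(x), size=32, replace=False)
    xb, yb = x[idx], y[idx]
    grad_w = 2 * np.mean((w * xb + b - yb) * xb)
    grad_b = 2 * np.mean(w * xb + b - yb)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}")   # less jitter near the minimum than with a fixed rate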

Point to remember

Loss function, cost function, and objective function are essentially the same concept under different names.

Reference
Thanks to these helping hands
https://images.app.goo.gl/ysYRoS1zjQFq7Y8W8

https://www.kdnuggets.com/2020/05/5-concepts-gradient-descent-cost-function.html

https://www.dhirubhai.net/pulse/optimization-convergence-machine-learning-algorithms-amlesh-kanekar/

https://ai.stackexchange.com/questions/16348/what-is-convergence-in-machine-learning

https://www.quora.com/What-is-convergence-in-the-context-of-Machine-Learning

https://images.app.goo.gl/VPH6gSXBiwvPknr48

https://images.app.goo.gl/y35fp6Ti4AYpPA6C9

https://images.app.goo.gl/yfz1dRWEp14bX5dq6

https://images.app.goo.gl/DxBZbZjRp4cS2sfU7

https://images.app.goo.gl/QbdRvEHHikGkcxPL6

https://medium.com/swlh/a-short-introduction-on-conjugate-gradien-d7faec192c4b

https://youtu.be/toruVYsc2mU?t=184

https://www.dhirubhai.net/posts/dpkumar_optimization-machinelearning-datascience-activity-6751430811356147712-oYo6

https://www.quora.com/Does-Gradient-Descent-Algo-always-converge-to-the-global-minimum

https://images.app.goo.gl/kmdt7nSaGm2EA4iF6

https://math.stackexchange.com/questions/2774106/why-is-the-least-square-cost-function-for-linear-regression-convex

https://www.cs.cmu.edu/~yandongl/loss.html

https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31

https://youtu.be/vMh0zPT0tLI

https://datascience.stackexchange.com/questions/36450/what-is-the-difference-between-gradient-descent-and-stochastic-gradient-descent

https://images.app.goo.gl/TA9C9kdhehdztt9x8

https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/

https://arxiv.org/pdf/1811.03804.pdf

https://openreview.net/pdf?id=BJk7Gf-CZ

https://stats.stackexchange.com/questions/90874/how-can-stochastic-gradient-descent-avoid-the-problem-of-a-local-minimum

https://youtu.be/ugOgmoeUAVs

https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

