Stochastic Gradient Descent Overview
Sarveshwaran Rajagopal
Data Scientist and Trainer (Gen AI & NLP) | Empowered 7000+ Professionals & Students to Excel in Data Science | Speaker and Thought Leader in Data Science
Points discussed:
What is the need for Optimization?
Every learning algorithm has the objective of reducing error, and this reduction in error is achieved through optimization techniques.
Error: Cross-Entropy Loss in Logistic Regression, Sum of Squared Errors in Linear Regression.
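As a quick illustration (my sketch, not from the original article), here is how these two errors can be computed with NumPy; the arrays y_true, y_pred, labels, and p_pred are hypothetical stand-ins:

import numpy as np

# Hypothetical regression targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Sum of Squared Errors (Linear Regression)
sse = np.sum((y_true - y_pred) ** 2)

# Hypothetical binary labels and predicted probabilities
labels = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.8, 0.6])

# Cross-Entropy Loss (Logistic Regression)
cross_entropy = -np.mean(labels * np.log(p_pred) + (1 - labels) * np.log(1 - p_pred))

print(sse, cross_entropy)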
What is the Optimization function?
Optimization is a mathematical technique for either minimizing or maximizing some function f(x) by altering x. In most real-world scenarios, we will be minimizing f(x). How do we maximize f(x), if the thought comes up? Simply by minimizing -f(x). Any function f(x) that we want to either minimize or maximize is called the objective function or criterion.
What is a Loss function?
Whenever we minimize our objective function f(x), we call it a loss function. The loss function goes by various names, such as cost function or error function.
Derivative:
Let’s take an example of a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f’(x) or dy/dx. The derivative f’(x) gives the slope of f(x) at the point x. In other words, it tells us how a small change in the input corresponds to a change in the output. The derivative is useful for minimizing the loss because it tells us how to change x to reduce the error, i.e., to make a small improvement in y.
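To make this concrete, here is a small numeric sketch (my addition, assuming the simple example f(x) = x²): a finite difference approximates f’(x), and stepping against its sign reduces f(x).

def f(x):
    return x ** 2                      # example function y = f(x)

h = 1e-6
x = 3.0
derivative = (f(x + h) - f(x)) / h     # finite-difference estimate of f'(3), about 6
x_new = x - 0.1 * derivative           # step against the sign of the derivative
print(f(x), f(x_new))                  # f decreases: 9.0 -> about 5.76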
Optimization Methods:
1. Unconstrained – Closed Form Solution:
Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running an iterative algorithm and jump directly to the critical point by solving the equation ∇ₓf(x) = 0 for x.
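For instance, in linear regression with squared loss, setting the gradient to zero yields the normal equations, which can be solved in closed form. A minimal sketch with hypothetical data X and y:

import numpy as np

# Hypothetical design matrix (with a bias column) and targets
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

# For f(theta) = ||X @ theta - y||^2, the condition "gradient = 0"
# gives the normal equations X.T @ X @ theta = X.T @ y:
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)    # jumps straight to the critical point, no iterations needed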
Why is the Iterative method more robust than the closed form?
Gradient Descent (First Order Iterative Method):
Gradient Descent is an iterative method: you start at some point, compute the gradient (slope) there, and take a descent step based on it. The technique of moving x in small steps with the opposite sign of the derivative is called gradient descent. In other words, the gradient points uphill and the negative gradient points downhill, so we can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.
A new point is proposed as x' = x − ε∇ₓf(x), where ε is the learning rate, a positive scalar determining the size of the step. A simple approach is to choose a small constant; popular methods for arriving at an optimum learning rate ε are grid search and line search. (A worked step-by-step sketch appears below.)
Choice of Learning Rate:
Although gradient descent is limited to optimization in continuous spaces, the general concept of repeatedly making the best small move (in either direction) toward a better configuration can be generalized to discrete spaces.
Gradient descent can be pictured as a ball rolling down into a valley: the ball descends along the steepest slope, and the learning rate ε corresponds to the size of the steps it takes to reach the lowest point of the valley.
For example, let’s consider the function below as the cost function:
Step-by-step approach:
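The original figures are not reproduced here; as a hedged stand-in, take f(x) = x² as the cost function and watch gradient descent walk toward the minimum step by step:

def f(x):
    return x ** 2            # assumed cost function (stand-in for the original figure)

def grad(x):
    return 2 * x             # its derivative f'(x)

x = 4.0                      # arbitrary starting point
lr = 0.3                     # learning rate (epsilon)
for step in range(5):
    x = x - lr * grad(x)     # update rule: x' = x - epsilon * f'(x)
    print(step, round(x, 4), round(f(x), 4))
# x shrinks toward 0, the point where f'(x) = 0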
A question arises when f’(x) = 0: at such a point the derivative provides no information about which direction to move. Points where f’(x) = 0 are known as critical points or stationary points.
A few other important terms to know before we move on to SGD:
Local Minimum:
A local minimum is a point where f(x) is lower than at all neighbouring points, so it is no longer possible to decrease f(x) by taking baby steps. This is also called a relative minimum.
Local Maximum:
A local maximum is a point where f(x) is higher than at all neighbouring points, so it is impossible to increase f(x) by taking baby steps. This is also called a relative maximum.
Saddle Points:
Some critical or stationary points are neither maxima nor minima; these are called saddle points.
Global Minimum:
The global minimum is the smallest value of the function f(x); it is also called the absolute minimum. There can be only one global minimum, whereas there can be one or more local minima.
When the input to the function is multidimensional, most optimization methods fail to find the global minimum because the function may have multiple local minima, local maxima, and saddle points.
This is one of the greatest challenges in optimization; we often settle for a value of f that is very low but not necessarily minimal in any formal sense.
What is the Jacobian Matrix?
When we have multiple inputs, we must use partial derivatives with respect to each variable xᵢ. The partial derivative ∂f(x)/∂xᵢ measures how f changes as only the variable xᵢ increases at point x. The gradient of f is the vector containing all the partial derivatives, denoted ∇ₓf(x). In multiple dimensions, critical points are points where every element of the gradient is equal to zero.
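A minimal numeric sketch (my addition) of these partial derivatives, using the example function f(x₁, x₂) = x₁² + 3x₂ and finite differences:

import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1]          # example function of two inputs

def numerical_gradient(f, x, h=1e-6):
    # Approximate each partial derivative (d f / d x_i) by a finite difference
    g = np.zeros_like(x)
    for i in range(len(x)):
        x_step = x.copy()
        x_step[i] += h
        g[i] = (f(x_step) - f(x)) / h
    return g

x = np.array([2.0, 1.0])
print(numerical_gradient(f, x))          # about [4.0, 3.0], i.e., [2*x1, 3]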
Sometimes we need to find all derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as the Jacobian matrix. Discussion of Jacobian and Hessian matrices is beyond the scope of this article.
Challenges in Gradient Descent:
Stochastic Gradient Descent:
Stochastic Gradient Descent (SGD) is an extension of Gradient Descent.
Any machine learning or deep learning model works on the same objective: minimize f(x) to reduce the error and generalize well when new data comes in.
To overcome the challenges of gradient descent, we take a small set of samples: on each step of the algorithm, we sample a minibatch drawn uniformly from the training set. The minibatch size is typically chosen to be a relatively small number of examples, from one to a few hundred.
Using the examples from the minibatch, we form an estimate g of the gradient, and the SGD algorithm then follows this estimated gradient downhill: θ ← θ − εg.
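A minimal sketch of this loop (my illustration, assuming a linear model with squared loss and synthetic data; names like batch_size are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                         # synthetic training inputs
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=1000)  # synthetic targets

theta = np.zeros(3)      # model parameters
lr = 0.1                 # learning rate (epsilon)
batch_size = 32          # minibatch of m' examples

for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)     # minibatch drawn uniformly
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ theta - yb) / batch_size      # minibatch gradient estimate
    theta = theta - lr * g                             # theta <- theta - epsilon * g

print(theta)             # close to [1.0, -2.0, 0.5]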
Gradient descent has often been regarded as slow or unreliable, and it was considered infeasible for non-convex optimization problems. With stochastic gradient descent, machine learning algorithms train very well, reaching a low value of the cost (often a local minimum) in a reasonable amount of time.
A crucial parameter for SGD is the learning rate. In practice, it is necessary to decrease the learning rate over time, so we denote the learning rate at iteration k as εₖ.
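One common schedule (standard practice in the SGD literature, not detailed in the original) decays εₖ linearly from ε₀ to ε_τ over τ iterations and then holds it constant; a sketch:

def learning_rate(k, eps0=0.1, eps_tau=0.001, tau=100):
    # Linear decay from eps0 to eps_tau over tau iterations, then constant
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

for k in (0, 50, 100, 500):
    print(k, learning_rate(k))   # 0.1, 0.0505, 0.001, 0.001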