Gradient Descent at your fingertips
“The beauty of quantum machine learning is that we do not need to depend on an algorithm like gradient descent or convex objective function. The objective function can be nonconvex or something else.”
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient of the function at the current point.
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.
Gradient descent is used to minimize a cost function J(W) parameterized by model parameters W. The gradient (or derivative) tells us the incline or slope of the cost function. Hence, to minimize the cost function, we move in the direction opposite to the gradient.
Gradient Descent is one of the most popular and widely used optimization algorithms. Let’s say that we are given a machine learning model parameterized by weights and biases and a cost function to evaluate how good the model is. Then our learning problem is to find the values of the parameters which minimize the cost function.
These are definitions provided by different sites, but such definitions can be confusing because they do not offer any practical explanation. In this article, I will try to cover the basics of gradient descent along with an example.
Gradient Descent is an iterative method. We start with some arbitrary initial values for our model parameters (the weights and biases) and improve them gradually. In each iteration, we get a sense of how the cost function behaves for weights near the current values and move in the direction in which the cost function decreases. By repeating these steps thousands of times, we keep reducing our cost function.
Let’s say that we want to minimize the cost function J(W), parameterized by model parameters W.
STEP 1: Initialize the weights W randomly.
STEP 2: Calculate the Gradients.
STEP 3: Update the weights using the update formula
W = W - η ∇J(W)
Here η is the learning rate, which determines the size of the steps we take toward the minimum. We will discuss how to pick a good value for the learning rate later in this tutorial. Notice the minus (-) sign in the update formula: the gradient gives us the direction in which the cost function increases, but we have to move in the direction in which the cost function decreases, which is the direction opposite to the gradient.
STEP 4: Repeat till J(W) stops reducing or some other termination criterion is met. Typically, we just check whether the cost function reduced by at least some small threshold such as 0.001. We may also put a limit on the maximum number of iterations allowed, say 10,000.
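In code, these four steps map onto a short loop. Here is a minimal sketch in Python for a one-parameter cost, assuming you supply the cost J and its gradient grad_J (the toy quadratic at the end is just for illustration):

```python
def gradient_descent(J, grad_J, w0, eta=0.01, tol=1e-3, max_iters=10_000):
    """Minimize a one-parameter cost J(w); grad_J(w) is its derivative."""
    w = w0                          # STEP 1: initial weight (chosen by the caller)
    prev_cost = J(w)
    for _ in range(max_iters):      # cap on iterations (STEP 4)
        w = w - eta * grad_J(w)     # STEP 2 + STEP 3: gradient, then update
        cost = J(w)
        if prev_cost - cost < tol:  # stop once J(w) barely reduces (STEP 4)
            break
        prev_cost = cost
    return w

# Example: J(w) = (w - 3)^2 has derivative 2(w - 3) and its minimum at w = 3
print(gradient_descent(lambda w: (w - 3) ** 2, lambda w: 2 * (w - 3), w0=0.0, eta=0.1))
# prints a value near 3
```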
Gradient Descent for Linear Regression.
Here is the formula for the cost function that we are going to use:
J(w1, b) = (1/N) Σ (y(x) - (b + w1x))^2
This is the MSE (Mean Squared Error) that you must be familiar with from linear regression. In this formula, b + w1x is the prediction of y(x).
The gradients with respect to w1 and b are given by these two equations:
dJ/dw1 = -(2/N) Σ (y(x) - (b + w1x)) x
dJ/db = -(2/N) Σ (y(x) - (b + w1x))
I will not show the step-by-step derivation of the gradients here, since deriving the gradient formulas is not the focus. However, if you’d like to try, you need to use calculus and, in particular, the chain rule of differentiation.
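Putting the cost, the gradients, and the update rule together gives a minimal runnable sketch in Python (the tiny dataset is made up purely for illustration):

```python
import numpy as np

# Made-up toy dataset: y is roughly 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

w1, b, eta = 0.0, 0.0, 0.05           # initial parameters and learning rate

for _ in range(2000):
    error = (b + w1 * x) - y          # prediction minus target
    grad_w1 = 2 * np.mean(error * x)  # dJ/dw1 from the formula above
    grad_b = 2 * np.mean(error)       # dJ/db
    w1 -= eta * grad_w1               # step opposite the gradient
    b -= eta * grad_b

print(w1, b)                          # converges to roughly 1.94 and 1.09
```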
Then we update w1 and b using the update equation from STEP 3. For linear regression, we can visualize gradient descent as follows.
Here the x-axis and the y-axis represent the weight w1 and the bias b, and the z-axis is the value of the cost function. As we change the weight and bias values, we keep descending to lower values of the cost function until we arrive at the minimum. The cost function for linear regression is shaped like a bowl, so there is a single minimum and gradient descent always proceeds in the right direction. In general, though, gradient descent is also used when the cost function’s shape is not that straightforward.
The cost function may also have more than one minimum.
For many machine learning algorithms there is only a single minimum, so gradient descent finds the optimal solution. But in general, gradient descent often works quite well even for machine learning problems where the cost function has more than one minimum.
Choosing the Learning Rate.
When the learning rate is small, each step of the descent is also extremely small, which means it will take a very long time to reach the minimum. On the other hand, a larger learning rate may converge faster, but if it is too large, it may overshoot the minimum and fail to converge. Given this, we can use the following strategy for finding a good learning rate: start with a small value of η (learning rate) such as 0.1, 0.01, or 0.001, and then adapt it based on how the cost function behaves. If the cost function is reducing very slowly, increase η; if the cost function is exploding, decrease η. Although manually choosing the learning rate is still the most common practice, optimizers such as Adam and AdaGrad choose a suitable learning rate automatically.
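A quick way to see this trade-off is to try a few values of η on a toy cost such as J(w) = w^2 (the specific values below are just for illustration):

```python
def run(eta, steps=20, w=1.0):
    # J(w) = w^2 has gradient 2w; repeat the update w <- w - eta * 2w
    for _ in range(steps):
        w -= eta * 2 * w
    return w

for eta in (0.001, 0.1, 1.1):
    print(f"eta={eta}: w after 20 steps = {run(eta):.4f}")
# eta=0.001 barely moves, eta=0.1 converges toward 0, eta=1.1 overshoots and diverges
```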
Variants Of Gradient Descent
There are multiple variants of gradient descent, depending on how much of the data is used to calculate the gradient. The main reason for this is computational efficiency: a dataset may have millions of data points, and calculating the gradient over the entire dataset can be computationally expensive.
Batch Gradient Descent computes the gradient of the cost function over the entire training data. Since we need to calculate the gradients for the whole dataset to perform a single parameter update, batch gradient descent can be very slow.
Stochastic Gradient Descent computes the gradient for each update using a single training data point x(i), chosen at random. The idea is that the gradient calculated this way is an approximation to the gradient calculated over the entire training data. Each update is now much faster than in batch gradient descent, and over many updates, we head in the same general direction.
Mini-Batch Gradient Descent calculates the gradient for each small mini-batch of training data. That is, we first divide the training data into small batches of, say, M data points each, and perform one update per mini-batch. M is usually in the range 30 to 500, depending on the problem. Mini-batch gradient descent is typically used because computing infrastructure such as compilers, CPUs, and GPUs is often optimized to perform vector additions and multiplications.
Of these, Stochastic Gradient Descent and Mini-Batch Gradient Descent are the most popular. In a typical scenario, we do several passes over the training data before the termination criteria are met.
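Here is a minimal mini-batch sketch in Python on a synthetic dataset (batch gradient descent is the special case M = N, and stochastic gradient descent is M = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, eta = 1000, 50, 0.1                      # dataset size, mini-batch size, learning rate
x = rng.normal(size=N)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=N)   # synthetic data: y is roughly 3x + 2

w, b = 0.0, 0.0
for epoch in range(20):                        # one epoch = one full pass over the data
    idx = rng.permutation(N)                   # shuffle, then slice into mini-batches
    for start in range(0, N, M):
        batch = idx[start:start + M]
        error = (w * x[batch] + b) - y[batch]
        w -= eta * 2 * np.mean(error * x[batch])  # gradient computed on this mini-batch only
        b -= eta * 2 * np.mean(error)

print(w, b)                                    # close to 3 and 2
```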
Example of Gradient Descent
Let’s take a concrete example which combines everything we learnt.
Let x = (x1, x2) = (2, -3) and y = 1, where x1 and x2 are the features of our single training data point and y is the output variable. In addition, we need to choose our model and cost function. Let’s say we are doing linear regression, so the prediction P is given by
P = w1x1 + w2x2 + b, where w1 and w2 are the weights and b is the bias.
STEP 1:
Initialize the weights (at random) and bias. Let’s say (w1, w2) = (0.1, 0.2) and b = 0
STEP 2:
For each data point in training data:
- Calculate the Prediction
- Calculate the cost
- Calculate the gradient.
- Update the weights.
(Repeat the process until the termination condition is met, i.e., until there is almost no change in the value of the cost function even after further iterations and weight updates.)
i) Calculate the Prediction
P = w1x1 + w2x2 + b
= 2w1 - 3w2 + b
= 0.2 - 0.6 + 0
= -0.4
ii) Calculate the Cost
J(w) = (P - y)^2
= (-0.4 - 1)^2
= (-1.4)^2
= 1.96
iii) Calculate the Gradient
We take partial derivatives, since w is a vector here. Applying the chain rule to J(w) = (P - y)^2:
dJ/dw1 = 2(P - y) x1 = 2(-1.4)(2) = -5.6
dJ/dw2 = 2(P - y) x2 = 2(-1.4)(-3) = 8.4
dJ/db = 2(P - y) = -2.8
iv) Update the weights (using learning rate η = 0.001)
w1 = 0.1 - 0.001 × (-5.6) = 0.1056
w2 = 0.2 - 0.001 × (8.4) = 0.1916
b = 0 - 0.001 × (-2.8) = 0.0028
This gives us new weights (w1, w2) = (0.1056, 0.1916) and bias b = 0.0028
STEP 3: Check if the termination criterion is met (see whether the MSE (Mean Squared Error) is still reducing).
{ If (MSE is still reducing): repeat from STEP 2, ELSE: STOP }
NOTE: Predictions made using the new weights should be better.
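The whole worked example fits in a few lines of Python. This sketch reproduces the numbers above, using the learning rate η = 0.001 from the update step (prints are rounded to hide tiny floating-point noise):

```python
x1, x2, y = 2.0, -3.0, 1.0       # the single training point and its target
w1, w2, b = 0.1, 0.2, 0.0        # STEP 1: initial weights and bias
eta = 0.001                      # learning rate used in the update step

P = w1 * x1 + w2 * x2 + b        # prediction: 0.2 - 0.6 + 0 = -0.4
J = (P - y) ** 2                 # cost: (-1.4)^2 = 1.96
print(round(P, 4), round(J, 4))  # -0.4 1.96

dP = 2 * (P - y)                 # chain rule: 2(P - y) = -2.8
w1 -= eta * dP * x1              # 0.1 - 0.001 * (-5.6) = 0.1056
w2 -= eta * dP * x2              # 0.2 - 0.001 * (8.4) = 0.1916
b -= eta * dP                    # 0 - 0.001 * (-2.8) = 0.0028
print(round(w1, 4), round(w2, 4), round(b, 4))  # 0.1056 0.1916 0.0028

P_new = w1 * x1 + w2 * x2 + b      # new prediction: -0.3608
print(round((P_new - y) ** 2, 4))  # new cost 1.8518 < 1.96, so the update helped
```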
RESOURCES
1) StatQuest
2) 3Blue1Brown
THANK YOU...