Gradient Descent and Backpropagation

In previous articles, I have referred to the concepts of gradient descent and backpropagation many times, but I did not go into their details or implementations (the truth is, I didn't know them either. LOL). To close this gap, I will share my understanding of these two concepts in this article.


The relationship between gradient descent and backpropagation

In machine learning, gradient descent and backpropagation often appear at the same time, and sometimes the two terms are even used interchangeably. So what is the relationship between them? In fact, we can consider backpropagation a subset of gradient descent: it is the implementation of gradient descent in multi-layer neural networks. Since the same training rule recursively applies to each layer of the neural network, we can calculate the contribution of each weight to the total error inversely, from the output layer back to the input layer, which is the so-called backpropagation. It doesn't matter if you don't quite understand what I'm talking about yet. As you get deeper into this article, the statement above will make more sense to you.

Gradient Descent

Gradient descent is a first-order iterative optimization algorithm used to find a local minimum or the global minimum of a function. The algorithm itself is not hard to understand:

  1. Start from a point a_n on the graph of the function;
  2. From that point, find the direction in which the function decreases fastest, i.e., the negative gradient -▽F(a_n);
  3. Take a small step of size γ (down) along this direction to get to a new point a_n+1 = a_n - γ▽F(a_n).

By iterating the above three steps, we can find a local minimum or the global minimum of this function, as sketched in the example below.
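Here is a minimal sketch of the three steps in Python, assuming an illustrative function F(x) = (x - 3)^2 (my choice, not from the article), whose derivative is 2(x - 3):

  # Gradient descent on F(x) = (x - 3)**2, whose derivative is 2*(x - 3).
  def grad_F(x):
      return 2.0 * (x - 3.0)

  x = 0.0        # step 1: start from a point on the graph
  gamma = 0.1    # the small step size (learning rate)
  for _ in range(100):
      x = x - gamma * grad_F(x)   # steps 2-3: move against the gradient
  print(x)   # converges toward 3.0, the minimum of F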

Why do I always emphasize "local minimum OR global minimum"? Because they are two different concepts, as shown below:

This is a non-convex function, which has one local minimum and one global minimum. If we fall into the local minimum in the process of gradient descent, it is not easy to climb out and find the global minimum, which means we cannot get the best result. In other words, it is generally easier to get a good result when using gradient descent on a convex function. In machine learning, the function that needs to be minimized by gradient descent is the loss function. This explains why the ideal loss function should be a convex function.

So, how do we find the direction of steepest descent from a point? Very simple: we just differentiate the function. The gradient at a point gives the direction of steepest ascent, so its negative gives the direction of steepest descent. This explains why the ideal loss function should also be differentiable everywhere.

By knowing the current point and the steepest direction to go, it seems that we just need to take one small step in that direction to complete one iteration of gradient descent. However, we still need to think about the size of that small step. In machine learning, this step size is called the learning rate. The learning rate cannot be too large, otherwise it is easy to overshoot the minimum. It cannot be too small either, otherwise the convergence of gradient descent will be too slow. Normally, the initial value of the learning rate is set within the range of 10^-3 to 10^-1. In addition, although the learning rate is a constant in common situations, there are also algorithms that can dynamically adjust it during training in order to achieve better results, as sketched below.
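As one illustration of such dynamic adjustment, here is a small Python sketch of step decay; the constants are arbitrary assumptions, not prescriptions:

  # Halve the learning rate every 10 epochs (illustrative numbers).
  def learning_rate(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
      return initial_lr * (drop ** (epoch // epochs_per_drop))

  # e.g. with initial_lr = 0.1: epochs 0-9 use 0.1, epochs 10-19 use 0.05, ...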

Now, let’s use a classic analogy to understand gradient descent. A hiker is stuck in the mountains and is trying to get down (i.e., trying to find the minimum). But there is heavy fog, so visibility is extremely low, and he can only see a small range around him. At this time, what this hiker can do is:

  1. Start from his position on the mountain;
  2. Look around and find the direction with the steepest descent;
  3. Proceed a small step in that direction.

By repeating the above three steps, he would eventually find his way down the mountain.

Note: this is just an analogy; please don't actually use this method when you get lost in the mountains. The correct way is to call the police for help. The terrain of a mountain is normally non-convex, after all.

For ease of understanding, our explanation and analogy of gradient descent have so far been based on the scenario of a single training data point. However, in actual neural network training, we use tens of thousands of data points, so how are they used in the gradient descent algorithm? There are two commonly used methods.

The most basic method is standard (batch) gradient descent, in which the gradient of each iteration is the average of the gradients of all data points:

w := w - (γ/n) · Σ_i ▽Q_i(w), i = 1, …, n

where Q_i(w) is the loss on the i-th data point and n is the total number of training data points.

The advantage of this method is that the gradient is accurate and the function converges quickly. But when the training dataset is enormous, evaluating the gradient over all data points becomes expensive and the training time can be very long.

Another method is called stochastic gradient descent, which samples (with replacement) a subset (one or more) of the training data to calculate the gradient. It's a bit like the bootstrapping algorithm I introduced earlier. Based on research, the prerequisite for this method to (almost surely) converge to the global minimum is that the function must be convex or pseudoconvex; otherwise, it almost surely converges to a local minimum, not the global minimum. The benefit of this method is obvious: it drastically reduces the computational cost of every iteration. Hence, it is very effective for large-scale machine learning problems. In return, the convergence of the function becomes slower.

w := w - γ▽Q_i(w)

in which Q_i represents the loss evaluated on one randomly selected data point i.

Or, in mini-batch form:

w := w - (γ/m) · Σ_i ▽Q_i(w)

in which the sum runs over a randomly sampled mini-batch of m data points, and m is far less than the total number of data points n.
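To make the two methods concrete, here is a small Python (NumPy) sketch that estimates the gradient from a mini-batch for a linear least-squares model; the model, data, and batch size are illustrative assumptions:

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(1000, 3))            # n = 1000 data points
  y = X @ np.array([1.0, -2.0, 0.5])        # targets from a known weight vector
  w = np.zeros(3)

  def gradient(w, X_batch, y_batch):
      # gradient of the mean squared error over the given batch
      residual = X_batch @ w - y_batch
      return 2.0 * X_batch.T @ residual / len(y_batch)

  gamma = 0.1
  for step in range(200):
      idx = rng.integers(0, len(y), size=32)    # sample m = 32 << n, with replacement
      w -= gamma * gradient(w, X[idx], y[idx])  # stochastic (mini-batch) step
  # standard gradient descent would instead use gradient(w, X, y) at every step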


Backpropagation

Modern neural networks are typically composed of multiple layers. Each layer contains multiple neurons and one bias (Note: for ease of reading, the term "bias" in this article sometimes refers to a constant node in a neural network, as here, and sometimes to the weight connecting a constant node and a neuron), and the layers are connected by weights. The data flows forward from the input layer to the output layer. Then, (in the case of supervised machine learning) we compare the predicted results with the real results and calculate the total error with the loss function. In fact, the process of minimizing the total error is the process of finding the minimum of the loss function. Finding the minimum? It sounds like we can just use the gradient descent algorithm we discussed in the previous sections.

As we know, the loss function is a function of the weights and biases. So, its gradient can be calculated by taking its derivative with respect to the weights and the biases, so that we know how much each variable contributes to the total error. However, since the weights and biases in different layers are composed and accumulated layer upon layer, it is not an easy task to calculate the gradients with respect to them. And this is where backpropagation comes to the rescue!

If I were asked to describe the backpropagation algorithm in one sentence, it would be: propagate the total error backward through the connections in the network layer by layer, calculate the contribution (gradient) of each weight and bias to the total error in every layer, then use the gradient descent algorithm to optimize the weights and biases, and eventually minimize the total error of the neural network.

Before explaining the backpropagation algorithm in detail, let’s do some preparation. First of all, we need to build a simple multi-layer neural network as an example:

This network contains 5 layers: 1 input layer, 1 output layer, and 3 hidden layers.

Also, we use the following notation for the relationships in the above diagram:

in which, w_l_kj (w with superscript l and subscript kj, et cetera) represents the weight connecting neuron j in layer (l-1) and neuron k in layer l. The reason we need two subscripts for a weight w is that we need to mark not only which neuron this weight points to, but also which neuron it starts from in the previous layer;

a_l-1_j represents the output of neuron j in layer (l-1);

b_l_k represents the weight connecting neuron k and the constant node in layer l, namely the bias;

z_l_k represents the weighted input sum of all inputs of neuron k in layer l, which is just the input of the activation function of this neuron.

And the weighted input sum and the output of a neuron can be denoted by the following functions:

z_l_k = Σ_j (w_l_kj · a_l-1_j) + b_l_k
a_l_k = σ(z_l_k)

in which σ (sigma) represents the activation function.

Similarly, we can use the same functions to denote the relationships between the equivalent elements in other layers.
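For concreteness, the two functions above can be written as a short Python (NumPy) sketch; the vectorized layout (one weight matrix per layer) is my assumption, and the Sigmoid is used as an example activation:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def forward_layer(W, b, a_prev):
      # W[k, j] is w_l_kj; a_prev[j] is a_l-1_j; b[k] is b_l_k
      z = W @ a_prev + b       # weighted input sum z_l_k of each neuron
      return sigmoid(z), z     # output a_l_k and pre-activation z_l_k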

In addition, we also need to leverage the “Chain Rule”:

Let z = f(y) and y = g(x); then the derivative of z with respect to x can be written as:

dz/dx = (dz/dy) · (dy/dx)
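For example, if y = g(x) = 3x + 1 and z = f(y) = y^2, then dz/dy = 2y and dy/dx = 3, so dz/dx = 2y · 3 = 6(3x + 1).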


WARNING: massive math calculation is ahead!

Error Signal

Firstly, let’s define the "error signal" of a neuron. Let:

δ_l_k = ∂E/∂z_l_k

where E is the total error. It measures how much the total error changes when the weighted input sum of the neuron is changed.

We can expand the above expression by the chain rule:

δ_l_k = ∂E/∂z_l_k = Σ_j (∂E/∂z_l+1_j · ∂z_l+1_j/∂z_l_k) = (Σ_j w_l+1_jk · δ_l+1_j) · σ'(z_l_k)

We can conclude two points from the above expression:

  1. In order to obtain the error signal of a neuron in layer l, the error signals of all neurons in layer (l+1) have to be calculated first. In other words, the error signal is calculated recursively layer by layer, namely backpropagation.
  2. The error signal of a neuron is composed of two components (see the sketch after this list):
  • The weighted sum of the error signals of all neurons in the next layer which this neuron connects with;
  • The derivative of this neuron’s activation function.
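In code, the recursion above is essentially a one-liner. Here is a minimal Python (NumPy) sketch, assuming the Sigmoid activation and a weight matrix W_next for layer (l+1), where W_next[j, k] is w_l+1_jk:

  import numpy as np

  def sigmoid_prime(z):
      s = 1.0 / (1.0 + np.exp(-z))
      return s * (1.0 - s)     # derivative of the Sigmoid

  def error_signal(W_next, delta_next, z):
      # (weighted sum of next layer's error signals) x (activation derivative)
      return (W_next.T @ delta_next) * sigmoid_prime(z)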

Since the whole process starts from the output layer, the key point is to calculate the error signal of the neurons in the output layer. The following derivation illustrates how to do it (L denotes the output layer, and a_L_n = y_n is the output of neuron n):

δ_L_n = ∂E/∂z_L_n = (∂E/∂a_L_n) · (∂a_L_n/∂z_L_n) = (∂E/∂a_L_n) · σ'(z_L_n)

Is that all? It still doesn’t seem we can calculate the result directly, does it? Actually, we can. I just don’t want to specify any concrete loss function and activation function in this derivation. If we assume that the loss function is the squared error E = (1/2) · Σ_n (y_n - t_n)^2 and the activation function is the Sigmoid function (so that σ'(z) = σ(z) · (1 - σ(z))), the formula becomes more straightforward:

δ_L_n = (y_n - t_n) · y_n · (1 - y_n)

in which, y_n is the output of the neuron n in the output layer, a known number;

t_n is the expected result of the neuron n, part of the training data, a known number;

sigma represents the activation function of the output layer, known;

z_L_n is the weighted input sum of the neuron n, a known number;

Hence, the error signal of the output layer is calculable!
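Under the squared-error loss and Sigmoid activation assumed above, this is a one-line computation:

  def output_error_signal(y, t):
      # y: network outputs; t: expected results (targets)
      return (y - t) * y * (1.0 - y)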

Calculate the Gradients of the Weights

Now, we can finally derive the gradient formula of an arbitrary weight in a neural network, that is, the derivative of the loss function with respect to that weight. Based on the chain rule and the definition of the error signal, we have the following transformation:

∂E/∂w_l_kj = ∂E/∂z_l_k · ∂z_l_k/∂w_l_kj = δ_l_k · ∂z_l_k/∂w_l_kj

Expand the second half of the above formula (recall that z_l_k = Σ_j (w_l_kj · a_l-1_j) + b_l_k):

∂z_l_k/∂w_l_kj = a_l-1_j

Combine the above two steps:

∂E/∂w_l_kj = δ_l_k · a_l-1_j

which means:

(the gradient of a weight) = (the error signal of the neuron that this weight points to) x (the output of the neuron that this weight starts from)

Wait a moment, you may wonder what happened to the biases; they are also "weights", and the loss function should be differentiated with respect to them as well. Here you go (recall that the constant node attached to a bias always outputs 1):

∂E/∂b_l_k = ∂E/∂z_l_k · ∂z_l_k/∂b_l_k = δ_l_k · 1 = δ_l_k

The gradient of the loss function with respect to the bias of each neuron is simply the neuron’s error signal!

Update Weights

Given the gradients, according to the gradient descent algorithm, we can get the formulas to update the weights and biases:

w_l_kj := w_l_kj - η · δ_l_k · a_l-1_j
b_l_k := b_l_k - η · δ_l_k

in which, eta is the learning rate;

delta_l_k is the error signal of the neuron that the weight points to;

a_l-1_j is the output of the neuron that the weight starts from.

It is important to note that the weights should be updated only after the error signal of every neuron has been calculated. All weight updates are carried out in one shot AFTER one iteration of backpropagation.

I recommend you have a close look at the following diagram, which will give you a better understanding of the backpropagation algorithm:


At last, let’s summarize the training process of using the (stochastic) gradient descent algorithm (a minimal end-to-end code sketch follows the list):

  1. (Randomly take several samples from the training dataset and) send the data points into the neural network;
  2. Forward propagate the data points through the network individually and get their respective outputs;
  3. Use the loss function to calculate the total error for each data point;
  4. (For each data point,) use the backpropagation algorithm to calculate the gradient of the loss function with respect to each weight and bias (and then average the gradients);
  5. Update the weights and biases using the gradient descent algorithm;
  6. Repeat the above steps to minimize the total error.
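Putting all six steps together, here is a minimal end-to-end Python (NumPy) sketch that trains a tiny network on the XOR problem; the architecture, data, and hyperparameters are illustrative assumptions, not the article's exact setup:

  import numpy as np

  rng = np.random.default_rng(0)

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # toy dataset: learn XOR
  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  T = np.array([[0], [1], [1], [0]], dtype=float)

  # one hidden layer: 2 -> 3 -> 1, Sigmoid everywhere, squared-error loss
  W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
  W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
  eta = 0.5

  for epoch in range(10000):
      dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
      dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
      for x, t in zip(X, T):
          # steps 1-2: forward propagate one data point
          z1 = W1 @ x + b1;  a1 = sigmoid(z1)
          z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
          # steps 3-4: backpropagate the error signals
          delta2 = (a2 - t) * a2 * (1 - a2)            # output layer
          delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # hidden layer
          dW2 += np.outer(delta2, a1); db2 += delta2
          dW1 += np.outer(delta1, x);  db1 += delta1
      # step 5: update everything in one shot, averaging the gradients
      n = len(X)
      W2 -= eta * dW2 / n; b2 -= eta * db2 / n
      W1 -= eta * dW1 / n; b1 -= eta * db1 / n
  # step 6 is the outer loop; the outputs now approximate XOR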
