The Backpropagation Algorithm in Neural Nets is Just Linear Algebra
Collins Patrick O.
Research Analyst @ Dole | Data Science & AI in Agriculture | Precision Farming & Sustainability
Without the backpropagation algorithm, there would be no learning in Machine Learning, pun intended. It is the backbone of modern neural networks, enabling them to learn from data and improve their performance over time. In this post, we’ll break down the inner workings of this algorithm, exploring its foundational mathematical concepts and practical applications.
Having trained and deployed Machine Learning and Neural Network models for over 2 years, I finally decided to look behind the curtain and past all those abstractions provided by a simple import statement from TensorFlow, Keras, or PyTorch. And what did I find? Well, linear algebra. Yes, essentially linear regression at massive scale. Don’t get me wrong, it is also littered with matrix calculus, but the driving force behind all this hype is ultimately finding the best fit for millions of parameters.
In his thought-provoking article discussing the slow progress of innovation in machine learning, Mark Saroufim highlighted a key point. When his students asked him about the essential math required for deep learning, he simplified it to “Matrix Multiplication and Derivatives of square functions.”
It’s fairly common to say that mathematics is unreasonably effective in describing the natural world. Similarly, within mathematics, linear algebra is unreasonably effective in studying the rest of mathematics. ~ Michael Penn
Before we dive into what backpropagation does, let’s first explore how neural networks make predictions through an algorithm called forward propagation.
Forward Propagation:
Let’s say we have the Linear Regression Equation: y = mx + b
To determine the exact equation, you’ll need the specific values of m and b. These can be obtained from the best-fit line in the graph above, such that for any value x, we can substitute it to calculate y. It seems pretty straightforward. Next, imagine we have a dataset of images, such as the famous MNIST handwritten digits. Each image is represented as a matrix of 28x28 pixel values, and our goal is to have a machine classify these images into their respective digits. The 784 (28x28) pixel values would be the X values, and the gradient and y-intercept would be the parameters (called ‘weights’ and ‘biases’ in ML folklore) that allow the machine to classify these images. These 784 pixel values would result in 784 linear equations, with varying weights (wi) and biases (bi) for every given pixel input (xi). Zooming into the first pixel value, this equation would look like this: z1 = w1 · x1 + b1. Thus, the machine computes the weighted sum of these values to determine the output.
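Here is a minimal NumPy sketch of that weighted sum. The pixel values and parameters below are random placeholders, not real MNIST data or trained weights; they only illustrate the shapes involved.

```python
import numpy as np

# A hypothetical flattened MNIST image: 784 pixel values x_i in [0, 1)
rng = np.random.default_rng(42)
x = rng.random(784)   # pixel inputs x_i
w = rng.random(784)   # one weight w_i per pixel
b = rng.random(784)   # one bias b_i per pixel, matching the per-pixel equations above

# Each pixel's linear equation: z_i = w_i * x_i + b_i
z_per_pixel = w * x + b

# The neuron's pre-activation output is the weighted sum of these contributions
z = z_per_pixel.sum()
print(z)
```

In practice frameworks express the same computation as a single dot product, `np.dot(w, x)`, plus a bias term.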
However, simply summing up the inputs with their respective weights doesn’t allow the neural network to handle complex tasks effectively. So, it’s better to transform this sum into a nonlinear space. One common way to do this is by using the sigmoid function. It works like this: if the weighted sum is very positive, the neuron output is close to 1, and if it’s very negative, the output is close to 0. These transformations, like sigmoid, are called activation functions. They’re super important because they decide if a neuron should “fire up” or not, meaning, whether the information it’s getting is relevant. In other words, without an activation function, a neural network would be a simple linear regression model.
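The sigmoid’s squashing behavior is easy to see in code:

```python
import numpy as np

def sigmoid(z):
    """Squash a weighted sum into (0, 1): very positive z -> ~1, very negative z -> ~0."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10))   # close to 1: the neuron "fires"
print(sigmoid(-10))  # close to 0: the neuron stays quiet
print(sigmoid(0))    # exactly 0.5, the midpoint
```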
From the diagram above, let’s say we have an input (x1) value of 0.1 and want to predict the output (a2). If we have a neural network with optimized weight (w1 = 0.15) and bias (b1 = 0.4), we start by calculating z1: the result of multiplying the input by its weight and adding the bias. In this case, z1 comes out to be 0.415. Then, we pass this value through the sigmoid function, which gives us an output (a1) of 0.6023.
Furthermore, if our network has two neurons, the first neuron's output will become the second one's input. The process repeats: we multiply the input (a1) by its weight (w2), add the bias (b2), and apply the sigmoid function. This time, the network produces an output (a2) of 0.7153.
This is how a neural network predicts the output for any given input. Thus, the equations for the forward propagation algorithm are as follows:
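The full forward pass can be sketched in a few lines of NumPy. Note that the article states w1 and b1 but not the second neuron’s parameters; the values w2 = 0.45 and b2 = 0.65 below are assumptions, chosen to be consistent with the stated output of 0.7153.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = 0.1
w1, b1 = 0.15, 0.4   # first neuron's parameters (given in the text)
w2, b2 = 0.45, 0.65  # second neuron's parameters (assumed; consistent with a2 = 0.7153)

# Forward propagation through both neurons
z1 = w1 * x1 + b1    # weighted sum of the input: 0.415
a1 = sigmoid(z1)     # first activation: ~0.6023
z2 = w2 * a1 + b2    # weighted sum of the first activation
a2 = sigmoid(z2)     # network output: ~0.7153

print(round(a1, 4), round(a2, 4))
```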
Gradient Descent:
In our forward propagation algorithm, we assumed that the neural network already had optimized weights and biases. But how exactly do neural networks learn and optimize these parameters for a given problem and dataset?
Training occurs in a supervised learning setup, where each data point comes with a corresponding label or ground truth. The training process kicks off when the values predicted by the neural network don’t align with the ground truth for a given input. First, the error (E) between the predicted values and the ground-truth labels is calculated; this measure is often referred to as the cost or loss function. For our simple network with only two neurons, taking just one input, we compute the squared error between the ground truth (T) and the predicted value (a2) as follows.
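Written out explicitly, with the conventional factor of one-half (assumed here, since it makes the derivative come out clean):

```latex
E = \frac{1}{2}\,(T - a_2)^2
```

The ½ cancels the 2 produced by differentiating the square, so the derivative of E with respect to a2 is simply -(T - a2).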
Next, we propagate this error back into the network and utilize it to perform gradient descent on the various weights and biases within the network. But what exactly is gradient descent? Gradient descent is an iterative optimization algorithm employed to find the minimum of a function, in this case the cost function (squared error), while adhering to a predefined pace known as the learning rate.
Essentially, we multiply the gradient (which indicates the derivative of the error with respect to the weights or biases) by the learning rate to determine the direction and magnitude of the update necessary to minimize the error. This is a core principle in gradient descent optimization: the objective is to adjust the weights and biases and minimize the error as much as possible. The gradient offers the direction for this adjustment, while the learning rate regulates the size of each step. The weights or biases for our simple network with only two neurons are updated as follows (where alpha is the learning rate):
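Spelled out, the standard gradient descent update rules take this form:

```latex
w \leftarrow w - \alpha\,\frac{\partial E}{\partial w},
\qquad
b \leftarrow b - \alpha\,\frac{\partial E}{\partial b}
```

Subtracting the scaled gradient moves each parameter in the direction that decreases the error.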
Backpropagation:
Since the error is not directly a function of the weights and biases, we can’t directly calculate the derivative of the error with respect to them. Instead, we’ll use the chain rule to establish this derivative.
Updating weights and biases for the second layer:
To update w2, we first recognize that E is a function of a2, a2 is a function of z2, and z2 is a function of w2. Therefore, we can take the derivative of E with respect to a2, the derivative of a2 with respect to z2, and the derivative of z2 with respect to w2. Then, the derivative of the error (E) with respect to w2 is simply the product of these individual derivatives.
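In symbols, assuming the squared-error loss E = ½(T − a₂)² and the sigmoid activation (whose derivative is σ′(z) = σ(z)(1 − σ(z))), this chain of derivatives works out to:

```latex
\frac{\partial E}{\partial w_2}
= \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial w_2}
= -(T - a_2)\cdot a_2(1 - a_2)\cdot a_1
```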
Updating b2 follows the same process as updating w2, except that the derivative of z2 with respect to b2 is 1 instead of the input a1.
Updating weights and biases for the first layer:
Again, we use a similar expression to update w1. However, since the error isn’t directly a function of w1, we employ the chain rule to establish the derivative of the error with respect to w1.
Starting from E as a function of a2, we know that a2 is a function of z2, z2 is a function of a1, a1 is a function of z1, and z1 is a function of w1. Therefore, we can take the derivative of E with respect to a2, the derivative of a2 with respect to z2, the derivative of z2 with respect to a1, the derivative of a1 with respect to z1, and finally the derivative of z1 with respect to w1.
Then, the derivative of the error (E) with respect to w1 is simply the product of these individual derivatives.
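Under the same assumptions as before (squared-error loss and sigmoid activations), the longer chain for the first layer works out to:

```latex
\frac{\partial E}{\partial w_1}
= \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_1}
= -(T - a_2)\cdot a_2(1 - a_2)\cdot w_2\cdot a_1(1 - a_1)\cdot x_1
```

Notice that the first two factors are shared with the second-layer update, which is why backpropagation reuses them rather than recomputing from scratch.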
Updating b1 follows the same process as updating w1, except that the derivative of z1 with respect to b1 is 1 instead of the input x1.
Applying Backpropagation:
Let’s apply backpropagation to our example from forward propagation. Assuming the ground truth (T) is 0.25, we first compute the error between the ground truth and the predicted value.
Next, the weights and biases will be updated for either a predefined number of iterations or epochs (say 1000), or until the error is within a predefined threshold, like 0.001.
Using a learning rate of 0.4, let’s observe how the weights and biases change after the first iteration.
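The first iteration can be reproduced end to end in NumPy. As before, w2 = 0.45 and b2 = 0.65 are assumed values (the text gives only w1 and b1), and the loss is taken to be E = ½(T − a₂)²:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, T, alpha = 0.1, 0.25, 0.4        # input, ground truth, learning rate (from the text)
w1, b1, w2, b2 = 0.15, 0.4, 0.45, 0.65  # w2, b2 are assumed values

# Forward pass
z1 = w1 * x1 + b1; a1 = sigmoid(z1)
z2 = w2 * a1 + b2; a2 = sigmoid(z2)
error = 0.5 * (T - a2) ** 2

# Backward pass via the chain rule
dE_da2 = -(T - a2)
da2_dz2 = a2 * (1 - a2)
delta2 = dE_da2 * da2_dz2            # shared factor for the second layer
dE_dw2 = delta2 * a1
dE_db2 = delta2
delta1 = delta2 * w2 * a1 * (1 - a1) # propagated back to the first layer
dE_dw1 = delta1 * x1
dE_db1 = delta1

# Gradient descent updates
w2 -= alpha * dE_dw2; b2 -= alpha * dE_db2
w1 -= alpha * dE_dw1; b1 -= alpha * dE_db1

# Forward pass with the updated parameters: the error should shrink
a2_new = sigmoid(w2 * sigmoid(w1 * x1 + b1) + b2)
new_error = 0.5 * (T - a2_new) ** 2
print(error, new_error)
```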
This marks the completion of the first iteration or epoch of the training process. With the updated weights and biases, we proceed with another round of forward propagation to calculate the predicted value and compare it to the ground truth. Subsequently, we calculate the error and continue with another round of backpropagation.
In summary, the training algorithm unfolds as follows:
1. Initialize the weights and biases to random values.
2. Repeat the following steps: a. Calculate the network output using forward propagation. b. Calculate the error between the ground truth and the predicted output of the network. c. Update the weights and biases through backpropagation.
3. Stop after the specified number of iterations, or once the error between the ground truth and the predicted output falls below a predefined threshold.
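The three steps above can be sketched as a complete training loop for our toy two-neuron network. The random initialization and the single training example (x1 = 0.1, T = 0.25) mirror the running example; everything else follows the update rules derived earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Initialize the weights and biases to random values
rng = np.random.default_rng(0)
w1, b1, w2, b2 = rng.random(4)
x1, T, alpha = 0.1, 0.25, 0.4

for epoch in range(1000):                 # 2. iterate for a fixed number of epochs
    # a. forward propagation
    a1 = sigmoid(w1 * x1 + b1)
    a2 = sigmoid(w2 * a1 + b2)
    # b. error between ground truth and prediction
    error = 0.5 * (T - a2) ** 2
    if error < 0.001:                     # 3. or stop once below the threshold
        break
    # c. backpropagation and gradient descent
    delta2 = -(T - a2) * a2 * (1 - a2)
    delta1 = delta2 * w2 * a1 * (1 - a1)
    w2 -= alpha * delta2 * a1
    b2 -= alpha * delta2
    w1 -= alpha * delta1 * x1
    b1 -= alpha * delta1

print(epoch, error)
```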
Neural Networks: From Mathematics to Code
If you’re curious about how to translate these mathematical concepts into real code, then this is exactly what you’re looking for. I’ve put together a Notebook where you can interact with these ideas and construct your own neural network from “scratch”. There is no need for fancy libraries like TensorFlow, Keras, or PyTorch, and forget about pre-trained models; this is all about getting your hands dirty with raw code. Every mathematical computation is done using the NumPy library. I hope you give it a try and find it helpful. And if you’re feeling generous, don’t forget to leave an upvote. Happy coding!