The Backpropagation Algorithm in Neural Nets is Just Linear Algebra
Collins Patrick O.
Research Analyst @ Dole | Data Science & AI in Agriculture | Precision Farming & Sustainability
Without the backpropagation algorithm, there would be no learning in Machine Learning, pun intended. It is the backbone of modern neural networks, enabling them to learn from data and improve their performance over time. In this post, we’ll break down the inner workings of this algorithm, exploring its foundational mathematical concepts and practical applications.
Having trained and deployed Machine Learning and Neural Network models for over 2 years, I finally decided to look behind the curtain and past all those abstractions provided by a simple import statement from TensorFlow, Keras, or PyTorch. And what did I find? Well, linear algebra. Yes, essentially linear regression at massive scale. Don’t get me wrong, it is also littered with matrix calculus, but the driving force behind all this hype is ultimately finding the best fit for millions of parameters.
In his thought-provoking article discussing the slow progress of innovation in machine learning, Mark Saroufim highlighted a key point. When his students asked him about the essential math required for deep learning, he simplified it to “Matrix Multiplication and Derivatives of square functions.”
It’s fairly common to say that mathematics is unreasonably effective in describing the natural world. Similarly, within mathematics, linear algebra is unreasonably effective in studying the rest of mathematics. ~ Michael Penn
Before we dive into what backpropagation does, let’s first explore how neural networks make predictions through an algorithm called forward propagation.
Forward Propagation:
Let’s say we have the Linear Regression Equation: y = mx + b
To determine the exact equation, you’ll need the specific values of m and b. These can be obtained from the best-fit line in the graph above, such that for any value x, we can substitute it to calculate y. It seems pretty straightforward. Next, imagine we have a dataset of images, such as the famous MNIST handwritten digits. Each image is represented as a matrix of 28x28 pixel values, and our goal is to have a machine classify these images into their respective digits. The 784 (28x28) pixel values would be the X values, and the gradient and y-intercept would be the parameters (called ‘weights’ and ‘biases’ in ML folklore) that allow the machine to classify these images. These 784 pixel values would result in 784 linear equations, with varying weights (wi) and biases (bi) for every given pixel input (xi). Zooming into the first pixel value, this equation would look like this: z1 = w1 · x1 + b1. Thus, the machine computes the weighted sum of these values to determine the output.
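Here is a minimal NumPy sketch of that weighted sum. The pixel values and parameters below are random placeholders, not real MNIST data or trained weights; they only illustrate the shapes involved.

```python
import numpy as np

# A hypothetical flattened MNIST image: 784 pixel values x_i in [0, 1)
rng = np.random.default_rng(42)
x = rng.random(784)   # pixel inputs x_i
w = rng.random(784)   # one weight w_i per pixel
b = rng.random(784)   # one bias b_i per pixel, matching the per-pixel equations above

# Each pixel's linear equation: z_i = w_i * x_i + b_i
z_per_pixel = w * x + b

# The neuron's pre-activation output is the weighted sum of these contributions
z = z_per_pixel.sum()
print(z)
```

In practice frameworks express the same computation as a single dot product, `np.dot(w, x)`, plus a bias term.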
However, simply summing up the inputs with their respective weights doesn’t allow the neural network to handle complex tasks effectively. So, it’s better to transform this sum into a nonlinear space. One common way to do this is by using the sigmoid function. It works like this: if the weighted sum is very positive, the neuron output is close to 1, and if it’s very negative, the output is close to 0. These transformations, like sigmoid, are called activation functions. They’re super important because they decide if a neuron should “fire up” or not, meaning, whether the information it’s getting is relevant. In other words, without an activation function, a neural network would be a simple linear regression model.
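The sigmoid’s squashing behavior is easy to see in code:

```python
import numpy as np

def sigmoid(z):
    """Squash a weighted sum into (0, 1): very positive z -> ~1, very negative z -> ~0."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10))   # close to 1: the neuron "fires"
print(sigmoid(-10))  # close to 0: the neuron stays quiet
print(sigmoid(0))    # exactly 0.5, the midpoint
```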
From the diagram above, let’s say we have an input (x1) value of 0.1 and want to predict the output (a2). If we have a neural network with optimized weight (w1 = 0.15) and bias (b1 = 0.4), we start by calculating z1: the result of multiplying the input by its weight and adding the bias. In this case, z1 comes out to be 0.415. Then, we pass this value through the sigmoid function, which gives us an output (a1) of 0.6023.
Furthermore, if our network has two neurons, the first neuron's output will become the second one's input. The process repeats: we multiply the input (a1) by its weight (w2), add the bias (b2), and apply the sigmoid function. This time, the network produces an output (a2) of 0.7153.
This is how a neural network predicts the output for any given input. Thus, the equations for the forward propagation algorithm are as follows:
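The full forward pass can be sketched in a few lines of NumPy. Note that the article states w1 and b1 but not the second neuron’s parameters; the values w2 = 0.45 and b2 = 0.65 below are assumptions, chosen to be consistent with the stated output of 0.7153.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = 0.1
w1, b1 = 0.15, 0.4   # first neuron's parameters (given in the text)
w2, b2 = 0.45, 0.65  # second neuron's parameters (assumed; consistent with a2 = 0.7153)

# Forward propagation through both neurons
z1 = w1 * x1 + b1    # weighted sum of the input: 0.415
a1 = sigmoid(z1)     # first activation: ~0.6023
z2 = w2 * a1 + b2    # weighted sum of the first activation
a2 = sigmoid(z2)     # network output: ~0.7153

print(round(a1, 4), round(a2, 4))
```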
Gradient Descent:
In our forward propagation algorithm, we assumed that the neural network already had optimized weights and biases. But how exactly do neural networks learn and optimize these parameters for a given problem and dataset?
Training occurs in a supervised learning setup, where each data point comes with a corresponding label or ground truth. The training process kicks off when the values predicted by the neural network don’t align with the ground truth for a given input. First, the error (E) between the predicted values and the ground-truth labels is calculated; this measure is often referred to as the cost or loss function. For our simple network with only two neurons, taking just one input, we compute the squared error between the ground truth (T) and the predicted value (a2) as follows.
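Written out explicitly, with the conventional factor of one-half (assumed here, since it makes the derivative come out clean):

```latex
E = \frac{1}{2}\,(T - a_2)^2
```

The ½ cancels the 2 produced by differentiating the square, so the derivative of E with respect to a2 is simply -(T - a2).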
Next, we propagate this error back into the network and utilize it to perform gradient descent on the various weights and biases within the network. But what exactly is gradient descent? Gradient descent is an iterative optimization algorithm employed to find the minimum of a function, in this case the cost function (squared error), while adhering to a predefined pace known as the learning rate.
Essentially, we multiply the gradient (which indicates the derivative of the error with respect to the weights or biases) by the learning rate to determine the direction and magnitude of the update necessary to minimize the error. This is a core principle in gradient descent optimization: the objective is to adjust the weights and biases and minimize the error as much as possible. The gradient offers the direction for this adjustment, while the learning rate regulates the size of each step. The weights or biases for our simple network with only two neurons are updated as follows (where alpha is the learning rate):
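Spelled out, the standard gradient descent update rules take this form:

```latex
w \leftarrow w - \alpha\,\frac{\partial E}{\partial w},
\qquad
b \leftarrow b - \alpha\,\frac{\partial E}{\partial b}
```

Subtracting the scaled gradient moves each parameter in the direction that decreases the error.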
Backpropagation:
Since the error is not directly a function of the weights and biases, we can’t directly calculate the derivative of the error with respect to them. Instead, we’ll use the chain rule to establish this derivative.
Updating weights and biases for the second layer:
To update w2, we first recognize that E is a function of a2, a2 is a function of z2, and z2 is a function of w2. Therefore, we can take the derivative of E with respect to a2, the derivative of a2 with respect to z2, and the derivative of z2 with respect to w2. Then, the derivative of the error (E) with respect to w2 is simply the product of these individual derivatives.
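In symbols, assuming the squared-error loss E = ½(T − a₂)² and the sigmoid activation (whose derivative is σ′(z) = σ(z)(1 − σ(z))), this chain of derivatives works out to:

```latex
\frac{\partial E}{\partial w_2}
= \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial w_2}
= -(T - a_2)\cdot a_2(1 - a_2)\cdot a_1
```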
Updating b2 follows the same process as updating w2, except that the derivative of z2 with respect to b2 is 1 instead of the input a1.
Updating weights and biases for the first layer:
Again, we use a similar expression to update w1. However, since the error isn’t directly a function of w1, we employ the chain rule to establish the derivative of the error with respect to w1.
Starting from E as a function of a2, we know that a2 is a function of z2, z2 is a function of a1, a1 is a function of z1, and z1 is a function of w1. Therefore, we can take the derivative of E with respect to a2, the derivative of a2 with respect to z2, the derivative of z2 with respect to a1, the derivative of a1 with respect to z1, and finally the derivative of z1 with respect to w1.
Then, the derivative of the error (E) with respect to w1 is simply the product of these individual derivatives.
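Under the same assumptions as before (squared-error loss and sigmoid activations), the longer chain for the first layer works out to:

```latex
\frac{\partial E}{\partial w_1}
= \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_1}
= -(T - a_2)\cdot a_2(1 - a_2)\cdot w_2\cdot a_1(1 - a_1)\cdot x_1
```

Notice that the first two factors are shared with the second-layer update, which is why backpropagation reuses them rather than recomputing from scratch.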
Updating b1 follows the same process as updating w1, except that the derivative of z1 with respect to b1 is 1 instead of the input x1.
Applying Backpropagation:
Let’s apply backpropagation to our example from forward propagation. Assuming the ground truth (T) is 0.25, we first compute the error between the ground truth and the predicted value.
Next, the weights and biases will be updated for either a predefined number of iterations or epochs (say 1000), or until the error is within a predefined threshold, like 0.001.
Using a learning rate of 0.4, let’s observe how the weights and biases change after the first iteration.
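The first iteration can be reproduced end to end in NumPy. As before, w2 = 0.45 and b2 = 0.65 are assumed values (the text gives only w1 and b1), and the loss is taken to be E = ½(T − a₂)²:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, T, alpha = 0.1, 0.25, 0.4        # input, ground truth, learning rate (from the text)
w1, b1, w2, b2 = 0.15, 0.4, 0.45, 0.65  # w2, b2 are assumed values

# Forward pass
z1 = w1 * x1 + b1; a1 = sigmoid(z1)
z2 = w2 * a1 + b2; a2 = sigmoid(z2)
error = 0.5 * (T - a2) ** 2

# Backward pass via the chain rule
dE_da2 = -(T - a2)
da2_dz2 = a2 * (1 - a2)
delta2 = dE_da2 * da2_dz2            # shared factor for the second layer
dE_dw2 = delta2 * a1
dE_db2 = delta2
delta1 = delta2 * w2 * a1 * (1 - a1) # propagated back to the first layer
dE_dw1 = delta1 * x1
dE_db1 = delta1

# Gradient descent updates
w2 -= alpha * dE_dw2; b2 -= alpha * dE_db2
w1 -= alpha * dE_dw1; b1 -= alpha * dE_db1

# Forward pass with the updated parameters: the error should shrink
a2_new = sigmoid(w2 * sigmoid(w1 * x1 + b1) + b2)
new_error = 0.5 * (T - a2_new) ** 2
print(error, new_error)
```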
This marks the completion of the first iteration or epoch of the training process. With the updated weights and biases, we proceed with another round of forward propagation to calculate the predicted value and compare it to the ground truth. Subsequently, we calculate the error and continue with another round of backpropagation.
In summary, the training algorithm unfolds as follows:
1. Initialize the weights and biases to random values.
2. Repeat the following steps: a. Calculate the network output using forward propagation. b. Calculate the error between the ground truth and the predicted output of the network. c. Update the weights and biases through backpropagation.
3. Stop after the specified number of iterations, or once the error between the ground truth and the predicted output falls below a predefined threshold.
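The three steps above can be sketched as a complete training loop for our toy two-neuron network. The random initialization and the single training example (x1 = 0.1, T = 0.25) mirror the running example; everything else follows the update rules derived earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Initialize the weights and biases to random values
rng = np.random.default_rng(0)
w1, b1, w2, b2 = rng.random(4)
x1, T, alpha = 0.1, 0.25, 0.4

for epoch in range(1000):                 # 2. iterate for a fixed number of epochs
    # a. forward propagation
    a1 = sigmoid(w1 * x1 + b1)
    a2 = sigmoid(w2 * a1 + b2)
    # b. error between ground truth and prediction
    error = 0.5 * (T - a2) ** 2
    if error < 0.001:                     # 3. or stop once below the threshold
        break
    # c. backpropagation and gradient descent
    delta2 = -(T - a2) * a2 * (1 - a2)
    delta1 = delta2 * w2 * a1 * (1 - a1)
    w2 -= alpha * delta2 * a1
    b2 -= alpha * delta2
    w1 -= alpha * delta1 * x1
    b1 -= alpha * delta1

print(epoch, error)
```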
Neural Networks: From Mathematics to Code
If you’re curious about how to translate these mathematical concepts into real code, then this is exactly what you’re looking for. I’ve put together a Notebook where you can interact with these ideas and construct your own neural network from “scratch”. There is no need for fancy libraries like TensorFlow, Keras, or PyTorch, and forget about pre-trained models; this is all about getting your hands dirty with raw code. Every mathematical computation is done using the NumPy library. I hope you give it a try and find it helpful. And if you’re feeling generous, don’t forget to leave an upvote. Happy coding!