Backpropagation algorithm: a fundamental building block in a neural network.

In his 1960 paper “Gradient Theory of Optimal Flight Paths”, Henry J. Kelley presented the first version of a continuous backpropagation model. His model was grounded in control theory, but it laid the groundwork for later refinements and for the algorithm's eventual use in artificial neural networks.

Almost 30 years later, in 1986, it was popularized by Rumelhart, Hinton, and Williams in a paper called Learning representations by back-propagating errors.

The algorithm uses the chain rule to train a neural network successfully. After each forward pass, backpropagation conducts a backward pass across the network, modifying the model's parameters (weights and biases).

In this article you can read about:

  • Activation function and its importance
  • Neural network model (input layer, hidden layer, output layer)
  • Forward propagation
  • Cost function
  • Moving backward through the network
  • How does it work in ML problems?

Activation function and its importance

In an artificial neural network, the activation function of a neuron defines the output of that neuron given a set of inputs. Without it, a multilayer perceptron is a composition of successive linear functions, which is itself just a linear function. That is why we need a non-linear activation function to introduce non-linearity into the model.

“If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units.” [9] (Bishop 2006)

Let's consider an activation function called the sigmoid. The sigmoid takes in an input: if it is a very negative number, the sigmoid transforms it into a number very close to 0. Likewise, if the input is a very positive number, the sigmoid maps it to a number very close to 1, and if the input is close to 0, the output is a number around 0.5.


Node output = activation(weighted sum of inputs + bias)

So for the sigmoid, 0 is the lower limit and 1 is the upper limit. With the sigmoid activation function, each neuron's output lies between 0 and 1: the closer it is to 1, the more activated that neuron is, and the closer to 0, the less activated it is.
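As a quick illustration, here is a minimal sketch of the sigmoid in Python, showing the squashing behaviour described above:

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real input into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

# Very negative inputs land near 0, very positive inputs near 1,
# and an input of exactly 0 lands at 0.5.
print(sigmoid(-10.0))  # close to 0
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
```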

Neural network model (input layer, hidden layer, output layer)


Now we forward-propagate.

The hidden node 0 is at the top of the diagram.

The first step is to sum each input times its associated weight: (2.0)(0.1) + (3.0)(0.5) + (4.0)(0.9) = 5.3. Next, the bias value is added: 5.3 + (-2.0) = 3.3. The third step is to feed the result of step two to the activation function: 1.0 / (1.0 + exp(-3.3)) = 0.96.

Similarly, y1 = 0.62. Here y0 is close to the expected output and y1 is not. The idea is to decrease the activation of y1 so that the network performs better on the next iteration.
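The three steps above can be sketched in Python; the inputs, weights, and bias are the values from the hidden node 0 example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias):
    # Step 1: weighted sum of inputs; step 2: add the bias;
    # step 3: feed the result to the activation function.
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(z)

inputs  = [2.0, 3.0, 4.0]
weights = [0.1, 0.5, 0.9]   # weights into hidden node 0
bias    = -2.0

print(round(node_output(inputs, weights, bias), 2))  # 0.96
```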

Cost function

For the cost of a single training example, we take the output that the network gives (y0, y1) along with the expected output and add up the squares of the differences between each component. Let's consider the expected output to be (1, 0). Then

cost = (y0 - 1)^2 + (y1 - 0)^2

Doing the same for all the training examples and averaging the results gives you the total cost of the network. What we are looking for is the negative gradient of this cost function, which tells us how to change all of the weights and biases, all of the connections, in order to decrease the cost most efficiently. The backpropagation algorithm is how this highly complicated gradient is computed.
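A minimal sketch of this per-example cost in Python, plugging in the outputs y0 = 0.96 and y1 = 0.62 from the forward pass and the expected output (1, 0):

```python
def example_cost(outputs, expected):
    """Sum of squared differences between network output and expected output."""
    return sum((o - e) ** 2 for o, e in zip(outputs, expected))

# Network outputs (y0, y1) from the forward pass vs. the expected output (1, 0).
cost = example_cost([0.96, 0.62], [1.0, 0.0])
print(round(cost, 4))  # 0.386
```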

So if we wish to decrease or increase the activation of a neuron, we can:

  • decrease/increase the bias
  • decrease/increase the weights
  • change the activations from the previous layer

With gradient descent, we need not decide by hand whether each component should get nudged up or down; the gradient tells us which changes decrease the cost most efficiently.

“Neurons that fire together wire together.” (Donald Hebb)

Moving backward through the network

So to change activations in the previous layer, we should increase the activations connected by positive weights and decrease the ones connected by negative weights; that makes the neuron more active. The changes should also be proportional to the size of the corresponding weights. We cannot make changes in those activations directly, we can only control the weights and biases, but just as with the last layer, it is helpful to keep a note of what those desired changes are.

Remember, we also want all of the other output neurons in the last layer, the ones that are not close to their expected outputs, to become less active, and each of those neurons has its own opinion about what should happen to the second-to-last layer. Those desired changes need to be added up, each in proportion to the corresponding weights and to how much each of those neurons needs to change.

This is where the idea of propagating backward comes in. Backpropagation aims to minimize the cost function by adjusting the network's weights and biases; it is the core of neural-net training. It is the process of fine-tuning the weights of a neural network based on the error rate (i.e. loss) from the preceding epoch. Properly tuning the weights reduces the error rate, boosting the model's generalization and making it more dependable.
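To make the backward pass concrete, here is a minimal sketch for a single sigmoid neuron trained on one example; the input, target, initial weight, bias, and learning rate are hypothetical values chosen for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One sigmoid neuron trained on a single example (all values hypothetical).
x, target = 1.5, 1.0       # input and desired output
w, b, lr = 0.8, -0.5, 0.1  # initial weight, bias, learning rate

costs = []
for step in range(20):
    z = w * x + b                    # forward pass: weighted sum plus bias
    a = sigmoid(z)                   # activation
    costs.append((a - target) ** 2)  # squared-error cost
    # Backward pass, by the chain rule: dC/dw = dC/da * da/dz * dz/dw
    dC_da = 2 * (a - target)
    da_dz = a * (1 - a)              # derivative of the sigmoid
    w -= lr * dC_da * da_dz * x      # nudge the weight against the gradient
    b -= lr * dC_da * da_dz          # nudge the bias against the gradient

print(f"cost fell from {costs[0]:.4f} to {costs[-1]:.4f}")
```

Repeating the forward and backward passes steadily decreases the cost, which is exactly the fine-tuning described above.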

As the paper Learning representations by back-propagating errors puts it, the procedure:

“repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.”

How does it work in ML problems? “stochastic gradient descent”.

You randomly shuffle your training data and then divide it into a whole bunch of mini-batches, each one containing, say, 100 training examples.

Then compute a gradient-descent step according to the mini-batch. This is not the actual gradient of the cost function, which depends on all of the training data, not on this tiny subset.

So it’s not the most efficient step downhill. But each mini-batch does give you a pretty good approximation, and more importantly, it gives you a significant computational speed up.

This technique is referred to as “stochastic gradient descent”. Repeatedly doing this, you will converge towards a local minimum of the cost function, meaning your network is going to end up doing a really good job on the training examples.
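The procedure above can be sketched as follows; `sgd`, `grad_fn`, the toy dataset, and the learning rate are hypothetical stand-ins chosen for illustration:

```python
import random

def sgd(data, grad_fn, params, lr=0.1, batch_size=100, epochs=20):
    """Mini-batch stochastic gradient descent, as described above."""
    for _ in range(epochs):
        random.shuffle(data)                       # randomly shuffle training data
        for i in range(0, len(data), batch_size):  # divide into mini-batches
            batch = data[i:i + batch_size]
            grads = grad_fn(params, batch)         # approximate gradient from the batch
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Toy demo: minimise the average of (p - x)^2, whose optimum is the data mean.
random.seed(0)
data = [float(x) for x in range(1, 201)]
grad = lambda params, batch: [sum(2 * (params[0] - x) for x in batch) / len(batch)]
(p,) = sgd(data, grad, [0.0])
print(round(p, 1))  # a value near 100.5, the mean of the data
```

Each step follows only the batch gradient, so the parameter bounces around the true optimum rather than landing on it exactly, which is the trade-off described above.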

Remarks: In this article, I described how backpropagation works under the hood using the cost function and gradients. Taking a simple neural network and computing these quantities by hand will give you a better understanding of how the math behind neural networks works.

You can also find me on LinkedIn or email me directly. I'd love to hear from you.

