The Importance of Initializing the Weights Properly
We all know WᵀX + b: multiply the input by the transpose of the weights and add a bias. We see this in linear regression, logistic regression, and neural networks. Without W and b we cannot learn: there is no place to start the forward propagation, and therefore no loss/cost function, no backpropagation, nothing. We have learned that W and b are values we must initialize randomly and then optimize over multiple iterations by computing the cost function and running backpropagation. (P.S. The bias is initialized to 0 by default.)
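To make the notation concrete, here is a minimal NumPy sketch of that affine step with randomly drawn weights and a zero bias (the layer sizes and random seed are my own illustration, not from the article):

```python
# Minimal sketch: the affine step z = W^T x + b with random weights and zero bias.
import numpy as np

n_inputs, n_neurons = 4, 3                       # hypothetical layer sizes
rng = np.random.default_rng(0)

W = rng.standard_normal((n_inputs, n_neurons))   # weights initialized randomly
b = np.zeros(n_neurons)                          # bias starts at 0 by default

x = rng.standard_normal(n_inputs)                # one example input
z = W.T @ x + b                                  # pre-activation: W^T x + b
print(z.shape)                                   # (3,)
```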
So, the main question is: how random should that "random initialization" actually be?
Things to consider while initializing the weights: they should not be zero, or else the input data will not contribute to the output, and they should not all be the same, or we will fail to break symmetry (the symmetry problem), as the short sketch below illustrates. As Jason Brownlee sir puts it, 'Nodes that are side-by-side in a hidden layer connected to the same node must have different weights for the learning algorithm to update the weights.'
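The symmetry problem is easy to see numerically. A small illustrative sketch (the numbers are my own, not from the article): if every weight starts at the same value, every neuron in the layer computes the same output, so every neuron receives the same gradient and they can never become different.

```python
# Identical weights -> identical neuron outputs -> identical gradients (symmetry never breaks).
import numpy as np

x = np.array([0.5, -1.2, 2.0])           # one input example
W_same = np.full((3, 4), 0.7)            # all weights identical
print(W_same.T @ x)                      # all 4 neurons output the same value

W_rand = np.random.default_rng(1).standard_normal((3, 4))
print(W_rand.T @ x)                      # distinct outputs -> symmetry is broken
```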
Now the only option is to initialize with non-zero, distinct values, typically drawn from a normal distribution with mean 0 and standard deviation 1. But what if those randomly initialized weights turn out gigantic or tiny? Initializing the weights randomly without any further care brings two possible issues: 1) vanishing gradients and 2) exploding gradients. To tackle these issues, researchers have come up with multiple approaches; the two most popular are Xavier/Glorot initialization and He initialization. Summarizing how these two techniques work in simple words: they shrink the variance of the weights to roughly 1/n, where n is the number of input connections, so that the scale of the activations (and gradients) stays roughly constant from layer to layer, which helps minimize the issues mentioned above. A minimal sketch of this scaling idea follows below.
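Here is a hedged sketch of the scaling idea in plain NumPy (my own summary of the variance rules, not the authors' exact code; the fan_in/fan_out values are illustrative):

```python
# Draw weights from a zero-mean normal, but shrink the standard deviation so the
# variance is about 2/(fan_in + fan_out) (Xavier/Glorot) or 2/fan_in (He).
import numpy as np

fan_in, fan_out = 512, 256
rng = np.random.default_rng(42)

# Xavier/Glorot, normal variant: std = sqrt(2 / (fan_in + fan_out))
W_glorot = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

# He, normal variant: std = sqrt(2 / fan_in), suited to ReLU layers
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

print(W_glorot.std(), W_he.std())   # both well below 1
```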
When to use which? Answer: if we are using the ReLU activation function in our hidden layers, it is preferred to use He initialization, developed by Kaiming He sir, and if the activation function is sigmoid/tanh, we will get the best results from Xavier/Glorot initialization, developed by Xavier Glorot sir. We can set these in the Dense layers of a Keras Sequential model through the kernel_initializer parameter.
And last, each weight initialization technique comes in two variants: one that follows the normal distribution and one that follows the uniform distribution.
To implement Xavier/Glorot weight initialization following the normal distribution, use kernel_initializer='glorot_normal'; for the uniform distribution, use kernel_initializer='glorot_uniform'.
To implement He weight initialization following the normal distribution, use kernel_initializer='he_normal'; for the uniform distribution, use kernel_initializer='he_uniform'.
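A minimal Keras sketch showing where kernel_initializer goes (the layer sizes, input shape, and activations are my own illustrative choices, not from the article):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),                                             # hypothetical input size
    layers.Dense(64, activation="relu", kernel_initializer="he_normal"),  # ReLU -> He
    layers.Dense(32, activation="tanh", kernel_initializer="glorot_uniform"),  # tanh -> Glorot
    layers.Dense(1, activation="sigmoid"),                                # default is glorot_uniform
])
model.summary()
```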
By default, the weight initializer for a neural network's Dense layers is glorot_uniform. In the image I have shown what each initialization technique does to the standard deviation used to initialize the weights. In the denominators I have used the notation fan_in and fan_out: fan_in is the number of nodes in the previous layer, and fan_out is the number of nodes in the current layer.
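In case the image does not render here, the quantities behind it can be summarized as follows, to the best of my knowledge of the Keras documentation (the fan_in/fan_out values are just an example):

```python
# Standard deviation (normal variants) and sampling limit (uniform variants)
# used by the Glorot and He initializers.
import math

fan_in, fan_out = 128, 64

glorot_normal_std  = math.sqrt(2.0 / (fan_in + fan_out))
glorot_uniform_lim = math.sqrt(6.0 / (fan_in + fan_out))   # samples from U(-lim, +lim)
he_normal_std      = math.sqrt(2.0 / fan_in)
he_uniform_lim     = math.sqrt(6.0 / fan_in)               # samples from U(-lim, +lim)

print(glorot_normal_std, glorot_uniform_lim, he_normal_std, he_uniform_lim)
```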