Activation Functions and Their Types
Abhishek Zirange
Consultant | Product Engineer | Freelance Python Developer , Django, NLP, Machine learning | Redis | Celery | Web Scraping Specialist
NOTE: I would recommend reading up on the basics of Artificial Neural Networks before reading this article, for a better understanding.
Activation functions are essential for an Artificial Neural Network to learn complicated, non-linear mappings between the inputs and the response variable. They introduce non-linear properties into our network. Their main purpose is to convert the input signal of a node in an A-NN into an output signal, which is then used as an input to the next layer in the stack.
Specifically, in an A-NN we compute the sum of products of the inputs (X) and their corresponding weights (W), apply an activation function f to that sum to get the output of the layer, and feed it as input to the next layer.
So what does an artificial neuron do? Simply put, it calculates a “weighted sum” of its inputs, adds a bias, and then decides whether it should be “fired” or not.
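Before looking at specific activation functions, here is a minimal NumPy sketch of that neuron computation (the function and variable names are my own, for illustration only):

import numpy as np

def neuron_output(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus a bias, passed through the activation function
    z = np.dot(weights, inputs) + bias
    return activation(z)

# Example with made-up numbers and the sigmoid activation defined in the next section:
# neuron_output(np.array([0.5, -1.2]), np.array([0.8, 0.3]), bias=0.1, activation=sigmoid)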
1. Sigmoid
Sigmoid takes a real value as input and outputs a value between 0 and 1. The main reason we use the sigmoid function is that its output lies in the range (0, 1), so it is especially useful for models where we have to predict a probability. Since a probability always lies between 0 and 1, sigmoid is a natural choice. Its graph is an S-shaped curve.
Because the sigmoid squashes every input into a bounded range, it keeps the activations from blowing up during training.
It is an activation function of the form f(x) = 1 / (1 + exp(-x)).
import numpy as np

def sigmoid(x):
    # Squash x into the range (0, 1)
    return 1 / (1 + np.exp(-x))
Pros:
- It is nonlinear in nature. Combinations of this function are also nonlinear!
- It gives an analog (graded) activation, unlike a step function.
- It has a smooth gradient too.
- Its output can be read as a probability, which makes it a good fit for binary classifiers.
- The output of the activation function is always in the range (0, 1), compared to (-inf, inf) for a linear function, so the activations stay bounded and won’t blow up.
Cons:
- It gives rise to the problem of “vanishing gradients” (see the short sketch after this list).
- Its output isn’t zero-centered, so gradient updates tend to zig-zag in one direction or another; with 0 < output < 1, optimization becomes harder.
- Sigmoids saturate and kill gradients.
- Sigmoids converge slowly.
- The network refuses to learn further, or learns drastically slowly (depending on the use case, and until the gradients or computations run into floating-point limits).
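To make the vanishing-gradient point concrete, here is a minimal sketch (redefining the sigmoid from above) that evaluates the sigmoid’s derivative, sigmoid(x) * (1 - sigmoid(x)). It peaks at 0.25 at x = 0 and shrinks toward zero for large |x|, which is what starves the early layers of a deep network of gradient:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), never larger than 0.25
    s = sigmoid(x)
    return s * (1 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_derivative(x))  # 0.25, ~0.105, ~0.0066, ~0.000045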
2. Tanh Activation Function:
fig. Tanh function and derivative
Tanh is a rescaled version of the sigmoid function, so it shares many of the sigmoid’s properties; it maps a real input to a value between -1 and 1 (tanh(x) = 2 * sigmoid(2x) - 1).
fig. Sigmoidal representation of Tanh
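Mirroring the sigmoid snippet above, here is a minimal NumPy sketch of tanh and its derivative (the function names are my own; np.tanh would do the same job):

import numpy as np

def tanh(x):
    # Equivalent to np.tanh(x); maps x into the range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)**2, which lies in (0, 1]
    return 1 - tanh(x) ** 2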
- The function is monotonic, but its derivative is not.
- The output is zero-centered.
- Optimization is easier than with sigmoid because of the zero-centered output.
- The derivative of the tanh function, f’(x) = 1 - tanh²(x), lies between 0 and 1.
Cons:
- The derivative of the tanh function still suffers from the “vanishing gradient and exploding gradient” problem.
- Slow convergence, as it is computationally heavy (it relies on the exponential function).
“Tanh is preferred over the sigmoid function since it is zero centered and the gradients are not restricted to move in a certain direction”
3. ReLU Activation Function (Rectified Linear Unit):
fig. ReLU function and its derivative
ReLU is a non-linear activation function that has gained wide popularity in deep learning. It is defined as f(x) = max(0, x).
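As a quick illustration (the function names are my own), here is a minimal NumPy sketch of ReLU and its derivative:

import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 where x > 0, else 0 (the derivative at exactly 0 is undefined; 0 is used by convention here)
    return np.where(x > 0, 1.0, 0.0)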
- The function and its derivative are both monotonic.
- The main advantage of ReLU is that it does not activate all the neurons at the same time: neurons with negative input output zero.
- Computationally efficient
- The derivative of the ReLU function, f’(x), is 1 if x > 0 and 0 otherwise.
- Converges very fast.
Cons:
- The ReLU function is not zero-centered: its output is always ≥ 0, which makes the gradient updates tend to move in the same direction and makes optimization harder.
- The biggest problem is the dead (dying) neuron: for negative inputs the gradient is exactly zero, so the affected weights stop updating. (ReLU is also non-differentiable at zero.)
“Problem of the dying/dead neuron: since the ReLU derivative is not 0 for positive inputs (f’(x) = 1 for x > 0), ReLU does not saturate there and those neurons keep learning. Saturation and a vanishing gradient occur only for negative inputs, which ReLU turns into 0; a neuron stuck in that region stops updating, and this is called the dying neuron problem.”
4. Leaky ReLU Activation Function:
Leaky ReLU is an improved version of the ReLU function that introduces a small constant slope for negative inputs: f(x) = x if x > 0, otherwise αx, where α is a small constant such as 0.01 (a minimal sketch follows the figure below).
fig. Leaky ReLU activation and its derivative
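Here is a minimal NumPy sketch of Leaky ReLU and its derivative (α = 0.01 is a common default, assumed here; the function names are my own):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, alpha * x for negative inputs
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for positive inputs, alpha for negative inputs
    return np.where(x > 0, 1.0, alpha)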
- Leaky ReLU is defined to address the problem of dying/dead neurons.
- The dying-neuron problem is addressed by the small slope: negative values are scaled by α instead of being zeroed out, which lets the corresponding neurons “stay alive”.
- The function and its derivative are both monotonic.
- It allows gradients to flow for negative inputs during backpropagation.
- It is efficient and easy to compute.
- The derivative of Leaky ReLU is 1 when x > 0 and α (a small value between 0 and 1) when x < 0.
Cons:
- Leaky ReLU does not provide consistent predictions for negative input values.
So, which one is better to use?
The answer is that nowadays we should generally use ReLU, and apply it only to the hidden layers. If your model suffers from dead neurons during training, switch to Leaky ReLU or the Maxout function.
Sigmoid and tanh should generally be avoided in the hidden layers because of the vanishing gradient problem, which makes a deep neural network hard to train and degrades its accuracy and performance. A small forward-pass sketch illustrating this advice follows.
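As a rough illustration of that advice, here is a minimal NumPy sketch of a two-layer forward pass that uses ReLU in the hidden layer and sigmoid at the output (all weights, shapes, and inputs are made-up placeholders, not trained values):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))           # 4 input features (placeholder data)
W1 = rng.normal(size=(8, 4))        # hidden layer: 8 units
b1 = np.zeros(8)
W2 = rng.normal(size=(1, 8))        # output layer: 1 unit
b2 = np.zeros(1)

hidden = relu(W1 @ x + b1)          # ReLU in the hidden layer
output = sigmoid(W2 @ hidden + b2)  # sigmoid at the output gives a probability
print(output)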