Machine Learning & Activation Functions
Machine learning uses neural networks as the structural framework for its engine of learning and prediction. Each neuron in a neural network is built from weights, biases and an activation function, and together these form the core skeleton of the learning model.
The aim of ML and neural networks is to mimic the learning process of the human brain, so that machines can learn from training data the way the brain learns from experience and then predict results for new input data.
Weights and biases are the linear part of a neuron: they map the inputs to a weighted sum. But the real-world settings where ML programs are used, such as image classification, NLP (Natural Language Processing) and other practical problems, are more non-linear than linear. Just as the human brain filters out less useful information and keeps what matters before taking the next step or arriving at a better decision, a neuron uses its activation function to decide whether its input should be passed on for further processing or discarded at the first step.
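As a rough sketch of how the linear and non-linear parts fit together, the snippet below computes the output of a single neuron in Python with NumPy; the input values, weights and bias are made up purely for illustration.

import numpy as np

def sigmoid(z):
    # squash the pre-activation into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical inputs and parameters for one neuron with three inputs
x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.8, 0.1, -0.4])   # weights (linear part)
b = 0.2                          # bias (linear part)

z = np.dot(w, x) + b             # linear combination: w . x + b
a = sigmoid(z)                   # non-linear activation decides what gets passed on
print(f"pre-activation z = {z:.3f}, activated output a = {a:.3f}")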
There are many activation functions currently used by practitioners to meet this need while training a network. A few common examples are listed below, followed by a short code sketch that compares them –
· Sigmoid
It is an activation function of the form f(x) = 1 / (1 + exp(-x)). Its range is between 0 and 1, and it is an S-shaped curve. It is easy to understand and apply, but two major problems have made it fall out of popularity –
o Vanishing gradient problem: for inputs far from zero the sigmoid saturates and its gradient is nearly zero, so the weights in earlier layers receive very small updates.
o Secondly, its output is not zero-centred: since 0 < output < 1, gradient updates tend to be pushed too far in one direction, which makes optimization harder.
· Tanh – Hyperbolic Tangent
o Its mathematical formula is f(x) = (1 - exp(-2x)) / (1 + exp(-2x)). Its output is zero-centred because its range is between -1 and 1, i.e. -1 < output < 1. Optimization is therefore easier, and in practice tanh is generally preferred over the sigmoid function. However, it still suffers from the vanishing gradient problem.
· ReLU – Rectified Linear Unit
o It has become very popular in the past couple of years, and it has been reported to converge about six times faster than tanh. It is simply R(x) = max(0, x), i.e. if x < 0 then R(x) = 0, and if x >= 0 then R(x) = x. As the formula shows, it is very simple and efficient to compute, and for positive inputs its gradient is constant, so it avoids the vanishing gradient problem. Almost all deep learning models use ReLU nowadays.
o Its limitation is that it should only be used within the hidden layers of a neural network model.
· Leaky ReLU
o Another problem with ReLU is that some gradients can be fragile during training and die: a weight update can push a neuron into a region where it never activates again on any data point. Put simply, ReLU can result in dead neurons.
o To fix this problem of dying neurons, a modification called Leaky ReLU was introduced. It gives negative inputs a small slope instead of a flat zero, which keeps the gradient updates alive.
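As a minimal sketch, the code below implements the four activation functions discussed above using NumPy; the leaky-ReLU slope of 0.01 is a common default, not a value fixed here.

import numpy as np

def sigmoid(x):
    # S-shaped curve, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # zero-centred, output in (-1, 1)
    return np.tanh(x)

def relu(x):
    # R(x) = max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small slope alpha for negative inputs keeps gradients alive
    return np.where(x >= 0, x, alpha * x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, fn(x))

Running it on the sample inputs makes the ranges easy to compare: sigmoid stays within (0, 1), tanh within (-1, 1), ReLU zeroes out the negative values, and Leaky ReLU keeps a small negative signal.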
The success of a machine learning program depends on how quickly and accurately it learns the hidden patterns in the training data. Activation functions play a major role in the back propagation of errors from the training set, and hence in how well the weights are updated. The activation functions listed above each serve specific purposes, but no single one fits every scenario of data classification, regression or NLP. One can also devise activation functions for a specific problem, depending on how well one understands the training data and the patterns behind the learning model.
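To make the role of the derivative concrete, here is a small sketch (again assuming NumPy, with an arbitrary pre-activation value chosen for illustration) of why back propagation behaves so differently with sigmoid and ReLU: the gradient flowing back through a layer is scaled by the activation's derivative, and the sigmoid's derivative is at most 0.25, so repeated multiplication shrinks the signal, while an active ReLU unit passes a derivative of 1.

import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # peaks at 0.25 when x = 0

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)  # 1 where the unit is active, 0 where it is "dead"

x = 2.0        # arbitrary pre-activation value
layers = 10    # imagine the gradient passing back through 10 such activations
print("sigmoid:", sigmoid_grad(x) ** layers)  # shrinks towards zero
print("relu:   ", relu_grad(x) ** layers)     # stays at 1 while the unit is active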
Hence this knowledge area of understanding and devising activation functions, so that the neural network in an ML program learns efficiently and delivers predictions as close as possible to 100% accuracy, holds many opportunities and challenges for the ML discipline in the coming years.