Who will cry when ReLUs die? : Exploring the World of ReLU Variants
If you’re familiar with neural networks and artificial intelligence, activation functions are a term you must have come across. Activation functions like Sigmoid, Tanh, and ReLU are essential for introducing non-linearity into neural networks. Without activation functions, neural networks would behave like simple linear models.
For a long time, the Sigmoid was believed to be the most suitable choice of activation function. The reason for this belief was the analogy with biological neurons: researchers theorised that biological neurons fire in a manner similar to a Sigmoid function.
Rise of the ReLU
In 2010, Glorot and Bengio, in their paper “Understanding the difficulty of training deep feedforward neural networks”, concluded that the problem of unstable gradients in neural network training was partly caused by a poor choice of activation function.
The ReLU (Rectified Linear Unit) turns out to be an excellent choice of activation function because it does not saturate for positive values and is fast to compute. In practice, ReLU has become the default choice because of its robustness and ease of implementation (even though it is not differentiable at x=0).
ReLU(x) = max(0, x)
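In code, ReLU is a one-liner. Here’s a minimal NumPy sketch (the function name and sample values are my own, not taken from any particular library):

import numpy as np

def relu(x):
    # Pass positive values through unchanged; clamp everything else to 0
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]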
But ReLUs die…
ReLUs are not perfect. By their very nature, ReLUs output 0 for any negative input. As a result, during training some neurons “die”, i.e., they stop outputting anything other than 0. This problem is infamously known as “dying ReLUs”. In some cases it can be as drastic as 50% of the neurons dying.
ReLUs Leak to avoid death
Leaky ReLU
To address the issue of zero-valued outputs, Leaky ReLU is a viable alternative.
LeakyReLUα(x) = max(α*x, x) where α is a hyperparameter (typically set to 0.01)
Here, ‘α’ is the slope of the function when the input is less than 0. This ensures that the neuron never “dies”, since the gradient is never exactly zero for negative inputs.
Leaky variants mostly outperform the vanilla ReLU function.
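Here’s a minimal NumPy sketch of Leaky ReLU (the default α of 0.01 follows the typical value mentioned above; the rest is my own illustrative code):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by alpha instead of being zeroed out
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.02  0.    3.  ]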
Randomized Leaky ReLU
RReLU (Randomized Leaky ReLU) is a variant of Leaky ReLU where ‘α’, instead of being a fixed value, is picked randomly from a range of values during training. During testing, α is fixed to the average of that range.
RReLUα(x) = max(α*x, x) where α ~ U(l, u)
RReLU also seems to act like a regularisation technique and reduces the risk of overfitting.
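A rough NumPy sketch of the idea (the bounds l = 1/8 and u = 1/3 are illustrative choices, and a single α is sampled per call for simplicity, whereas implementations typically sample one per activation):

import numpy as np

def rrelu(x, lower=0.125, upper=1/3, training=True, rng=None):
    if training:
        # Training: sample a random negative-side slope from U(lower, upper)
        rng = rng or np.random.default_rng()
        alpha = rng.uniform(lower, upper)
    else:
        # Testing: fix alpha to the average of the sampling range
        alpha = (lower + upper) / 2.0
    return np.maximum(alpha * x, x)

print(rrelu(np.array([-2.0, 0.0, 3.0]), training=False))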
Parametric Leaky ReLU
In PReLU (Parametric Leaky ReLU), ‘α’, instead of being a hyperparameter, is a parameter that is learnt during training.
PReLUα(x) = max(α*x, x) where α is a learnable parameter
PReLU has been observed to work well on large image datasets but runs the risk of overfitting the model on smaller datasets.
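Because α has to be updated by backpropagation, PReLU is easiest to illustrate with a framework that tracks gradients. A minimal PyTorch sketch (the layer sizes are arbitrary; torch.nn.PReLU holds α as a learnable parameter, initialised to 0.25 by default):

import torch
import torch.nn as nn

# A tiny model in which the PReLU slope is learnt along with the weights
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.PReLU(),   # one learnable alpha, updated by the optimiser like any weight
    nn.Linear(32, 1),
)

out = model(torch.randn(4, 16))
# The PReLU slope appears among the trainable parameters (named '1.weight' here)
print([name for name, _ in model.named_parameters()])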
Exponential advantage
Exponential Linear Unit
Exponential Linear Unit (ELU) modifies the ReLU to address the vanishing gradients problem by pushing the average output closer to 0. It also has a nonzero gradient for x < 0, which addresses the ‘dying ReLU’ problem. The smoothness of the function also helps Gradient Descent converge faster.
ELUα(x) = α*(exp(x) - 1) if x < 0, and x if x ≥ 0, where α is a hyperparameter (typically set to 1)
Though ELU generally outperforms ReLU and the leaky variants, the exponential makes it slower to compute, so training takes more time.
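A minimal NumPy sketch of ELU under the definition above (α = 1 is an assumed default):

import numpy as np

def elu(x, alpha=1.0):
    # Smooth exponential curve for negative inputs, identity for positive ones
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

print(elu(np.array([-2.0, 0.0, 3.0])))  # approx. [-0.865  0.  3.]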
Scaled Exponential Linear Unit
Scaled Exponential Linear Unit (SELU) is, as the name suggests, a scaled variant of ELU. It addresses the vanishing/exploding gradient problem by ‘self normalising’ the output of each layer (preserving the mean of 0 and standard deviation of 1).
SELUλα(x) = λ * ELUα(x), i.e. λ*α*(exp(x) - 1) if x < 0, and λ*x if x ≥ 0, where α ≈ 1.6733 and λ ≈ 1.0507 are fixed constants
SELU works well but imposes a few conditions: the inputs must be standardised, each layer must use LeCun initialisation, and the self-normalising guarantee holds only for plain sequential architectures.
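A minimal NumPy sketch of SELU, using the fixed constants commonly quoted from the original paper (Klambauer et al., 2017):

import numpy as np

# Constants derived in the SELU paper so that activations self-normalise
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554804

def selu(x):
    # Scaled ELU: lambda rescales the output to preserve mean 0 / std 1 across layers
    return LAMBDA * np.where(x < 0, ALPHA * (np.exp(x) - 1.0), x)

print(selu(np.array([-2.0, 0.0, 3.0])))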
So, which Linear Unit is better?
In general, SELU > ELU > Leaky ReLU and variants > ReLU
ReLU, however, remains the first choice because of speed and simplicity. The legacy of ReLUs, therefore, lives on, evolving and adapting to the ever-changing demands of the deep learning realm.
If you’re someone who follows Machine Learning, Data Science, Generative AI and Large Language Models, let’s connect on LinkedIn — https://www.dhirubhai.net/in/abhinav-kimothi/
Are you interested in learning about Retrieval Augmented Generation? I am writing A Simple Guide to Retrieval Augmented Generation that is in development with Manning Publications Co. The first five out of the total nine scheduled chapters have been released as part of the Manning Early Access Program.
A Simple Guide to Retrieval Augmented Generation is a foundational guide for individuals looking to explore Retrieval Augmented Generation. This book is for technology professionals who want to get introduced to the concept of Retrieval Augmented Generation and build LLM-based apps. It will prove to be a handy book for beginners as well as experienced professionals. You'll also get an opportunity to code along in Python, but the book is intended for non-coders as well.
You can join the MEAP here: https://mng.bz/jXJ9
If you'd like to get hands-on, the GitHub repo of the book is public. You can clone/fork it here: https://github.com/abhinav-kimothi/A-Simple-Guide-to-RAG
If you're interested in getting a sneak peek into the book, check out:
First Chapter Summary Video: https://youtu.be/R9E2WQhNz-0?si=jPlMmCIbtabkUKgc
Video Introduction: https://www.youtube.com/watch?v=hHEdxptDgJM
Read Livebook: https://livebook.manning.com/book/a-simple-guide-to-retrieval-augmented-generation/chapter-1/v-2/?utm_source=kimothi&utm_medium=affiliate&utm_campaign=affiliate&a_aid=kimothi