Who will cry when ReLUs die? : Exploring the World of ReLU Variants
Image Generated using Yarnit DreamBrush


If you’re familiar with neural networks and artificial intelligence, “activation function” is a term you must have come across. Activation functions like Sigmoid, Tanh and ReLU are essential for introducing non-linearity into neural networks. Without activation functions, neural networks would behave like simple linear models.
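
A quick way to see this is to note that stacking linear layers without an activation in between collapses into a single linear layer. A small NumPy sketch (purely illustrative; the layer sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights of a first "layer"
W2 = rng.normal(size=(2, 4))   # weights of a second "layer"
x = rng.normal(size=3)

# With no non-linearity, two stacked linear layers are just one linear map
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))   # True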

For a long time, the Sigmoid was believed to be the most suitable choice of activation function. The reason for this belief was the comparison to biological neurons: researchers theorised that biological neurons follow a response curve similar to the Sigmoid.

Rise of the ReLU

In 2010, Glorot and Bengio, in their paper “Understanding the difficulty of training deep feedforward neural networks”, concluded that the problem of unstable gradients in neural network training was partly due to a poor choice of activation function.

It turns out that the ReLU (Rectified Linear Unit) is an excellent choice of activation function because it does not saturate for positive values and is fast to compute. In practice, ReLU has become the default choice because of its robustness and ease of implementation (even though it is not differentiable at x = 0).

ReLU(x) = max(0, x)
[Figure: Plot of the ReLU function]
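
As a minimal illustration (a NumPy sketch, not tied to any particular framework):

import numpy as np

def relu(x):
    # Element-wise max(0, x): positive values pass through, negatives become 0
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))   # values: 0, 0, 0, 1.5, 3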

But ReLUs die…

ReLUs are not perfect. By their very nature, ReLUs output 0 for any negative input. As a result, during training, some neurons “die”, i.e. they stop outputting anything other than 0. This problem is infamously known as “dying ReLUs”. In some cases, it can be as drastic as 50% of the neurons dying.
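
One rough way to spot this in practice (an illustrative diagnostic; the dead_fraction helper below is hypothetical) is to count the units in a layer that output 0 for every example in a batch:

import numpy as np

def dead_fraction(activations):
    # activations: shape (batch_size, num_units), post-ReLU outputs of one layer
    dead = np.all(activations == 0, axis=0)   # units that never fire on this batch
    return dead.mean()

# Synthetic example: shifting pre-activations far negative makes most units look dead
acts = np.maximum(0, np.random.default_rng(0).normal(size=(256, 128)) - 3.0)
print(dead_fraction(acts))   # a large fraction of units never fire on this batch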

ReLUs Leak to avoid death

Leaky ReLU

To address the issue of zero-valued outputs for negative inputs, Leaky ReLU is a viable alternative.

LeakyReLUα(x) = max(α*x, x) where α is a hyperparameter (typically set to 0.01)

Here, ‘α’ is the slope of the function when the input is less than 0. This keeps the gradient non-zero for negative inputs, so neurons never “die”.

[Figure: Plot of the Leaky ReLU function]
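
A minimal NumPy sketch of the same idea:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0 instead of a hard 0
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, -0.5, 1.5])))   # values: -0.02, -0.005, 1.5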

Leaky variants mostly outperform the vanilla ReLU function.

Randomized Leaky ReLU

RReLU (Randomized Leaky ReLU) is a variant of Leaky ReLU where ‘α’, instead of being a fixed value, is picked randomly from a range during training. During testing, α is fixed to the average of that range.

RReLUα(x) = max(α*x, x) where α ~ U(l, u)

RReLU also seems to act as a regularisation technique and reduces the risk of overfitting.
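
A rough sketch of the behaviour (the rrelu helper is illustrative; the bounds of 1/8 and 1/3 are a common default, used for instance by PyTorch’s nn.RReLU):

import numpy as np

rng = np.random.default_rng(0)

def rrelu(x, lower=1/8, upper=1/3, training=True):
    if training:
        # A fresh slope is sampled from U(lower, upper) for the negative side
        alpha = rng.uniform(lower, upper, size=x.shape)
    else:
        # At test time the slope is fixed to the average of the range
        alpha = (lower + upper) / 2
    return np.where(x >= 0, x, alpha * x)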

Parametric Leaky ReLU

In PReLU (Parametric Leaky ReLU), ‘α’, instead of being a hyperparameter, is a parameter that is learnt during training.

PReLUα(x) = max(α*x, x) where α is a learnable parameter

PReLU has been observed to work well on large image datasets but runs the risk of overfitting on smaller datasets.
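
Because α is learnt by backpropagation, PReLU is most naturally shown with an autograd framework. A small PyTorch sketch (illustrative, not from the article):

import torch
import torch.nn as nn

prelu = nn.PReLU(num_parameters=1, init=0.25)   # one learnable alpha, initialised to 0.25
x = torch.tensor([-2.0, -0.5, 1.5])
print(prelu(x))        # negative inputs are scaled by the current value of alpha
print(prelu.weight)    # the learnable alpha, updated by the optimiser during training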

Exponential advantage

Exponential Linear Unit

The Exponential Linear Unit (ELU) modifies the ReLU to address the vanishing gradients problem by pushing the average output closer to 0. It also has a non-zero gradient for x < 0, which addresses the ‘dying’ ReLU problem. The smoothness of the function also helps gradient descent converge faster.

ELUα(x) = α * (exp(x) − 1) if x < 0, x if x ≥ 0 (α is typically set to 1)
[Figure: Plot of the ELU function]
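
In NumPy terms (a minimal sketch):

import numpy as np

def elu(x, alpha=1.0):
    # Identity for x >= 0; a smooth exponential curve that saturates at -alpha for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))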

Though ELU generally outperforms ReLU and the leaky variants, it is slower to compute, so training takes more time.

Scaled Exponential Linear Unit

Scaled Exponential Linear Unit (SELU) is, as the name suggests, a scaled variant of ELU. It addresses the vanishing/exploding gradient problem by ‘self-normalising’ the output of each layer (preserving a mean of 0 and a standard deviation of 1).

SELUαλ(x) = λ * ELUα(x) where λ ≈ 1.0507 and α ≈ 1.6733
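
A minimal sketch using the constants published in the SELU paper (Klambauer et al., 2017):

import numpy as np

# Fixed scale and alpha from the SELU paper
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # lambda * x for x >= 0, lambda * alpha * (exp(x) - 1) for x < 0
    return LAMBDA * np.where(x >= 0, x, ALPHA * (np.exp(x) - 1.0))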

SELU works well but imposes a few conditions: the inputs must be standardised, every layer must be initialised with LeCun initialisation, and the self-normalising property only holds for plain sequential (dense, feed-forward) architectures.

So, which Linear Unit is better?

In general, SELU > ELU > Leaky ReLU and variants > ReLU

  • If the network architecture doesn’t meet the strict conditions for SELU, ELU may be better
  • If run-time latency is a concern, Leaky ReLU will be faster than SELU and ELU
  • RReLU works well if the model is overfitting

ReLU, however, remains the first choice because of speed and simplicity. The legacy of ReLUs, therefore, lives on, evolving and adapting to the ever-changing demands of the deep learning realm.

If you’re someone who follows Machine Learning, Data Science, Generative AI and Large Language Models, let’s connect on LinkedIn — https://www.dhirubhai.net/in/abhinav-kimothi/


Are you interested in learning about Retrieval Augmented Generation? I am writing A Simple Guide to Retrieval Augmented Generation, which is in development with Manning Publications Co. The first five of the nine scheduled chapters have been released as part of the Manning Early Access Program.

A Simple Guide to Retrieval Augmented Generation is a foundational guide for individuals looking to explore Retrieval Augmented Generation. This book is for technology professionals who want to get introduced to the concept of Retrieval Augmented Generation and build LLM-based apps. It will prove to be a handy book for beginners as well as experienced professionals. You'll also get an opportunity to code along in Python, but the book is intended for non-coders as well.

You can join the MEAP here: https://mng.bz/jXJ9

If you like to get hands-on, the GitHub repo of the book is public. You can clone/fork it here: https://github.com/abhinav-kimothi/A-Simple-Guide-to-RAG

If you're interested in getting a sneak peek into the book, check out:

First Chapter Summary Video: https://youtu.be/R9E2WQhNz-0?si=jPlMmCIbtabkUKgc

Video Introduction: https://www.youtube.com/watch?v=hHEdxptDgJM

Read Livebook: https://livebook.manning.com/book/a-simple-guide-to-retrieval-augmented-generation/chapter-1/v-2/?utm_source=kimothi&utm_medium=affiliate&utm_campaign=affiliate&a_aid=kimothi



