Discovering SwiGLU: The Activation Function Powering Modern LLMs

Activation functions are the unsung heroes of deep learning. They’re at the core of every neural network, influencing how the model learns, generalizes, and predicts. Among them, ReLU (Rectified Linear Unit) has been the gold standard for years, thanks to its simplicity and effectiveness.

But times are changing. The latest Large Language Models (LLMs) like LLaMA and Google’s PaLM are adopting a newer activation function: SwiGLU.

So, why the shift? Let’s explore.

The Limitations of ReLU

While ReLU has proven itself, it’s not without drawbacks:

1. Not Differentiable at Zero: ReLU’s gradient is undefined at x = 0, which can lead to optimization issues.

2. Saturation on the Negative Side: For negative inputs, ReLU outputs a constant zero, causing the “Dying ReLU” problem where certain neurons never activate.

3. Monotonicity: ReLU is monotonic (non-decreasing), which can sometimes limit the expressiveness of the network.

Enter SwiGLU: A Game-Changer in Activation Functions

SwiGLU, short for Swish-Gated Linear Unit, combines two innovative ideas:

  1. Swish Activation Function: A smoother, more expressive activation function compared to ReLU.
  2. GLU (Gated Linear Unit): A gating mechanism that adds more control over which parts of the input get activated.

Together, they form an activation powerhouse that’s both efficient and expressive.

Let’s Break It Down

What is GLU (Gated Linear Unit)?

The Gated Linear Unit (GLU) is an activation mechanism designed to improve the learning dynamics of deep models by introducing gating: a learned gate controls how much of each part of the input passes through.

Here’s how it works:

The Formula for GLU

The GLU layer takes an input vector x and computes the output as:

GLU(x) = (xW1 + b1) ⊙ σ(xW2 + b2)

Here’s what the terms mean:

  • x: Input vector.
  • W1,W2: Weight matrices for two linear transformations.
  • b1,b2: Bias terms for each transformation.
  • σ: Sigmoid activation function.
  • ⊙: Element-wise multiplication.
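
To make the formula concrete, here is a minimal PyTorch sketch of a GLU layer. The class and attribute names (value, gate) are illustrative choices, not a standard API:

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Minimal GLU layer: GLU(x) = (x W1 + b1) ⊙ σ(x W2 + b2)."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.value = nn.Linear(d_in, d_out)  # x W1 + b1
        self.gate = nn.Linear(d_in, d_out)   # x W2 + b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # element-wise product of the value branch and the sigmoid gate
        return self.value(x) * torch.sigmoid(self.gate(x))
```

(For reference, PyTorch also ships torch.nn.GLU, which takes a single wide projection and splits it in half along a chosen dimension; same idea, different packaging.)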


In SwiGLU, we use the Swish activation function in place of the sigmoid activation function.


What is the Swish Activation Function?

The Swish function is defined as:

Swish(x) = x · σ(βx)

Where σ is the sigmoid function:

σ(x) = 1/(1 + e^{-x})

and β is a constant or learnable parameter; with β = 1, Swish reduces to x · σ(x), also known as SiLU.

It’s a smooth, non-monotonic activation function that blends linear and non-linear behaviors.
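
A minimal sketch in PyTorch (the beta argument is the β from the definition above; with β = 1 this matches what PyTorch exposes as torch.nn.SiLU):

```python
import torch

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Swish(x) = x * sigmoid(beta * x); beta = 1.0 is the common SiLU form."""
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-3, 3, 7)
print(swish(x))  # smooth curve: slightly negative for x < 0, approaches x for large x
```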


Key Properties of Swish:

  • Non-zero gradients for negative inputs: Unlike ReLU, Swish doesn’t discard negative values entirely, allowing more nuanced learning.
  • Smoothness: The curve is differentiable everywhere, making optimization easier.
  • Expressiveness: Swish captures richer patterns in data than simple activation functions like ReLU.
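
Putting the two pieces together, here is a sketch of a SwiGLU feed-forward block in the spirit of what LLaMA-style models use; the layer names, the hidden size argument, and the bias-free linears are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward sketch: W_down( Swish(x W_gate) * (x W_up) )."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # Swish-gated branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # linear branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

The gate branch decides, feature by feature, how much of the linear branch’s output gets through, which is exactly the gating idea from GLU with Swish supplying the smooth non-linearity.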


Comparison of ReLU and Swish


While SwiGLU (and Swish) often outperforms ReLU in modern models, ReLU still has its edge in some scenarios:

1. Simplicity & Efficiency: ReLU is lightweight and computationally simple with its straightforward operation: ReLU(x) = max(0, x). In contrast, SwiGLU introduces extra steps like sigmoid and element-wise multiplication, making ReLU faster for simpler models or real-time systems.

2. Sparsity & Robustness: ReLU naturally produces sparse activations (many zero outputs), which can be beneficial for specific architectures and scenarios where sparsity improves learning or reduces overfitting.

So while SwiGLU shines in LLMs, ReLU remains a solid choice for certain tasks and smaller models!
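
If you want to see the sparsity difference for yourself, a few sample values make it obvious (printed values are approximate):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))  # [0.00, 0.00, 0.00, 0.50, 2.00]  -> negatives zeroed out (sparse)
print(F.silu(x))  # ~[-0.24, -0.19, 0.00, 0.31, 1.76] -> small non-zero outputs for negatives
```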

