Choosing the Right Activation Function: A Guide for Neural Network Enthusiasts

If you're diving into the world of neural networks, one of the critical choices you'll face is selecting the right activation function for your hidden layers and output layer. This decision can significantly impact your model's performance. In this article, we'll walk through the most common activation functions, their formulas, the ranges of their derivatives, and the implications for forward and backward propagation. We'll also touch on the vanishing gradient problem and how it relates to different activation functions, including how it affects weight updates and the learning process.

Understanding the Weight Update Formula

Before we dive into specific activation functions, let's break down the weight update formula used in neural networks during backpropagation:

w_new = w_old - η · δ · f'(x) · x

  • w_new: The updated weight.
  • w_old: The current weight.
  • η: The learning rate, a small positive value determining the step size for weight updates.
  • δ: The error signal for the neuron, i.e., the gradient of the loss with respect to the neuron's output.
  • f'(x): The derivative of the activation function.
  • x: The input to the neuron.

The vanishing gradient problem occurs when f′(x) becomes very small, causing the weight update to be negligible and slowing down the learning process.
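
To make the formula concrete, here is a minimal Python sketch (the function name and the numbers are illustrative, not taken from any particular framework):

```python
# Minimal sketch of the weight update rule; names and numbers are illustrative.
def update_weight(w_old, lr, delta, f_prime_x, x):
    """w_new = w_old - lr * delta * f'(x) * x"""
    return w_old - lr * delta * f_prime_x * x

# A healthy derivative moves the weight noticeably...
print(update_weight(w_old=0.5, lr=0.1, delta=0.1, f_prime_x=0.25, x=5.0))    # 0.4875
# ...while a saturated activation (tiny derivative) barely changes it.
print(update_weight(w_old=0.5, lr=0.1, delta=0.1, f_prime_x=0.0005, x=5.0))  # 0.499975
```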

Activation Functions for Hidden Layers

  1. Sigmoid

  • Formula: σ(x) = 1 / (1 + e^(-x))
  • Forward Propagation: Squashes input values to a range between 0 and 1, making it useful for probabilistic interpretations.
  • Range of Derivative: 0 < σ'(x) ≤ 0.25 (the maximum, 0.25, is reached at x = 0)
  • Vanishing Gradient Problem: For extreme values of x, σ(x) approaches 0 or 1, making σ'(x) close to 0, which leads to very small weight updates and slow learning.

Given:

Activation function: σ(x) = 1 / (1 + e^(-x))

Derivative: σ'(x) = σ(x) · (1 - σ(x))

Let's say our input x = 5.

  1. Forward Propagation: σ(5) = 1 / (1 + e^(-5)) ≈ 0.9933
  2. Backward Propagation: Gradient δ = 0.1
  3. Weight update: w_new = w_old - 0.1 · 0.1 · 0.9933 · (1 - 0.9933) · 5 ≈ w_old - 0.00033

In this example, the weight update is tiny because the derivative term σ'(5) ≈ 0.0066, leading to slow learning.
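
A short Python sketch reproduces these numbers (the starting weight w_old = 1.0 is an assumed value for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, w_old, lr, delta = 5.0, 1.0, 0.1, 0.1   # w_old = 1.0 is an assumed starting weight
s = sigmoid(x)                  # forward pass: ~0.9933
s_prime = s * (1.0 - s)         # derivative:   ~0.0066
w_new = w_old - lr * delta * s_prime * x
print(s, s_prime, w_new)        # update is only ~0.00033, so w_new ~ 0.99967
```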

2. Hyperbolic Tangent (tanh)

  • Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  • Forward Propagation: Maps inputs to a range between -1 and 1, centering the data around zero.
  • Range of Derivative: 0 < tanh'(x) ≤ 1
  • Vanishing Gradient Problem: For large values of |x|, tanh(x) approaches ±1, making tanh'(x) approach 0, causing slow learning due to minimal weight updates.

Given:

Activation function: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Derivative: tanh'(x) = 1 - tanh^2(x)

Let's continue with our input x = 5.

  1. Forward Propagation: tanh(5) = (e^5 - e^(-5)) / (e^5 + e^(-5)) ≈ 0.9999
  2. Backward Propagation: Gradient δ = 0.1
  3. Weight update: w_new = w_old - 0.1 · 0.1 · (1 - 0.9999^2) · 5 ≈ w_old - 0.00001

Here, the weight update is even smaller than with the sigmoid: at x = 5 the tanh curve is already deep in saturation, so tanh'(5) ≈ 0.0002. The steeper slope of tanh (derivative up to 1) only helps for inputs close to zero.
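
The same sketch with tanh (again assuming a starting weight of 1.0):

```python
import math

x, w_old, lr, delta = 5.0, 1.0, 0.1, 0.1   # same assumed values as the sigmoid example
t = math.tanh(x)                # forward pass: ~0.9999
t_prime = 1.0 - t ** 2          # derivative:   ~0.0002
w_new = w_old - lr * delta * t_prime * x
print(t, t_prime, w_new)        # update is only ~0.00001, so w_new ~ 0.99999
```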

3. ReLU (Rectified Linear Unit)

  • Formula: ReLU(x) = max(0, x)
  • Forward Propagation: Allows all positive values to pass through unchanged while setting all negative values to zero.
  • Range of Derivative: {0, 1}
  • Vanishing Gradient Problem: ReLU mitigates the vanishing gradient problem because the gradient is always 1 for positive values, ensuring effective backpropagation. However, it can cause "dead neurons" where some neurons never activate (i.e., when x ≤ 0).

Given:

Activation function: ReLU(x)=max(0,x)

Derivative: ReLU'(x) = { 0 if x < 0, 1 if x ≥ 0 }

Example with Positive Input x = 5

Given:

Input: x = 5

Gradient: δ = 0.1

  1. Forward Propagation: ReLU(5) = 5
  2. Backward Propagation: Weight update: w_new = w_old - 0.1 · 0.1 · 1 · 5 = w_old - 0.05

Example with Negative Input x = -5

Given:

Input: x = -5

Gradient: δ = 0.1

  1. Forward Propagation: ReLU(-5) = 0
  2. Backward Propagation: No weight update occurs (ReLU'(-5) = 0)

These examples illustrate how ReLU handles both positive and negative inputs. For positive inputs, it passes the gradient through unchanged, facilitating learning. For negative inputs, the gradient is zero, so no weight update occurs; a neuron whose input stays negative stops learning entirely, which is the "dead neuron" (dying ReLU) problem rather than a vanishing gradient in the strict sense.
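
A small Python sketch of both cases (the starting weight and gradient are assumed values, matching the examples above):

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # gradient at exactly x = 0 taken as 0 by convention

w_old, lr, delta = 1.0, 0.1, 0.1   # assumed values, matching the examples above
for x in (5.0, -5.0):
    w_new = w_old - lr * delta * relu_grad(x) * x
    print(x, relu(x), w_new)       # x = 5: update of 0.05; x = -5: no update at all
```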

4. Leaky ReLU

  • Formula: LeakyReLU(x) = max(0.01x, x)
  • Forward Propagation: Similar to ReLU but with a small slope for negative values, preventing dead neurons.
  • Range of Derivative: {0.01, 1} (0.01 for negative inputs, 1 for positive inputs)
  • Vanishing Gradient Problem: The small gradient for negative inputs helps avoid dead neurons and mitigates the vanishing gradient problem by ensuring that the gradient is never zero.

5. ELU (Exponential Linear Unit)

  • Formula: ELU(x) = { x if x > 0, α(e^x - 1) if x ≤ 0 }
  • Forward Propagation: Behaves like ReLU for positive inputs but decays smoothly toward -α for negative inputs, pushing mean activations closer to zero (at the cost of computing an exponential).
  • Range of Derivative: (0, 1] for α = 1 (the derivative is α·e^x for x ≤ 0 and 1 for x > 0)
  • Vanishing Gradient Problem: ELU handles vanishing gradients more effectively than plain ReLU because it maintains a non-zero gradient for x ≤ 0.

6. Swish

  • Formula: Swish(x) = x * σ(x)
  • Forward Propagation: A smooth, non-monotonic function that often outperforms ReLU on deeper networks.
  • Range of Derivative: Roughly between -0.1 and 1.1; it can dip slightly below zero for negative inputs because Swish is non-monotonic.
  • Vanishing Gradient Problem: The smooth and non-zero gradient ensures effective learning, reducing the risk of vanishing gradients.
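
To see how these ReLU variants keep a gradient alive where plain ReLU does not, here is a small sketch comparing their derivatives at x = -5 (α = 1.0 is an assumed ELU parameter; the Swish derivative is computed from its definition x · σ(x)):

```python
import math

alpha = 1.0   # assumed ELU parameter

def leaky_relu_grad(x):
    return 1.0 if x > 0 else 0.01

def elu_grad(x):
    return 1.0 if x > 0 else alpha * math.exp(x)   # derivative of alpha * (e^x - 1)

def swish_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))                 # sigmoid(x)
    return s + x * s * (1.0 - s)                   # d/dx [x * sigmoid(x)]

x = -5.0
print(leaky_relu_grad(x))   # 0.01    -> small but non-zero update
print(elu_grad(x))          # ~0.0067 -> small but non-zero update
print(swish_grad(x))        # ~-0.027 -> non-zero (slightly negative: Swish is non-monotonic)
```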

Activation Functions for Output Layer

  1. Sigmoid (for binary classification)

  • Formula: σ(x) = 1 / (1 + e^(-x))
  • Forward Propagation: Useful for outputting probabilities in binary classification tasks.

2. Softmax (for multi-class classification)

  • Formula: Softmax(z_i) = e^(z_i) / ∑(j=1 to N) e^(z_j)
  • Forward Propagation: Converts logits into probabilities that sum up to 1, making it ideal for multi-class classification.

Suppose we have a set of logits (raw prediction scores) for three classes: [2.0, 1.0, 0.5]. To calculate the softmax values for these logits, we'll follow these steps:

  1. Compute the exponentials of each logit.
  2. Calculate the sum of all exponentials.
  3. Divide each exponential value by the sum obtained in step 2.

Let's do the calculations:

  1. Compute exponentials:

  • e^2.0=7.389
  • e^1.0=2.718
  • e^0.5=1.649

2. Calculate the sum of exponentials:

  • 7.389+2.718+1.649=11.756

3. Divide each exponential by the sum:

  • softmax(2.0)=7.389/11.756≈0.629
  • softmax(1.0)=2.718/11.756≈0.231
  • softmax(0.5)=1.649/11.756≈0.140

So, the softmax values for the given logits [2.0, 1.0, 0.5] are approximately [0.629, 0.231, 0.140].

This process ensures that the softmax outputs are normalized and represent probabilities that sum up to 1.
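
The same calculation as a short Python sketch:

```python
import math

def softmax(logits):
    # For very large logits, subtract max(logits) first for numerical stability.
    exps = [math.exp(z) for z in logits]   # step 1: exponentials
    total = sum(exps)                      # step 2: their sum
    return [e / total for e in exps]       # step 3: normalize

print(softmax([2.0, 1.0, 0.5]))   # ~[0.629, 0.231, 0.140], sums to 1
```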

3. Linear (for regression tasks)

  • Formula: f(x) = x
  • Forward Propagation: Directly outputs the input value, suitable for regression problems where the output is a real number.

Conclusion

Choosing the right activation function is crucial for the success of your neural network. Each activation function has its strengths and weaknesses, especially concerning the vanishing gradient problem. Sigmoid and tanh functions are prone to vanishing gradients, making them less suitable for deep networks. ReLU and its variants like Leaky ReLU and ELU mitigate this issue but come with their own trade-offs. Swish offers a smooth, non-monotonic alternative that performs well on deeper networks.

Understanding the weight update formula and how each activation function's derivative affects learning can help you make informed decisions and build more effective neural networks. Happy coding!
