Choosing the Right Activation Function: A Guide for Neural Network Enthusiasts
Jothsna Praveena, Data Scientist
Machine Learning Engineer | GenAI | LLM | RAG | AWS Certified Machine Learning Associate | AI & Cloud Practitioner | NLP | Deep Learning | Project Management | Energy & Sustainability | MS in Data Analytics
If you're diving into the world of neural networks, one of the critical choices you'll face is selecting the right activation function for your hidden layers and output layer. This decision can significantly impact your model's performance. In this article, we'll walk through the various activation functions, their mathematical formulas, the ranges of their derivatives, and the implications for forward and backward propagation. We'll also touch on the vanishing gradient problem and how it relates to different activation functions, including how it affects weight updates and the learning process.
Understanding the Weight Update Formula
Before we dive into specific activation functions, let's break down the weight update formula used in neural networks during backpropagation:
w_new = w_old − η · δ · f′(x) · x
Here w_old and w_new are the weight before and after the update, η is the learning rate, δ is the error gradient propagated from the next layer, f′(x) is the derivative of the activation function, and x is the input to the neuron.
The vanishing gradient problem occurs when f′(x) becomes very small, causing the weight update to be negligible and slowing down the learning process.
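To see this numerically, here is a minimal Python sketch of the update rule; the values for the weight, η, δ, and x are made up purely for illustration:

```python
def weight_update(w_old, eta, delta, f_prime_x, x):
    """Apply w_new = w_old - eta * delta * f'(x) * x for a single weight."""
    return w_old - eta * delta * f_prime_x * x

# A saturated activation (f'(x) ~ 0.0066, like the sigmoid at x = 5)
# barely moves the weight, while f'(x) = 1 (like ReLU) gives a full-size step.
print(weight_update(w_old=0.5, eta=0.01, delta=0.1, f_prime_x=0.0066, x=5.0))  # ~0.499997
print(weight_update(w_old=0.5, eta=0.01, delta=0.1, f_prime_x=1.0, x=5.0))     # 0.495
```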
Activation Functions for Hidden Layers
1. Sigmoid
Given:
Activation function: σ(x) = 1 / (1 + e^−x)
Derivative: σ′(x) = σ(x) · (1 − σ(x))
Let's say our input x = 5.
In this example, the weight update is very small because the derivative term σ′(5) ≈ 0.0066 is close to zero, leading to slow learning.
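As a quick check, here is a minimal NumPy sketch of σ(5) and σ′(5):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = 5.0
print(sigmoid(x))        # ~0.9933 -- the function has saturated
print(sigmoid_prime(x))  # ~0.0066 -- a near-zero gradient, hence a tiny weight update
```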
2. Hyperbolic Tangent (tanh)
Given:
Activation function: tanh(x) = (e^x − e^−x) / (e^x + e^−x)
Derivative: tanh′(x) = 1 − tanh²(x)
Let's continue with our input x = 5.
Near zero, tanh is steeper than the sigmoid (its derivative peaks at 1 rather than 0.25), which generally helps learning. For a large input like x = 5, however, tanh also saturates: tanh′(5) ≈ 0.00018, so the weight update is still very small.
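The same quick check for tanh, again just a small NumPy sketch:

```python
import numpy as np

x = 5.0
t = np.tanh(x)
print(t)           # ~0.9999 -- tanh has saturated as well
print(1.0 - t**2)  # ~0.00018 -- the derivative is again close to zero
```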
3. ReLU (Rectified Linear Unit)
Given:
Activation function: ReLU(x)=max(0,x)
Derivative: ReLU′(x) = 0 if x < 0, 1 if x ≥ 0
Example with Positive Input x = 5
Given:
Input: x = 5
Gradient: δ = 0.1
Weight update: Δw = η · δ · ReLU′(5) · 5 = η · 0.5, so the full gradient passes through and the weight changes.
Example with Negative Input x = −5
Given:
Input: x = −5
Gradient: δ = 0.1
Weight update: No weight update occurs (ReLU′(−5) = 0)
These examples illustrate how ReLU handles positive and negative inputs. For positive inputs, it passes the gradient through unchanged, which facilitates learning. For negative inputs, the gradient is zero, so no weight update occurs; a neuron whose inputs are consistently negative stops learning altogether, a failure mode known as the dying ReLU problem.
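Here is a small sketch that runs both cases through the update rule; the learning rate η, gradient δ, and starting weight are illustrative values:

```python
def relu(x):
    return max(0.0, x)

def relu_prime(x):
    return 1.0 if x >= 0 else 0.0

eta, delta, w_old = 0.01, 0.1, 0.5
for x in (5.0, -5.0):
    w_new = w_old - eta * delta * relu_prime(x) * x
    print(f"x = {x:+.0f}: ReLU = {relu(x)}, ReLU' = {relu_prime(x)}, w_new = {w_new}")
# x = +5: the gradient passes through and the weight moves (0.5 -> 0.495)
# x = -5: the gradient is zero and the weight stays at 0.5
```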
4. Leaky ReLU
Given:
Activation function: LeakyReLU(x) = x if x ≥ 0, α·x if x < 0 (with a small slope such as α = 0.01)
Derivative: LeakyReLU′(x) = 1 if x ≥ 0, α if x < 0
Because the derivative is never exactly zero, negative inputs still receive a small weight update, which avoids the dead-neuron behaviour of plain ReLU.
5. ELU (Exponential Linear Unit)
Given:
Activation function: ELU(x) = x if x ≥ 0, α·(e^x − 1) if x < 0
Derivative: ELU′(x) = 1 if x ≥ 0, α·e^x if x < 0
ELU is smooth around zero and keeps a non-zero gradient for negative inputs, at the cost of computing an exponential.
6. Swish
Given:
Activation function: Swish(x) = x · σ(x)
Derivative: Swish′(x) = σ(x) + x · σ(x) · (1 − σ(x))
Swish is smooth and non-monotonic, and it often helps gradients flow in very deep networks. A comparison of these three variants is sketched below.
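To make the differences concrete, here is a small NumPy sketch of the derivatives of these three variants at a negative input; the α values are just common defaults, not requirements:

```python
import numpy as np

def leaky_relu_prime(x, alpha=0.01):
    return 1.0 if x >= 0 else alpha

def elu_prime(x, alpha=1.0):
    return 1.0 if x >= 0 else alpha * np.exp(x)

def swish_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))   # sigmoid(x)
    return s + x * s * (1.0 - s)   # sigmoid(x) + x * sigmoid'(x)

x = -5.0
print(leaky_relu_prime(x))  # 0.01    -- small but non-zero
print(elu_prime(x))         # ~0.0067 -- decays smoothly toward zero
print(swish_prime(x))       # ~-0.027 -- small and non-monotonic
# Unlike plain ReLU, none of these derivatives is exactly zero for negative inputs,
# so the weight still receives a (small) update.
```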
Activation Functions for Output Layer
1. Sigmoid (for binary classification)
The sigmoid squashes a single logit into the range (0, 1), so its output can be interpreted as the probability of the positive class.
2. Softmax (for multi-class classification)
Suppose we have a set of logits (raw prediction scores) for three classes: [2.0, 1.0, 0.5]. To calculate the softmax values for these logits, we'll follow these steps:
Let's do the calculations:
1. Calculate the exponential of each logit: e^2.0 ≈ 7.389, e^1.0 ≈ 2.718, e^0.5 ≈ 1.649.
2. Calculate the sum of exponentials: 7.389 + 2.718 + 1.649 ≈ 11.756.
3. Divide each exponential by the sum: 7.389 / 11.756 ≈ 0.629, 2.718 / 11.756 ≈ 0.231, 1.649 / 11.756 ≈ 0.140.
So, the softmax values for the given logits [2.0, 1.0, 0.5] are approximately [0.629, 0.231, 0.140].
This process ensures that the softmax outputs are normalized and represent probabilities that sum up to 1.
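The same computation in a few lines of Python (subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(logits):
    exps = np.exp(np.asarray(logits) - np.max(logits))  # shift for numerical stability
    return exps / exps.sum()

probs = softmax([2.0, 1.0, 0.5])
print(probs)        # ~[0.629, 0.231, 0.140]
print(probs.sum())  # 1.0
```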
3. Linear (for regression tasks)
The linear (identity) activation f(x) = x leaves the output unbounded, which is what you want when predicting a continuous value. Its derivative is a constant 1, so the output layer itself does not contribute to vanishing gradients.
Conclusion
Choosing the right activation function is crucial for the success of your neural network. Each activation function has its strengths and weaknesses, especially concerning the vanishing gradient problem. Sigmoid and tanh functions are prone to vanishing gradients, making them less suitable for deep networks. ReLU and its variants like Leaky ReLU and ELU mitigate this issue but come with their own trade-offs. Swish offers a smooth, non-monotonic alternative that performs well on deeper networks.
Understanding the weight update formula and how each activation function's derivative affects learning can help you make informed decisions and build more effective neural networks. Happy coding!