Choosing the Right Activation Function: A Guide for Neural Network Enthusiasts

If you're diving into the world of neural networks, one of the critical choices you'll face is selecting the right activation function for your hidden layers and output layer. This decision can significantly impact your model's performance. In this article, we'll walk through the most common activation functions, their formulas, the ranges of their derivatives, and the implications for forward and backward propagation. We'll also touch on the vanishing gradient problem and how it relates to different activation functions, including how it affects weight updates and the learning process.

Understanding the Weight Update Formula

Before we dive into specific activation functions, let's break down the weight update formula used in neural networks during backpropagation:

w_new = w_old - η · δ · f'(x) · x

  • w_new: The updated weight.
  • w_old: The current weight.
  • η: The learning rate, a small positive value determining the step size for weight updates.
  • δ: The error signal for the neuron, i.e., the gradient of the loss with respect to the neuron's output.
  • f'(x): The derivative of the activation function.
  • x: The input to the neuron.

The vanishing gradient problem occurs when f′(x) becomes very small, causing the weight update to be negligible and slowing down the learning process.
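
To make the formula concrete, here is a minimal Python sketch (the function name and the numbers are illustrative, not taken from any particular framework):

```python
# Minimal sketch of the weight update rule; names and numbers are illustrative.
def update_weight(w_old, lr, delta, f_prime_x, x):
    """w_new = w_old - lr * delta * f'(x) * x"""
    return w_old - lr * delta * f_prime_x * x

# A healthy derivative moves the weight noticeably...
print(update_weight(w_old=0.5, lr=0.1, delta=0.1, f_prime_x=0.25, x=5.0))    # 0.4875
# ...while a saturated activation (tiny derivative) barely changes it.
print(update_weight(w_old=0.5, lr=0.1, delta=0.1, f_prime_x=0.0005, x=5.0))  # 0.499975
```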

Activation Functions for Hidden Layers

  1. Sigmoid

  • Formula: σ(x) = 1 / (1 + e^(-x))
  • Forward Propagation: Squashes input values to a range between 0 and 1, making it useful for probabilistic interpretations.
  • Range of Derivative: 0 < σ'(x) ≤ 0.25 (the maximum, 0.25, is reached at x = 0)
  • Vanishing Gradient Problem: For extreme values of x, σ(x) approaches 0 or 1, making σ'(x) close to 0, which leads to very small weight updates and slow learning.

Given:

Activation function: σ(x) = 1 / (1 + e^(-x))

Derivative: σ'(x) = σ(x) · (1 - σ(x))

Let's say our input x = 5.

  1. Forward Propagation: σ(5) = 1 / (1 + e^(-5)) ≈ 0.9933
  2. Backward Propagation: Gradient δ = 0.1
  3. Weight update: w_new = w_old - 0.1 · 0.1 · 0.9933 · (1 - 0.9933) · 5 ≈ w_old - 0.00033

In this example, the weight update is tiny because the derivative term σ'(5) ≈ 0.0066, leading to slow learning.
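
A short Python sketch reproduces these numbers (the starting weight w_old = 1.0 is an assumed value for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, w_old, lr, delta = 5.0, 1.0, 0.1, 0.1   # w_old = 1.0 is an assumed starting weight
s = sigmoid(x)                  # forward pass: ~0.9933
s_prime = s * (1.0 - s)         # derivative:   ~0.0066
w_new = w_old - lr * delta * s_prime * x
print(s, s_prime, w_new)        # update is only ~0.00033, so w_new ~ 0.99967
```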

2. Hyperbolic Tangent (tanh)

  • Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  • Forward Propagation: Maps inputs to a range between -1 and 1, centering the data around zero.
  • Range of Derivative: 0 < tanh'(x) ≤ 1
  • Vanishing Gradient Problem: For large values of |x|, tanh(x) approaches ±1, making tanh'(x) approach 0, causing slow learning due to minimal weight updates.

Given:

Activation function: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Derivative: tanh'(x) = 1 - tanh^2(x)

Let's continue with our input x = 5.

  1. Forward Propagation: tanh(5) = (e^5 - e^(-5)) / (e^5 + e^(-5)) ≈ 0.9999
  2. Backward Propagation: Gradient δ = 0.1
  3. Weight update: w_new = w_old - 0.1 · 0.1 · (1 - 0.9999^2) · 5 ≈ w_old - 0.00001

Here, the weight update is even smaller than with the sigmoid: at x = 5 the tanh curve is already deep in saturation, so tanh'(5) ≈ 0.0002. The steeper slope of tanh (derivative up to 1) only helps for inputs close to zero.
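
The same sketch with tanh (again assuming a starting weight of 1.0):

```python
import math

x, w_old, lr, delta = 5.0, 1.0, 0.1, 0.1   # same assumed values as the sigmoid example
t = math.tanh(x)                # forward pass: ~0.9999
t_prime = 1.0 - t ** 2          # derivative:   ~0.0002
w_new = w_old - lr * delta * t_prime * x
print(t, t_prime, w_new)        # update is only ~0.00001, so w_new ~ 0.99999
```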

3. ReLU (Rectified Linear Unit)

  • Formula: ReLU(x) = max(0, x)
  • Forward Propagation: Allows all positive values to pass through unchanged while setting all negative values to zero.
  • Range of Derivative: {0, 1}
  • Vanishing Gradient Problem: ReLU mitigates the vanishing gradient problem because the gradient is always 1 for positive values, ensuring effective backpropagation. However, it can cause "dead neurons" where some neurons never activate (i.e., when x ≤ 0).

Given:

Activation function: ReLU(x)=max(0,x)

Derivative: ReLU'(x) = { 0 if x < 0, 1 if x ≥ 0 }

Example with Positive Input x = 5

Given:

Input: x = 5

Gradient: δ = 0.1

  1. Forward Propagation: ReLU(5) = 5
  2. Backward Propagation: Weight update: w_new = w_old - 0.1 · 0.1 · 1 · 5 = w_old - 0.05

Example with Negative Input x = -5

Given:

Input: x = -5

Gradient: δ = 0.1

  1. Forward Propagation: ReLU(-5) = 0
  2. Backward Propagation: No weight update occurs (ReLU'(-5) = 0)

These examples illustrate how ReLU handles both positive and negative inputs. For positive inputs, it passes the gradient through unchanged, facilitating learning. For negative inputs, the gradient is zero, so no weight update occurs; a neuron whose input stays negative stops learning entirely, which is the "dead neuron" (dying ReLU) problem rather than a vanishing gradient in the strict sense.
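
A small Python sketch of both cases (the starting weight and gradient are assumed values, matching the examples above):

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # gradient at exactly x = 0 taken as 0 by convention

w_old, lr, delta = 1.0, 0.1, 0.1   # assumed values, matching the examples above
for x in (5.0, -5.0):
    w_new = w_old - lr * delta * relu_grad(x) * x
    print(x, relu(x), w_new)       # x = 5: update of 0.05; x = -5: no update at all
```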

4. Leaky ReLU

  • Formula: LeakyReLU(x) = max(0.01x, x)
  • Forward Propagation: Similar to ReLU but with a small slope for negative values, preventing dead neurons.
  • Range of Derivative: {0.01, 1} (0.01 for negative inputs, 1 for positive inputs)
  • Vanishing Gradient Problem: The small gradient for negative inputs helps avoid dead neurons and mitigates the vanishing gradient problem by ensuring that the gradient is never zero.

5. ELU (Exponential Linear Unit)

  • Formula: ELU(x) = { x if x > 0, α(e^x - 1) if x ≤ 0 }
  • Forward Propagation: Behaves like ReLU for positive inputs but decays smoothly toward -α for negative inputs, pushing mean activations closer to zero (at the cost of computing an exponential).
  • Range of Derivative: (0, 1] for α = 1 (the derivative is α·e^x for x ≤ 0 and 1 for x > 0)
  • Vanishing Gradient Problem: ELU handles vanishing gradients more effectively than plain ReLU because it maintains a non-zero gradient for x ≤ 0.

6. Swish

  • Formula: Swish(x) = x * σ(x)
  • Forward Propagation: A smooth, non-monotonic function that often outperforms ReLU on deeper networks.
  • Range of Derivative: Roughly between -0.1 and 1.1; it can dip slightly below zero for negative inputs because Swish is non-monotonic.
  • Vanishing Gradient Problem: The smooth and non-zero gradient ensures effective learning, reducing the risk of vanishing gradients.
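
To see how these ReLU variants keep a gradient alive where plain ReLU does not, here is a small sketch comparing their derivatives at x = -5 (α = 1.0 is an assumed ELU parameter; the Swish derivative is computed from its definition x · σ(x)):

```python
import math

alpha = 1.0   # assumed ELU parameter

def leaky_relu_grad(x):
    return 1.0 if x > 0 else 0.01

def elu_grad(x):
    return 1.0 if x > 0 else alpha * math.exp(x)   # derivative of alpha * (e^x - 1)

def swish_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))                 # sigmoid(x)
    return s + x * s * (1.0 - s)                   # d/dx [x * sigmoid(x)]

x = -5.0
print(leaky_relu_grad(x))   # 0.01    -> small but non-zero update
print(elu_grad(x))          # ~0.0067 -> small but non-zero update
print(swish_grad(x))        # ~-0.027 -> non-zero (slightly negative: Swish is non-monotonic)
```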

Activation Functions for Output Layer

  1. Sigmoid (for binary classification)

  • Formula: σ(x) = 1 / (1 + e^(-x))
  • Forward Propagation: Useful for outputting probabilities in binary classification tasks.

2. Softmax (for multi-class classification)

  • Formula: Softmax(z_i) = e^(z_i) / ∑(j=1 to N) e^(z_j)
  • Forward Propagation: Converts logits into probabilities that sum up to 1, making it ideal for multi-class classification.

Suppose we have a set of logits (raw prediction scores) for three classes: [2.0, 1.0, 0.5]. To calculate the softmax values for these logits, we'll follow these steps:

  1. Compute the exponentials of each logit.
  2. Calculate the sum of all exponentials.
  3. Divide each exponential value by the sum obtained in step 2.

Let's do the calculations:

  1. Compute exponentials:

  • e^2.0=7.389
  • e^1.0=2.718
  • e^0.5=1.649

2. Calculate the sum of exponentials:

  • 7.389+2.718+1.649=11.756

3. Divide each exponential by the sum:

  • softmax(2.0)=7.389/11.756≈0.629
  • softmax(1.0)=2.718/11.756≈0.231
  • softmax(0.5)=1.649/11.756≈0.140

So, the softmax values for the given logits [2.0, 1.0, 0.5] are approximately [0.629, 0.231, 0.140].

This process ensures that the softmax outputs are normalized and represent probabilities that sum up to 1.
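
The same calculation as a short Python sketch:

```python
import math

def softmax(logits):
    # For very large logits, subtract max(logits) first for numerical stability.
    exps = [math.exp(z) for z in logits]   # step 1: exponentials
    total = sum(exps)                      # step 2: their sum
    return [e / total for e in exps]       # step 3: normalize

print(softmax([2.0, 1.0, 0.5]))   # ~[0.629, 0.231, 0.140], sums to 1
```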

3. Linear (for regression tasks)

  • Formula: f(x) = x
  • Forward Propagation: Directly outputs the input value, suitable for regression problems where the output is a real number.

Conclusion

Choosing the right activation function is crucial for the success of your neural network. Each activation function has its strengths and weaknesses, especially concerning the vanishing gradient problem. Sigmoid and tanh functions are prone to vanishing gradients, making them less suitable for deep networks. ReLU and its variants like Leaky ReLU and ELU mitigate this issue but come with their own trade-offs. Swish offers a smooth, non-monotonic alternative that performs well on deeper networks.

Understanding the weight update formula and how each activation function's derivative affects learning can help you make informed decisions and build more effective neural networks. Happy coding!
