Power of Regularization: Simplifying L1 and L2 Math for Everyone
Divesh Kubal
Senior Data Scientist at CrimsonAI | Expert in Generative AI, LLMs, & Deep Learning | Specializing in Model Optimization and Scalable ML Solutions | Passionate AI Blogger & Researcher
In the previous article, Understanding Vector Norms: A Comprehensive Guide to L1, L2, L∞, and Beyond..., we studied fundamental norms such as the L1, L2, and L∞ norms and the distinct purposes each serves, including penalizing coefficients in regularization techniques. This article bridges the gap between "Vector Norms" and how they are actually used in loss functions in L1 and L2 regularization strategies.
Parameter Norm Penalties
Linear models like linear and logistic regression offer simple yet effective regularization techniques. These methods typically work by adding a penalty term, Ω(θ), to the objective function, J. This penalty helps control the model's capacity, ensuring better performance and preventing overfitting.
The hyperparameter α (α ∈ [0, ∞)) balances the influence (contribution) of the norm penalty term, Ω, against the main objective (cost/error) function, J. If α is set to 0, there’s NO regularization. Increasing α adds more regularization, helping to control model complexity.
When our training algorithm minimizes the regularized objective function, J̃, it reduces both the original objective, J, on the training data and the size of the parameters, θ. The choice of the norm penalty (L1, L2, etc.), Ω, can lead to different preferred solutions.
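To make the role of α concrete, here is a minimal numpy sketch (the toy data and α values are hypothetical, not from the article) that solves L2-regularized linear regression in closed form and shows that increasing α shrinks the size of the weights:

```python
import numpy as np

# Toy regression data: y depends on two features with true weights (3, -2).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=50)

def ridge_weights(X, y, alpha):
    """Closed-form minimizer of (1/2)||Xw - y||^2 + (alpha/2)||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# alpha = 0 gives the unregularized solution; larger alpha shrinks ||w||.
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_weights(X, y, alpha)
    print(f"alpha={alpha:6.1f}  ||w||_2 = {np.linalg.norm(w):.3f}")
```

The printed norms decrease monotonically as α grows, matching the statement above: α = 0 means no regularization, and larger α pulls the parameters toward zero.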
Why penalize only the weights and not the biases?
In neural networks, we usually apply a parameter norm penalty, Ω, that targets only the weights of the affine transformations in each layer, leaving the biases unregularized. This is because biases generally require less data to fit accurately than weights. While each weight describes the interaction between two variables and needs diverse data to fit well, each bias controls only a single variable, so leaving the biases unregularized does not introduce too much variance. Regularizing the biases can even lead to underfitting.
Note:
In neural networks, the ideal situation is to apply a different penalty, with its own α coefficient, to each layer. However, tuning multiple hyperparameters can be costly in terms of time and resources. To simplify this, it’s often practical to use the same weight decay across all layers, reducing the complexity of the search space.
L2 Parameter Regularization
The L2 parameter norm penalty, often referred to as weight decay, is a regularization technique that pushes the weights closer to zero (the origin). It achieves this by adding the regularizing term Ω(w) = (1/2)‖w‖₂² = (1/2) wᵀw to the objective function.
Did you know? L2 regularization is known as Ridge regression or Tikhonov regularization.
Mathematical Equations of L2 Regularization
To better understand how weight decay regularization works, we can examine the gradient of the regularized objective function. For simplicity, we'll assume there is no bias parameter, meaning θ is just w.
Let’s recall the general form of the regularized objective:

J̃(θ; X, y) = J(θ; X, y) + α Ω(θ)

Here, we substitute the L2 penalty, Ω(w) = (1/2) wᵀw, for Ω. The regularized cost function now becomes:

J̃(w; X, y) = J(w; X, y) + (α/2) wᵀw

Let’s compute the gradient of the cost function:

∇_w J̃(w; X, y) = ∇_w J(w; X, y) + αw
To update the weights with a single gradient step (with learning rate ε), we use the following update equation:

w ← w − ε (αw + ∇_w J(w; X, y))

We can rewrite the above update as:

w ← (1 − εα) w − ε ∇_w J(w; X, y)
The addition of the weight decay term modifies the learning rule: on every step, the weight vector is first shrunk by the constant factor (1 − εα), and then the usual gradient update is applied. Both the shrinkage and the gradient update happen in a single step.
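The two forms of the update above are algebraically identical, which a few lines of numpy confirm (the weight, gradient, and hyperparameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])        # current weights
grad_J = np.array([0.3, -0.1, 0.2])   # hypothetical gradient of the unregularized loss J
eps, alpha = 0.1, 0.01                # learning rate and weight-decay strength

# Form 1: step along the gradient of the regularized objective, alpha*w + grad_J.
w_form1 = w - eps * (alpha * w + grad_J)

# Form 2: shrink the weights by the constant factor (1 - eps*alpha),
# then apply the usual (unregularized) gradient update.
w_form2 = (1 - eps * alpha) * w - eps * grad_J

print(np.allclose(w_form1, w_form2))  # the two forms agree
```

This is exactly why the technique is called "weight decay": the factor (1 − εα) multiplicatively decays every weight on every step, independent of the data gradient.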
L1 Regularization
L1 regularization for the model parameter w is defined as follows:

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|

that is, as the sum of the absolute values of the individual parameters.
Mathematical Equations of L1 Regularization:
Let’s explore the impact of L1 regularization on the simple linear regression model we previously analyzed with L2 regularization, omitting the bias parameter. We will focus on the key differences between L1 and L2 regularization. Like L2 weight decay, L1 regularization also adjusts its strength using a positive hyperparameter α to scale the penalty term, Ω.
The regularized cost function can be written as:

J̃(w; X, y) = J(w; X, y) + α ‖w‖₁

Let’s compute the gradient of the cost function:

∇_w J̃(w; X, y) = ∇_w J(w; X, y) + α sign(w)

where sign(w) is simply the sign of w applied element-wise.
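A small numpy sketch makes the contrast with L2 visible (the weight vector and α below are hypothetical values chosen for illustration): the L2 term scales with each weight, while the L1 term has the same magnitude α for every nonzero weight, large or small.

```python
import numpy as np

alpha = 0.1
w = np.array([0.5, -2.0, 0.0, 3.0])

# L2's gradient contribution scales linearly with the weight itself.
l2_term = alpha * w               # -> [0.05, -0.2, 0.0, 0.3]

# L1's gradient contribution has constant magnitude alpha,
# with only its direction set by sign(w).
l1_term = alpha * np.sign(w)      # -> [0.1, -0.1, 0.0, 0.1]

print("L2 term:", l2_term)
print("L1 term:", l1_term)
```

Because the L1 pull toward zero does not fade as a weight gets small, small weights can be driven all the way to exactly zero, which is the source of the sparsity discussed next.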
How Is L1 Regularization Used for Feature Selection?
L1 regularization has a distinctly different effect from L2 regularization. In L2, the regularization contribution to the gradient scales linearly with each weight wᵢ; in L1, it is a constant-magnitude term whose direction depends only on the sign of wᵢ. This leads to sparser solutions, meaning some parameters can be optimally set to exactly zero. This sparsity is a key difference from L2 regularization.
The sparsity introduced by L1 regularization is valuable for feature selection, as it simplifies machine learning tasks by identifying which features to keep. The well-known LASSO (Least Absolute Shrinkage and Selection Operator) model combines an L1 penalty with a linear model and least squares cost function. As a result, some weights are driven to zero, indicating that those features can be safely ignored (unimportant weights/features).
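To see LASSO-style sparsity emerge, here is a dependency-free numpy sketch using proximal gradient descent (ISTA, one standard solver for the L1-penalized least-squares objective; the data, α, learning rate, and the `lasso_ista` helper are all illustrative choices, not the article's code). Only two of the five features actually influence y, and the fit drives the other three weights to exactly zero:

```python
import numpy as np

# Toy data: 5 features, but only features 0 and 2 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_w + 0.05 * rng.normal(size=100)

def lasso_ista(X, y, alpha, lr=0.001, n_steps=5000):
    """Minimize 0.5*||Xw - y||^2 + alpha*||w||_1 by iterative soft-thresholding."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)                # gradient of the smooth least-squares part
        w = w - lr * grad                       # ordinary gradient step
        # Soft-threshold: weights whose magnitude falls below lr*alpha snap to exactly 0.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)
    return w

w_hat = lasso_ista(X, y, alpha=10.0)
print(np.round(w_hat, 3))  # the three irrelevant weights come out exactly zero
```

Reading off which entries of w_hat are zero is the feature selection: those columns of X can be dropped. In practice one would typically use a library implementation such as scikit-learn's Lasso, which solves the same objective with coordinate descent.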
Conclusion
In conclusion, parameter norm penalties like L1 and L2 regularization are essential methods in linear models and neural networks for preventing overfitting and enhancing model performance. L2 regularization, or weight decay, works by shrinking weights towards zero, while L1 regularization promotes sparsity, leading to simpler models that can effectively perform feature selection. While applying unique penalty values for each layer in neural networks is ideal, using a consistent weight decay across layers often streamlines the training process. Understanding these regularization techniques allows for more robust and efficient machine learning models.
In the next article, we’ll explore other regularization strategies. Stay tuned!