Power of Regularization: Simplifying L1 and L2 Math for Everyone
Divesh Kubal
Senior Data Scientist at CrimsonAI | Expert in Generative AI, LLMs, & Deep Learning | Specializing in Model Optimization and Scalable ML Solutions | Passionate AI Blogger & Researcher
In the previous article, Understanding Vector Norms: A Comprehensive Guide to L1, L2, L∞, and Beyond..., we studied fundamental norms such as the L1, L2, and L∞ norms and the distinct purposes each serves, including penalizing coefficients in regularization techniques. This article bridges the gap between "Vector Norms" and how they are actually used in loss functions in L1 and L2 regularization strategies.
Parameter Norm Penalties
Linear models like linear and logistic regression offer simple yet effective regularization techniques. These methods typically work by adding a penalty term, Ω(θ), to the objective function, J. This penalty helps control the model's capacity, ensuring better performance and preventing overfitting.
The hyperparameter α (α ∈ [0, ∞)) balances the influence (contribution) of the norm penalty term, Ω, against the main objective (cost/error) function, J. If α is set to 0, there’s NO regularization. Increasing α adds more regularization, helping to control model complexity.
When our training algorithm minimizes the regularized objective function, J̃, it reduces both the original objective, J, on the training data and the size of the parameters, θ. The choice of the norm penalty (L1, L2, etc.), Ω, can lead to different preferred solutions.
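To make the role of α concrete, here is a minimal numpy sketch (the toy data and α values are hypothetical, not from the article) that solves L2-regularized linear regression in closed form and shows that increasing α shrinks the size of the weights:

```python
import numpy as np

# Toy regression data: y depends on two features with true weights (3, -2).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=50)

def ridge_weights(X, y, alpha):
    """Closed-form minimizer of (1/2)||Xw - y||^2 + (alpha/2)||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# alpha = 0 gives the unregularized solution; larger alpha shrinks ||w||.
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_weights(X, y, alpha)
    print(f"alpha={alpha:6.1f}  ||w||_2 = {np.linalg.norm(w):.3f}")
```

The printed norms decrease monotonically as α grows, matching the statement above: α = 0 means no regularization, and larger α pulls the parameters toward zero.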
Why penalize only the weights and not the biases?
In neural networks, we usually apply a parameter norm penalty, Ω, that targets only the weights of the affine transformations in each layer, leaving the biases unregularized. This is because biases generally require less data to fit accurately than weights. While each weight describes the interaction between two variables and needs diverse data to fit well, each bias controls only a single variable, so leaving the biases unregularized does not introduce too much variance. Regularizing the biases can even lead to underfitting.
Note:
In neural networks, the ideal situation is to apply a different penalty, with its own α coefficient, to each layer. However, tuning multiple hyperparameters can be costly in terms of time and resources. To simplify this, it’s often practical to use the same weight decay across all layers, reducing the complexity of the search space.
L2 Parameter Regularization
The L2 parameter norm penalty, often referred to as weight decay, is a regularization technique that pushes the weights closer to zero (the origin). It achieves this by adding the regularizing term Ω(w) = (1/2)‖w‖₂² = (1/2) wᵀw to the objective function.
Did you know? L2 regularization is known as Ridge regression or Tikhonov regularization.
Mathematical Equations of L2 Regularization
To better understand how weight decay regularization works, we can examine the gradient of the regularized objective function. For simplicity, we'll assume there is no bias parameter, meaning θ is just w.
Let’s recall the general form of the regularized objective:

J̃(θ; X, y) = J(θ; X, y) + α Ω(θ)

Here, we substitute the L2 penalty, Ω(w) = (1/2) wᵀw, for Ω. The regularized cost function now becomes:

J̃(w; X, y) = J(w; X, y) + (α/2) wᵀw

Let’s compute the gradient of the cost function:

∇_w J̃(w; X, y) = ∇_w J(w; X, y) + αw
To update the weights with a single gradient step (with learning rate ε), we use the following update equation:

w ← w − ε (αw + ∇_w J(w; X, y))

We can rewrite the above update as:

w ← (1 − εα) w − ε ∇_w J(w; X, y)
The addition of the weight decay term modifies the learning rule: on every step, the weight vector is first shrunk by the constant factor (1 − εα), and then the usual gradient update is applied. Both the shrinkage and the gradient update happen in a single step.
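The two forms of the update above are algebraically identical, which a few lines of numpy confirm (the weight, gradient, and hyperparameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])        # current weights
grad_J = np.array([0.3, -0.1, 0.2])   # hypothetical gradient of the unregularized loss J
eps, alpha = 0.1, 0.01                # learning rate and weight-decay strength

# Form 1: step along the gradient of the regularized objective, alpha*w + grad_J.
w_form1 = w - eps * (alpha * w + grad_J)

# Form 2: shrink the weights by the constant factor (1 - eps*alpha),
# then apply the usual (unregularized) gradient update.
w_form2 = (1 - eps * alpha) * w - eps * grad_J

print(np.allclose(w_form1, w_form2))  # the two forms agree
```

This is exactly why the technique is called "weight decay": the factor (1 − εα) multiplicatively decays every weight on every step, independent of the data gradient.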
L1 Regularization
L1 regularization for the model parameter w is defined as follows:

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|

that is, as the sum of the absolute values of the individual parameters.
Mathematical Equations of L1 Regularization:
Let’s explore the impact of L1 regularization on the simple linear regression model we previously analyzed with L2 regularization, omitting the bias parameter. We will focus on the key differences between L1 and L2 regularization. Like L2 weight decay, L1 regularization also adjusts its strength using a positive hyperparameter α to scale the penalty term, Ω.
The regularized cost function can be written as:

J̃(w; X, y) = J(w; X, y) + α ‖w‖₁

Let’s compute the gradient of the cost function:

∇_w J̃(w; X, y) = ∇_w J(w; X, y) + α sign(w)

where sign(w) is simply the sign of w applied element-wise.
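A small numpy sketch makes the contrast with L2 visible (the weight vector and α below are hypothetical values chosen for illustration): the L2 term scales with each weight, while the L1 term has the same magnitude α for every nonzero weight, large or small.

```python
import numpy as np

alpha = 0.1
w = np.array([0.5, -2.0, 0.0, 3.0])

# L2's gradient contribution scales linearly with the weight itself.
l2_term = alpha * w               # -> [0.05, -0.2, 0.0, 0.3]

# L1's gradient contribution has constant magnitude alpha,
# with only its direction set by sign(w).
l1_term = alpha * np.sign(w)      # -> [0.1, -0.1, 0.0, 0.1]

print("L2 term:", l2_term)
print("L1 term:", l1_term)
```

Because the L1 pull toward zero does not fade as a weight gets small, small weights can be driven all the way to exactly zero, which is the source of the sparsity discussed next.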
How Is L1 Regularization Used for Feature Selection?
L1 regularization has a distinctly different effect from L2 regularization. In L2, the regularization contribution to the gradient scales linearly with each weight wᵢ; in L1, it is a constant-magnitude term whose direction depends only on the sign of wᵢ. This leads to sparser solutions, meaning some parameters can be optimally set to exactly zero. This sparsity is a key difference from L2 regularization.
The sparsity introduced by L1 regularization is valuable for feature selection, as it simplifies machine learning tasks by identifying which features to keep. The well-known LASSO (Least Absolute Shrinkage and Selection Operator) model combines an L1 penalty with a linear model and least squares cost function. As a result, some weights are driven to zero, indicating that those features can be safely ignored (unimportant weights/features).
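To see LASSO-style sparsity emerge, here is a dependency-free numpy sketch using proximal gradient descent (ISTA, one standard solver for the L1-penalized least-squares objective; the data, α, learning rate, and the `lasso_ista` helper are all illustrative choices, not the article's code). Only two of the five features actually influence y, and the fit drives the other three weights to exactly zero:

```python
import numpy as np

# Toy data: 5 features, but only features 0 and 2 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_w + 0.05 * rng.normal(size=100)

def lasso_ista(X, y, alpha, lr=0.001, n_steps=5000):
    """Minimize 0.5*||Xw - y||^2 + alpha*||w||_1 by iterative soft-thresholding."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)                # gradient of the smooth least-squares part
        w = w - lr * grad                       # ordinary gradient step
        # Soft-threshold: weights whose magnitude falls below lr*alpha snap to exactly 0.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)
    return w

w_hat = lasso_ista(X, y, alpha=10.0)
print(np.round(w_hat, 3))  # the three irrelevant weights come out exactly zero
```

Reading off which entries of w_hat are zero is the feature selection: those columns of X can be dropped. In practice one would typically use a library implementation such as scikit-learn's Lasso, which solves the same objective with coordinate descent.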
Conclusion
In conclusion, parameter norm penalties like L1 and L2 regularization are essential methods in linear models and neural networks for preventing overfitting and enhancing model performance. L2 regularization, or weight decay, works by shrinking weights towards zero, while L1 regularization promotes sparsity, leading to simpler models that can effectively perform feature selection. While applying unique penalty values for each layer in neural networks is ideal, using a consistent weight decay across layers often streamlines the training process. Understanding these regularization techniques allows for more robust and efficient machine learning models.
In the next article, we’ll explore other regularization strategies. Stay tuned!