Intuitive and Visual Explanation of the Differences between L1 and L2 Regularization

This post is light on math, but the ideas are still interesting.

L1 and L2 regularization are widely used methods to control model complexity and restrict over-fitting. There are some interesting comparisons between L1 and L2 regularization, and I find the visual comparisons and their explanations very easy to understand.

The theory and some of the images can be found in Trevor Hastie's classic book 'The Elements of Statistical Learning'.

1. Why do we need regularization?

To start, let's use linear regression as an example. Suppose there is an unclear relationship between Y and a bunch of other factors, Factor1, Factor2, Factor3, ... We don't know which of these factors actually influence Y, but we want to predict Y, so we decide to use linear regression to approximate it.
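
Written in its standard form, the fitted linear model is

\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Factor}_1 + \hat{\beta}_2 \cdot \text{Factor}_2 + \dots + \hat{\beta}_p \cdot \text{Factor}_p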

Beta hats are the estimated coefficients.

Unfortunately,

  1. The observed Y contains some random noise; it is not a perfect measurement of the true relationship.
  2. The more factors we use, the more complex the model becomes, and the better it fits the observed Ys, including their random noise part (smaller MSE; see the formula below, where 'n' is the number of observed Ys, a constant that does not affect our analysis. Thanks Matthew for pointing this out).
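
For reference, the mean squared error over the n observations is

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2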

What we really want is the connection between the real Ys and the factors, not the relationship between the random noise and the factors. We need to restrict the model's complexity to suppress its tendency to fit the random noise part. One way is to put more control on the values of the individual coefficients: if a coefficient is restricted to a small number, its corresponding factor has limited impact on the outcome Y, forcing the model to use other factors to explain the variance of Y.

But which factor should we depend upon, and which one should we suppress? What if we pick the wrong factor to suppress, and it later turns out to be the key one?

Instead of manipulating individual coefficients, we give a single quota shared by all of them, leaving the model the freedom and flexibility to pick up the most influential factors and to eliminate the least influential ones.

That is the intuition of regularization.

2. L1 vs L2

To apply regularization, that is, to define the quota that regulates the coefficients, we have two common choices. The first is L1, the Lasso, which minimizes the MSE subject to the constraint

\sum_{j} |\beta_j| \le t

The Lagrangian form of which is:

\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j} x_{ij} \beta_j \Big)^2 + \alpha \sum_{j} |\beta_j| \right\}

where alpha is the strength of the regularization. It is different from the quota t: a higher alpha puts a stronger restriction on the coefficients, corresponding to a lower quota t.

The second is L2, the Ridge method, which uses the constraint

\sum_{j} \beta_j^2 \le t

the Lagrangian form of which is:

\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j} x_{ij} \beta_j \Big)^2 + \alpha \sum_{j} \beta_j^2 \right\}

The difference between the L1 and L2 definitions is that L1 restricts the sum of the absolute values (first order) of the coefficients, while L2 restricts the sum of their squares (second order).

3. How L1 and L2 differ in restricting model behavior

While the quota on the coefficients shrinks, the model tends to allocate it to the factors that better explain the variance in Y. In this way, both L1 and L2 suppress some of the factors and depend on others to approximate Y.

The difference is that, while shrinking the quota, L1 tends to cut off some factors entirely by turning their coefficients to zero, while L2 tends to shrink these coefficients to tiny (but non-zero) values, keeping some of their influence on Y.

Let's see an example, derived from Analytics Vidhya with some modifications.

First we generate a bunch of data. The underlying relationship is y = sin(x), plus a random noise term. Let's see how linear regression approximates the true relation, with the help of L1 and L2 regularization.

3.1 Observed data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 10

# x ranges from 0 to roughly 7.2 radians, a bit more than one full sine period
x = np.array([1.4*i*np.pi/180 for i in range(0,300,4)])
np.random.seed(20)  # fix the random seed so the noise is reproducible

# y = sin(x) plus Gaussian noise with standard deviation 0.2
y = np.sin(x) + np.random.normal(0,0.2,len(x))
data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'.')

Figure 1: observed data (generated) for the sine function.

3.2 Linear regression results with different model complexity

After regressing y on the powers of x of order 0-1, 0-3, 0-6, 0-8, 0-11, and 0-14, the regression line fits the observed curve better and better. However, overfitting appears once the highest power exceeds 6: the linear regression model starts to fit the random noise and produces local distortions on the sine curve.

Figure 2: Linear regression with powers of x, from orders 0-1 up to 0-14. There is overfitting when the highest power exceeds 6.
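
Below is a minimal sketch of this experiment. It assumes scikit-learn's LinearRegression and the data DataFrame generated above; the helper function and the list of maximum powers are my illustrative choices, not necessarily the original code.

from sklearn.linear_model import LinearRegression

# add power terms x^2 ... x^14 as extra columns
for power in range(2, 15):
    data['x_%d' % power] = data['x'] ** power

def fit_linear_regression(data, max_power):
    # use x, x^2, ..., x^max_power as predictors
    predictors = ['x'] + ['x_%d' % p for p in range(2, max_power + 1)]
    model = LinearRegression()
    model.fit(data[predictors], data['y'])
    return model.predict(data[predictors])

# fits of increasing complexity, as in Figure 2
for i, max_power in enumerate([1, 3, 6, 8, 11, 14]):
    plt.subplot(2, 3, i + 1)
    plt.plot(data['x'], data['y'], '.')
    plt.plot(data['x'], fit_linear_regression(data, max_power))
    plt.title('powers up to %d' % max_power)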

3.3 Linear regression with ridge regularization (L2)

Let's try regularizing the linear regression with the L2 (Ridge) method.

Recall the Lagrangian form:

\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j} x_{ij} \beta_j \Big)^2 + \alpha \sum_{j} \beta_j^2 \right\}

The alpha term is the regularization strength. A larger alpha means a stronger restriction on the coefficients, with more of the factors' coefficients shrinking toward tiny values.

The base linear regression model, which includes the powers of x from the 0th up to the 14th order, shows local over-fitting. As alpha increases, the over-fitting is alleviated, showing that Ridge regularization works.

Figure 3: With Ridge regularization, the over-fitting is eased. However, with too much restriction (alpha > 0.01), linear regression can no longer capture the sine relation.
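
A sketch of the Ridge experiment, assuming scikit-learn's Ridge and the power-term columns created above; the alpha grid is illustrative.

from sklearn.linear_model import Ridge

def fit_ridge(data, alpha):
    predictors = ['x'] + ['x_%d' % p for p in range(2, 15)]  # powers up to 14
    model = Ridge(alpha=alpha)
    model.fit(data[predictors], data['y'])
    return model.predict(data[predictors])

# increase alpha and watch the local over-fitting fade, as in Figure 3
for i, alpha in enumerate([1e-15, 1e-5, 1e-3, 1e-2, 1, 5]):
    plt.subplot(2, 3, i + 1)
    plt.plot(data['x'], data['y'], '.')
    plt.plot(data['x'], fit_ridge(data, alpha))
    plt.title('alpha = %g' % alpha)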

3.4 Linear regression with Lasso regularization (L1)

The linear regression with Lasso regularization gives results similar to Ridge regularization. However, an interesting difference shows up when alpha >= 1 or >= 5. Recall that with strong L2 regularization (Ridge), the last subplot in Figure 3 still shows some complex behavior, indicating that multiple powers of x still have influence; their coefficients have shrunk to tiny, but non-zero, values.

With L1 regularization, however, it is not the same. In the last subplot of Figure 4, the regression line is a horizontal line, like a flat-lined heartbeat. The model is 'dead' because the strong L1 regularization has cut off all the factors, setting their coefficients to 0 and leaving only the constant term. That is why the horizontal line appears.

Figure 4: Linear regression with Lasso regularization. Note the difference between Figures 3 and 4, especially when alpha >= 1.
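
A sketch of the Lasso experiment, again assuming scikit-learn and the power-term columns from above; the alpha grid and the coefficient count in the subplot titles are my additions.

from sklearn.linear_model import Lasso

def fit_lasso(data, alpha):
    predictors = ['x'] + ['x_%d' % p for p in range(2, 15)]
    model = Lasso(alpha=alpha, max_iter=100000)  # extra iterations to help convergence
    model.fit(data[predictors], data['y'])
    return model.predict(data[predictors]), model.coef_

# with a large alpha every coefficient is driven to exactly zero,
# leaving only the intercept: the horizontal line in Figure 4
for i, alpha in enumerate([1e-15, 1e-5, 1e-3, 1e-2, 1, 5]):
    plt.subplot(2, 3, i + 1)
    y_pred, coef = fit_lasso(data, alpha)
    plt.plot(data['x'], data['y'], '.')
    plt.plot(data['x'], y_pred)
    plt.title('alpha = %g, non-zero coefs = %d' % (alpha, (coef != 0).sum()))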


4. Why are they different? How to explain it?

Figure 5: Illustration of finding the beta1 and beta2 that minimize the MSE under restrictions on beta1 and beta2.

I believe Figure 5 is the most important one for understanding the difference between L1 and L2 regularization. However, for beginners, it may take some effort to fully understand what Figure 5 says. Let me explain it piece by piece.

In Figure 5, to simplify without loss of generality, we limit the model to 2 coefficients, beta1 and beta2. Our goal is to find the minimal MSE under the restrictions on beta1 and beta2.

4.1 The shape of MSE

The MSE is a sum of squares (second order in the betas), so its contours form a set of ellipses.
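
To see why, write out the MSE for the two-coefficient case (intercept omitted for simplicity):

\text{MSE}(\beta_1, \beta_2) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta_1 x_{i1} - \beta_2 x_{i2} \right)^2

This is a quadratic function of (\beta_1, \beta_2), so its level sets \text{MSE}(\beta_1, \beta_2) = c are ellipses in the (beta1, beta2) plane.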

Figure 6: beta1 and beta2 are the variables; the observed factors and Ys determine the shape of the ellipses. Each ellipse is a contour of constant MSE.

At this point, the shape (not the size) of each ellipse is fixed by the observed factors (X) and Ys; in Figure 6, all ellipses share the same ratio of short axis to long axis. The size of an ellipse is controlled by beta1 and beta2: any point (beta1, beta2) on the same ellipse gives the same MSE, so these curves are iso-MSE lines, or contour lines. By adjusting beta1 and beta2 we move to a smaller or larger ellipse, in other words a smaller or larger MSE.

Note: the blue MSE ellipses in Figure 5 are basically the same as those in Figure 6, except that they have been shifted and rotated, which makes them more 'realistic'.

Figure 7: The shape of MSE.


4.2 The shape of L1 and L2 restrictions

Since L1 restricts the sum of the absolute values (first order) of the betas, its available area is a diamond-shaped square with straight edges. L2 restricts the sum of the squares (second order) of the betas, so its available area is a round disk.
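
In formulas, the two available areas are

L1: \{ (\beta_1, \beta_2) : |\beta_1| + |\beta_2| \le t \} \qquad\qquad L2: \{ (\beta_1, \beta_2) : \beta_1^2 + \beta_2^2 \le t \}

The first set is a diamond whose vertices lie on the axes; the second is a disk of radius \sqrt{t}.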

Figure 8: The shape of available area in L1 regularization and L2 regularization

4.3 Finding the smallest MSE under the restriction

With restrictions on the choices of beta1 and beta2, we need to find the smallest MSE among the allowed points. The smallest reachable MSE corresponds to the smallest ellipse (blue) that is tangent to the available area (red).

Figure 9: The minimum MSE is achieved by finding the smallest ellipse (blue) tangent to the available area (red).

4.4 Shrinking the restricted area and finding the tangent point for the MSE

So far we have fixed the quota t and looked for the (beta1, beta2) that gives the smallest MSE.

Now, let's shrink t and see how the corresponding (beta1, beta2) behaves. (The new minimal MSE may be bigger; that is fine. Remember, we need to fight over-fitting!)

Figure 10: Shrink t and check how the corresponding beta1 and beta2 behave.

As t decreases, both methods produce smaller beta1 and beta2. Here comes the difference between L1 and L2: the tangent point in the L1 case is more likely to land on an axis (so some coefficient is reduced to exactly zero), while the tangent point in the L2 case is more likely to sit at a non-axis point (all coefficients non-zero).

The math behind this comes down to slopes. Along each straight edge of the L1 diamond, the slope (first derivative) is a single constant value, except at the sharp vertices, where the derivative is not defined. The L2 boundary, by contrast, is a smooth circle: it has a well-defined first derivative at every point, and that slope varies continuously around the circle. Finding the tangent point means finding a point where the edge of the available area (red) and the MSE ellipse (blue) share the same slope. When we shrink t and construct a new available area, the L2 circle can always find a point on its boundary whose slope matches the ellipse; that point may lie on an axis (zero for some beta), but usually it does not. With L1, the slope along each straight edge is constant. If t is relatively large, the edge is long enough for the ellipse to find a matching slope somewhere on it, and we get a tangent point away from the axes. But once t is small enough, there is no point on the short edge where the slopes match. The only choice left is the vertex: since the slope is undefined there, it can 'match' any slope of the blue ellipse, and the tangency lands exactly on an axis.
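
More formally, the tangency can be written as a standard Lagrange condition (stated here for clarity, ignoring the vertex case, which needs subgradients): at the constrained optimum the gradient of the MSE is parallel to the gradient of the constraint,

\nabla \text{MSE}(\hat{\beta}) = -\lambda \, \nabla g(\hat{\beta}), \qquad \lambda \ge 0

where g(\beta) = |\beta_1| + |\beta_2| for L1 and g(\beta) = \beta_1^2 + \beta_2^2 for L2. For L2, \nabla g = (2\beta_1, 2\beta_2) exists everywhere and rotates smoothly around the circle, so a matching point can almost always be found off the axes. For L1, \nabla g is piecewise constant on the edges and undefined at the vertices, which is exactly why the optimum so often ends up at a vertex, on an axis.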

That is why, as the regularization quota shrinks, L1 tends to put the tangent point on an axis (at a vertex) and drops factors by setting their coefficients exactly to ZERO, while L2 tends to leave tiny but non-zero coefficients!
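
This behavior is easy to verify numerically. A small sketch (assuming scikit-learn and the power-term columns built in Section 3; the alpha grid is my choice) that counts the exact zeros among the fitted coefficients:

from sklearn.linear_model import Lasso, Ridge

predictors = ['x'] + ['x_%d' % p for p in range(2, 15)]

for alpha in [1e-4, 1e-2, 1, 5]:
    lasso = Lasso(alpha=alpha, max_iter=100000).fit(data[predictors], data['y'])
    ridge = Ridge(alpha=alpha).fit(data[predictors], data['y'])
    # Lasso drives coefficients to exactly zero; Ridge only shrinks them
    print('alpha=%g   lasso zero coefs: %d / %d   smallest |ridge coef|: %.2e'
          % (alpha, (lasso.coef_ == 0).sum(), len(predictors), np.abs(ridge.coef_).min()))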

5. End

PS I: this explanation is based on visuals and the non-Lagrangian (constrained) form. It is not the only explanation, but I think it is the most intuitive and perhaps the easiest way to explain the difference between the properties of L1 and L2.

PS II: pure L1 and pure L2 may not be very useful on their own nowadays. I've seen several great applications (including XGBoost) of the elastic net, which is a mix of L1 and L2.

PS III: the alpha in the Lagrangian form is not the quota t. Alpha is the strength of the regularization; a larger alpha corresponds to a smaller t, though they are not connected by a simple formula.

PS IV: if you are good at statistics and math, please check out the book 'The Elements of Statistical Learning'. It is the classic machine learning textbook written from the perspective of statistics and math. It is much more hard-core and theoretical than Andrew Ng's Machine Learning class, but you'll find gold mines and 'aha' moments while reading it.

I've written several blogs on theoretical derivations. Maybe next time I'll post a more practical one, with more coding. Reinforcement Learning could be a good topic; using RL to play games is awesome. Next time: reinforcement learning for game playing!

