An Intuitive and Visual Explanation of the Differences Between L1 and L2 Regularization
The math in this blog is quite light, but the topic is still interesting.
L1 and L2 regularization are widely used methods to control model complexity and reduce over-fitting. There are some interesting comparisons between the two, and I find the visual comparisons and their explanations very easy to understand.
The theory and some of the images can be found in Trevor Hastie's classic book 'The Elements of Statistical Learning'.
1. Why do we need regularization?
To start, let's use linear regression as an example. Suppose there is an unclear relationship between Y and a bunch of other factors, Factor1, Factor2, Factor3, ... We don't know which of these factors actually influence Y, but we want to predict Y, so we decide to use linear regression to approximate it.
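In symbols, the model takes roughly the standard linear regression form (writing the factors as the predictors):

\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1\,\mathrm{Factor}_1 + \hat{\beta}_2\,\mathrm{Factor}_2 + \dots + \hat{\beta}_p\,\mathrm{Factor}_p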
The beta hats are the estimated coefficients.
Unfortunately:
- The observed Y contains some random noise; it is not a perfect measurement.
- For the observed Ys, the more factors we use, the more complex the model becomes, and the better it fits the observations, i.e. the smaller the MSE (written out just below; the 'n' there is the number of observed Ys, a constant that doesn't affect our analysis. Thanks Matthew for pointing this out). A complex model will approximate the observed Ys closely, including their random noise part.
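The MSE here is the usual mean squared error over the n observations:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2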
What we really want is the connection between the real Y and the factors, not a relationship between the random noise and the factors. We need to restrict the model complexity to suppress the model's tendency to fit the random noise. One way is to put more control on the values of individual coefficients: if a coefficient is restricted to a small number, its corresponding factor has limited impact on the outcome (Y), forcing the model to use other factors to explain the variance of Y.
But which factor should we depend on, and which should we suppress? What if we suppress the wrong factor, one that later turns out to be the key one?
So we choose not to manipulate individual coefficients; instead, we give a quota to all the coefficients together, leaving the model the freedom and flexibility to keep the most influential factors and eliminate the least influential ones.
That is the intuition of regularization.
2. L1 vs L2
To apply regularization, or in other words, to define the quota that regulates the coefficients, we have L1, the Lasso:
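In its constrained ('quota') form, Lasso looks for the coefficients that minimize the MSE subject to a budget t on their total absolute size:

\min_{\beta}\ \mathrm{MSE}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t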
The Lagrangian form of which is:
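\min_{\beta}\ \mathrm{MSE}(\beta) + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert

(here MSE(\beta) is the training mean squared error and \alpha \ge 0 is the penalty strength)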
where alpha is the strength of the regularization. It is different from the quota t: a higher alpha puts a stronger restriction on the coefficients, corresponding to a lower quota t.
and L2, the Ridge method:
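In constrained form, Ridge instead puts the budget on the sum of squared coefficients:

\min_{\beta}\ \mathrm{MSE}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^{2} \le t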
the Lagrangian form of which is:
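\min_{\beta}\ \mathrm{MSE}(\beta) + \alpha \sum_{j=1}^{p} \beta_j^{2}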
The difference in the definitions of L1 and L2 is that L1 restricts the sum of the absolute values (first power) of the coefficients, while L2 restricts the sum of their squares (second power).
3. Differences between L1 and L2 in restricting model behavior
While shrinking the quota on the coefficients, the model tends to allocate the quota to the factors that best explain the variance in Y. In this way both L1 and L2 suppress some factors and depend on the others to approximate Y.
The difference is that, while shrinking the quota, L1 tends to cut off some factors entirely by turning their coefficients to zero, while L2 tends to shrink those coefficients to tiny (but non-zero) values, keeping some of their influence on Y.
Let's see an example, derived from Analytics Vidhya with some modifications.
First we generate some data. The underlying relationship is y = sin(x), plus a random noise term. Let's see how linear regression approximates the true relation, with the help of L1 and L2 regularization.
3.1 Observed data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 10

# x runs from 0 to roughly 7.2 radians, slightly more than one full sine cycle
x = np.array([1.4*i*np.pi/180 for i in range(0, 300, 4)])
np.random.seed(20)  # fix the seed so the noise is reproducible
y = np.sin(x) + np.random.normal(0, 0.2, len(x))  # true relation y = sin(x) plus Gaussian noise
data = pd.DataFrame(np.column_stack([x, y]), columns=['x', 'y'])
plt.plot(data['x'], data['y'], '.')
Figure 1: observed data (generated) for the sine function.
3.2 Linear regression results with different model complexity
After regressing y on x's powers 0-1, 0-3, 0-6, 0-8, 0-11, and 0-14, the regression curve fits the observed data better and better. However, over-fitting appears once the highest power exceeds 6: the linear regression model starts fitting the random noise and produces local distortions of the sine curve.
Figure 2: Linear regression on x's power terms, from powers 0-1 up to powers 0-14. Over-fitting appears when the highest power exceeds 6.
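To get a feel for how such fits are produced, here is a minimal sketch in the spirit of the Analytics Vidhya code (the power-feature columns x_2 ... x_14 and the helper fit_polynomial are my own naming, not taken from the original post, and all fits are drawn on one chart instead of subplots):

from sklearn.linear_model import LinearRegression

# add power features x^2 ... x^14 as extra columns of the data frame
for power in range(2, 15):
    data['x_%d' % power] = data['x'] ** power

def fit_polynomial(data, max_power):
    # regress y on [x, x^2, ..., x^max_power] and return the fitted curve
    predictors = ['x'] + ['x_%d' % p for p in range(2, max_power + 1)]
    model = LinearRegression()
    model.fit(data[predictors], data['y'])
    return model.predict(data[predictors])

# increasing model complexity: highest power 1, 3, 6, 8, 11, 14
for max_power in [1, 3, 6, 8, 11, 14]:
    plt.plot(data['x'], fit_polynomial(data, max_power), label='up to x^%d' % max_power)
plt.plot(data['x'], data['y'], '.', label='observed')
plt.legend()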
3.3 Linear regression with Ridge regularization (L2)
Let's try regularizing the linear regression with the L2 (Ridge) method.
Recall the Lagrangian form of Ridge from Section 2: the MSE plus alpha times the sum of the squared coefficients.
The alpha term is the regularization strength. A larger alpha means a stronger restriction on the coefficients, with more factors' coefficients shrinking toward tiny values.
The base linear regression model, built from x's power terms from the 0th (constant) up to the 14th power, shows local over-fitting. As alpha increases, the over-fitting is alleviated, showing that Ridge regularization works.
Figure 3: With Ridge regularization, the over-fitting is eased. However, with too much restriction (alpha > 0.01), the linear regression can no longer capture the sine relation.
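A minimal sketch of the Ridge experiment, assuming the power-feature columns x_2 ... x_14 built earlier; the alpha grid is only illustrative and all fits are drawn on one chart:

from sklearn.linear_model import Ridge

predictors = ['x'] + ['x_%d' % p for p in range(2, 15)]
for alpha in [1e-15, 1e-5, 1e-3, 1e-2, 1, 5]:
    ridge = Ridge(alpha=alpha)    # alpha = L2 regularization strength
    ridge.fit(data[predictors], data['y'])
    plt.plot(data['x'], ridge.predict(data[predictors]), label='alpha=%g' % alpha)
plt.plot(data['x'], data['y'], '.', label='observed')
plt.legend()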
3.4 Linear regression with Lasso regularization (L1)
Linear regression with Lasso regularization gives results similar to Ridge. However, an interesting difference shows up when alpha reaches 1 or 5. Recall that with strong L2 regularization (Ridge), the last subplot in Figure 3 still shows some complex structure, indicating that multiple of x's power terms still have influence: their coefficients have shrunk to tiny but non-zero values.
With L1 regularization, however, it is not the same. In the last subplot of Figure 4, the regression line is a horizontal line, like the flatline of a heart monitor. It is 'dead' because the strong L1 regularization has cut off all the factors, setting their coefficients to 0 and leaving only the constant term. That is why the horizontal line appears.
Figure 4: Linear regression with Lasso regularization. Note the difference between Figures 3 and 4, especially when alpha >= 1.
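The Lasso experiment is the same sketch with Ridge swapped for Lasso (again assuming the power-feature columns from before; max_iter is raised only to help the solver converge on these unscaled features):

from sklearn.linear_model import Lasso

predictors = ['x'] + ['x_%d' % p for p in range(2, 15)]
for alpha in [1e-15, 1e-5, 1e-3, 1e-2, 1, 5]:
    lasso = Lasso(alpha=alpha, max_iter=100000)    # alpha = L1 regularization strength
    lasso.fit(data[predictors], data['y'])
    plt.plot(data['x'], lasso.predict(data[predictors]), label='alpha=%g' % alpha)
plt.plot(data['x'], data['y'], '.', label='observed')
plt.legend()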
4. Why are they different? How to explain it?
Figure 5: Illustration of finding the beta1 and beta2 that minimize the MSE under restrictions on beta1 and beta2.
I believe Figure 5 is the most important figure for understanding the difference between L1 and L2 regularization. However, for beginners it may take some effort to fully understand what it says, so let me explain it step by step.
In Figure 5, to simplify without loss of generality, we limit the model to two coefficients, beta1 and beta2. Our goal is to find the minimal MSE under the restrictions on beta1 and beta2.
4.1 The shape of MSE
The MSE is a sum of squares, a quadratic (second order) function of the coefficients, so its contour lines form a set of ellipses.
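For our two-coefficient example the MSE, written out (ignoring the intercept for simplicity), is

\mathrm{MSE}(\beta_1, \beta_2) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \beta_1 x_{i1} - \beta_2 x_{i2}\right)^2

which is a quadratic function of \beta_1 and \beta_2, and the level sets of such a quadratic are ellipses.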
Figure 6: beta1 and beta2 are the variables; the observed factors and Ys determine the shape of the ellipses. Each ellipse is a contour of constant MSE.
At this point the shape (not the size) of the ellipse is fixed by the observed factors (X) and Ys; in Figure 6 all the ellipses have the same ratio of short axis to long axis. The size of the ellipse is not fixed yet: it is controlled by beta1 and beta2. Any point (beta1, beta2) on the same ellipse gives the same MSE, so these curves are iso-MSE lines, or contour lines. By adjusting beta1 and beta2, we move to a smaller or larger ellipse, in other words a smaller or larger MSE.
Note: the blue MSE ellipses in Figure 5 are basically the same as those in Figure 6, except that they have been shifted and rotated, which makes them more 'realistic'.
Figure 7: The shape of MSE.
4.2 The shape of L1 and L2 restrictions
Since L1 restricts the sum of the absolute values (first power) of the betas, its available area is a diamond, a square rotated 45 degrees with straight edges. Since L2 restricts the sum of the squares (second power) of the betas, its available area is a circular disk.
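In the two-coefficient case the two available areas are

L1 (Lasso): \lvert\beta_1\rvert + \lvert\beta_2\rvert \le t, a diamond with corners on the axes;
L2 (Ridge): \beta_1^{2} + \beta_2^{2} \le t, a disk of radius \sqrt{t}.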
Figure 8: The shape of the available area under L1 regularization and L2 regularization.
4.3 Finding the smallest MSE under the restriction
With the restrictions on the choices of beta1 and beta2, we need to find the smallest MSE among the allowed points. The smallest reachable MSE corresponds to the smallest ellipse (blue) that is tangent to the available area (red).
Figure 9: The minimum MSE is achieved by finding the smallest ellipse (blue) tangent to the available area (red).
4.4 Shrinking the restricted area and finding the tangent point for the MSE
Just now we fixed the quota t and looked for the corresponding (beta1, beta2) that gives the smallest MSE.
Now let's shrink t and see how the corresponding (beta1, beta2) behave. (The new MSE may be larger; that is fine. Remember, we are fighting over-fitting!)
Figure 10: Shrink t, and check how the corresponding beta1 and beta2 behave.
As t decreases, both methods produce smaller beta1 and beta2. Here comes the difference between L1 and L2: the tangent point in the L1 case is more likely to land on an axis (so some parameter is reduced to exactly zero), while the tangent point in the L2 case is more likely to land off the axes (so the parameters stay non-zero).
The math behind this may come down to the first-order derivative. On each straight edge of the L1 diamond, the slope is a constant, and at the sharp vertices the derivative is not even defined. The L2 boundary, in contrast, is a smooth circle, so a well-defined slope exists at every point of the red circle. Finding the tangent point means finding a point where the edge of the available area (red) and the target ellipse (blue) share the same slope (first-order derivative).

When we shrink t and construct a new, smaller available area (red), the L2 circle can always find some point on its boundary that satisfies this requirement; that point may lie on an axis (zero for some beta) or off the axes (non-zero for every beta). For the L1 method, the slope along each straight edge of the red diamond is constant. If t is relatively large, the edge spans a wide range of the contour lines, so there is a good chance of finding a point with the matching slope, and we get a happy ending. But if t is small enough, there is no chance for the L1 edge to match the slope within such a tiny range. The only choice left is the vertex, where the slope is undefined, so it can play the role of any slope and touch the blue ellipse there!
That is why, as the regularization tightens, L1 tends to place its tangent point on an axis (at a vertex) and drops factors by setting their coefficients to exactly ZERO, while L2 tends to end up with tiny but non-zero coefficients!
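To see the zero-versus-tiny contrast directly, one can print the fitted coefficients; this is a rough sketch reusing the power features from Section 3, with alpha = 5 chosen only for illustration:

from sklearn.linear_model import Ridge, Lasso

predictors = ['x'] + ['x_%d' % p for p in range(2, 15)]
ridge = Ridge(alpha=5).fit(data[predictors], data['y'])
lasso = Lasso(alpha=5, max_iter=100000).fit(data[predictors], data['y'])

print('Ridge coefficients:', ridge.coef_)   # expect tiny but mostly non-zero values
print('Lasso coefficients:', lasso.coef_)   # expect many exact zeros
print('Lasso set %d of %d coefficients to zero' % ((lasso.coef_ == 0).sum(), len(lasso.coef_)))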
5. End
PS I: This explanation is visual and based on the non-Lagrangian (constrained) form. It is not the only possible explanation, but I think it is the most intuitive and (maybe) the easiest way to explain the difference between the properties of L1 and L2.
PS II: Pure L1 and pure L2 may not be the most common choices now. I've read about several great applications (including XGBoost) of the elastic net, which is a mix of L1 and L2.
PS III: The alpha in the Lagrangian form is not the quota 't'. Alpha is the strength of the regularization: a larger alpha corresponds to a smaller 't', though the two are not strictly tied together.
PS IV: If you are good at statistics and math, please check out the book 'The Elements of Statistical Learning'. It is the classic machine learning theory textbook from the perspective of statistics and math, much more hard-core and theoretical than Andrew Ng's Machine Learning class. However, you'll find gold mines and 'aha' moments while reading it.
I've written several blogs on theoretical derivations. Maybe next time I'll post a more practical one, with more coding. Reinforcement learning could be a good topic: using RL to play games is awesome. Good! Next time, reinforcement learning on game playing!