L1, L2 Regularization – Why needed/What it does/How it helps?

Simple is better! That’s the whole notion behind regularization.

I recently wrote about Linear Regression and the Bias-Variance Tradeoff, so if those topics are not clear to you, I'd suggest you first visit those posts. In this post (also on my blog), I will focus on the concept of regularization, specifically in the context of linear regression. I must make it clear, though, that the concept is not limited to regression and can be applied to any learning algorithm. In the posts on linear regression and the bias-variance tradeoff, I wrote about the two problems associated with a learning algorithm – under-fitting and over-fitting. In real life, over-fitting is the problem that statisticians and data scientists/analysts mainly have to tackle. In the bias-variance tradeoff post, I described how the "error" can be decomposed into reducible error (which can be reduced further) and irreducible error (which cannot). The reducible error can be further broken down into "error due to squared bias" and "error due to variance". I also wrote about the assumptions that linear regression (read: ordinary least squares) makes and the good practices to follow before fitting a linear regression.

Fig 1. Linear Regression (OLS) equation and solution

Why Regularization?

In general, if the relationship between the response (Y) and the predictors (X) is approximately linear, the least squares estimates will have low bias. If the number of observations (N) is much larger than the number of predictors (p), the least squares estimates also tend to have low variance, and hence will perform really well on the test set, with low error and high R2. On the other hand, if N is not much larger than p, there can be a lot of variability in the least squares fit, resulting in over-fitting and therefore in poor predictions. And if p > N, there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all. That is an important limitation of OLS. When OLS was developed, statisticians dealt with a small number of predictors, whereas today having thousands of predictors is not a big deal. Of course, not all of these predictors are important to the model – only a handful of them will have real predictive power. One way to deal with this is to shrink (or remove) the coefficients of predictors that are not so important – this is called shrinkage. By shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible (depending on the regularization parameter value) increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.
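To make this concrete, here is a minimal sketch (not from the original post) using synthetic data and scikit-learn: only a handful of the predictors actually matter, N is barely larger than the number of predictors, and plain OLS is compared against ridge regression (introduced later in this post) with an arbitrarily chosen regularization strength.

```python
# Minimal sketch: OLS vs. ridge when N is not much larger than p.
# All data is synthetic; only 5 of the 80 predictors actually matter.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
N, p = 100, 80                        # N is not much larger than p
X = rng.randn(N, p)
true_coef = np.zeros(p)
true_coef[:5] = [3, -2, 1.5, 4, -1]   # only a handful of predictors matter
y = X @ true_coef + rng.randn(N)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)   # alpha plays the role of lambda

print("OLS   test R2:", round(ols.score(X_te, y_te), 3))
print("Ridge test R2:", round(ridge.score(X_te, y_te), 3))
```

On data like this, the unregularized fit typically scores much worse on the held-out set than the shrunk one – exactly the variance reduction described above.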

Ok, What is over-fitting?

What is this problem of over-fitting and why does it arise? Let's take a sample problem – assume that you want to predict the height of a person based on his/her age using a linear regression model. Your response variable (Y) is height and the predictor, or independent variable (X), is age. How do you think the model will perform? Not so well, right? Well, it's too simple.

Next – you have additional variables that you can add to the model – weight, sex, location. What you did here is add complexity to your model, and it might have increased its prediction accuracy. Now you add even more variables – height of parents, profession of parents, social background, number of children, number of books, preferred color, best meal, last holiday destination and so on and so forth. That's too many variables – and most of them cannot explain someone's height. Your model might do well on the training data, but it is probably over-fitting, i.e. it will probably have poor prediction and generalization power: it sticks too much to the training data and has probably learned the background noise while being fit. When tried on unseen data, this model will perform poorly.
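As a rough illustration (again with made-up, synthetic data and scikit-learn), the sketch below adds dozens of irrelevant columns to a simple "height" model: the training fit improves, but performance on unseen data gets worse.

```python
# Rough sketch: adding many irrelevant predictors inflates the training fit
# but hurts performance on unseen data. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
n = 60
age = rng.uniform(5, 18, n)
weight = rng.uniform(20, 80, n)
height = 80 + 5 * age + 0.3 * weight + rng.randn(n) * 5  # true signal + noise
junk = rng.randn(n, 50)                                   # 50 irrelevant columns

X_small = np.column_stack([age, weight])
X_big = np.column_stack([age, weight, junk])

for name, X in [("2 relevant predictors", X_small), ("+50 irrelevant", X_big)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, height, random_state=1)
    m = LinearRegression().fit(X_tr, y_tr)
    print(name, "train R2:", round(m.score(X_tr, y_tr), 3),
          "test R2:", round(m.score(X_te, y_te), 3))
```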

How to solve this Over-fitting Problem?

It is here that the regularization technique comes in handy. There are two main techniques used with linear regression: L1 (Lasso) and L2 (Ridge) regularization. In general, a regularization term is introduced into the loss/cost function.

What’s a cost function?

Whenever a model is trained on training data and used to predict values on a test set, there exists a difference between the true and predicted values. The closer the predicted values are to their corresponding real values, the better the model. A cost function measures exactly this: how close the predicted values are to their corresponding real values. The function can be minimized or maximized, depending on the situation/problem. For example, in the case of ordinary least squares (OLS), the cost function would be:

Fig 2. Cost function for OLS

where J denotes the cost function, m is the number of observations in the dataset, h(x) is the predicted value of the response and y is the true value of the response. In the case of OLS, the goal is to minimize this cost. There are two ways to solve OLS – one using the closed-form solution, which requires matrix inversion, multiplication, etc. (shown in Fig 3), and the other using gradient descent on the cost function (Fig 4).

Fig 3. Closed form solution

Fig 4. Cost Function convergence and gradient descent
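The following is a minimal NumPy sketch (not from the original post) of both routes, assuming the usual squared-error cost J(θ) = (1/2m) Σ(h(x) − y)^2; the data, learning rate and iteration count are made up for illustration.

```python
# Sketch of the two ways to solve OLS described above: the closed-form
# (normal equation) solution and gradient descent on the cost J(theta).
import numpy as np

rng = np.random.RandomState(2)
m = 200
X = np.column_stack([np.ones(m), rng.randn(m, 2)])   # intercept + 2 predictors
true_theta = np.array([1.0, 3.0, -2.0])
y = X @ true_theta + rng.randn(m) * 0.5

def cost(theta, X, y):
    """J(theta) = (1 / 2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * len(y))

# 1) Closed-form solution: theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Gradient descent: theta <- theta - lr * (1/m) * X^T (X theta - y)
theta_gd = np.zeros(X.shape[1])
lr = 0.1
for _ in range(2000):
    grad = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd -= lr * grad

print("closed form :", theta_closed.round(3), "cost:", round(cost(theta_closed, X, y), 4))
print("grad descent:", theta_gd.round(3), "cost:", round(cost(theta_gd, X, y), 4))
```

Both routes converge to essentially the same coefficients; gradient descent is simply the iterative alternative when inverting X^T X is impractical.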

A cost function is also denoted by L(X, Y). For regularization, an additional term is added to the cost function in Fig 2, and this term differs between L1 and L2 regression. In general, the cost function becomes

L(X, Y) + λ (regularization term)

This helps to avoid over-fitting and, for certain regularization norms (only L1), performs feature selection at the same time. What this lambda does is manage the complexity of the model: as lambda is increased, variance is reduced and bias is added to the model, so getting the right value of lambda is essential. Cross-validation is generally used to estimate it.
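As a hedged sketch of this general form (not from the original post), the regularized cost can be written directly on top of the OLS cost above; "l2" squares the coefficients (ridge), "l1" takes their absolute values (lasso), and the intercept is conventionally left unpenalized.

```python
# Sketch of the generic regularized cost L(X, y) + lambda * penalty.
# 'l2' corresponds to ridge, 'l1' to lasso.
import numpy as np

def regularized_cost(theta, X, y, lam=1.0, penalty="l2"):
    residuals = X @ theta - y
    data_term = (residuals @ residuals) / (2 * len(y))   # OLS cost from Fig 2
    coefs = theta[1:]                                    # intercept unpenalized
    if penalty == "l2":
        reg_term = lam * np.sum(coefs ** 2)
    elif penalty == "l1":
        reg_term = lam * np.sum(np.abs(coefs))
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return data_term + reg_term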

Regularization in Layman’s Terms

Let’s take this to a simpler level and understand what’s going on. Growing up as a kid is an amazing time for all of us; however, parents don’t share the same opinion, I believe. Parents often have to decide how much flexibility to give their children. Too much restriction may suppress imagination and character development. On the other hand, too much flexibility may spoil their future and make them careless. Parents have to optimize somewhere in the middle – which we can call “regularized flexibility“: giving enough flexibility, with regularization added.


Parents try to meet expectations with flexibility: watching TV is allowed, but the hours are defined and there are no late nights (regularization); comic books/ice-cream/chocolate are bought, but have to be shared with siblings (regularization); expensive books and notebooks are bought, but only if the homework is finished on time (regularization).

The same applies to a learning algorithm. The number of parameters the model is trained with on the observed data is often larger than the number actually needed to represent the problem, which hurts generalization. As a remedy for over-fitting, regularization is added along with those parameters. In the problem discussed above, not all variables are required to predict the height of a person. What regularization does is give more importance to the truly important parameters and suppress the others, thus reducing the complexity of the model.

Ridge Regression (L2)

Fig 5. Cost function for Ridge Regression

In addition to the cost function we had in the case of OLS, there is an additional term (shown in red), which is the regularization term. θ denotes the coefficients, and for ridge regression the addition is λ (the regularization parameter) times θ^2 (the squared norm of the coefficients). The regularization term penalizes big coefficients and pushes them towards zero, although it never makes them exactly zero. This means that if the θ’s take on large values, the optimization function is penalized; we would prefer smaller θ’s, or θ’s close to zero, to keep the penalty term small. This is also called L2 regularization. Geometrically, it pushes the weights with forces perpendicular to the surface of a sphere, so the resulting weights tend to be of similar magnitude, since most of the volume of the sphere lies in regions where the weights are similar.
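To see the shrinkage (but not exact zeroing) in action, here is a small scikit-learn sketch; the data is synthetic and the alpha values (scikit-learn's name for λ) are chosen arbitrarily.

```python
# Sketch: ridge shrinks coefficients towards zero as lambda (alpha) grows,
# but does not set them exactly to zero. Synthetic data; only 2 of the
# 10 predictors actually matter.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(3)
X = rng.randn(100, 10)
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.randn(100)

for alpha in [0.01, 1.0, 100.0]:          # alpha plays the role of lambda
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>6}: ", np.round(coefs, 3))
```

All ten coefficients stay non-zero at every alpha, but they get uniformly smaller as the penalty grows.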

Why does Ridge Regression provide better results?

It seeks to reduce the mean squared error (MSE) by adding some bias and, at the same time, reducing the variance. Remember that high variance corresponds to an over-fitting model.

When to use Ridge Regression?

When there are many predictors (with some collinearity among them) in the dataset and not all of them have the same predictive power, L2 regularization can be used to estimate predictor importance and penalize the predictors that are not important. One issue with collinearity is that the variance of the parameter estimates is huge. And in cases where the number of features is greater than the number of observations, the matrix used in the OLS closed-form solution (X^T X) may not be invertible, but ridge regression makes the corresponding matrix (X^T X + λI) invertible.
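A quick NumPy sketch of that invertibility point, with made-up dimensions: when there are more features than observations, X^T X is rank-deficient, but adding λI makes it full rank.

```python
# Sketch: with more features than observations, X^T X is singular, but
# X^T X + lambda * I (the ridge version) is invertible.
import numpy as np

rng = np.random.RandomState(4)
n, p = 20, 50                      # fewer observations than predictors
X = rng.randn(n, p)
lam = 1.0

xtx = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(xtx), "out of", p)   # rank <= n < p

ridge_matrix = xtx + lam * np.eye(p)
print("rank with ridge term:", np.linalg.matrix_rank(ridge_matrix))  # full rank p

# So a ridge closed-form solution exists even when OLS has no unique solution:
theta_ridge = np.linalg.solve(ridge_matrix, X.T @ rng.randn(n))
print("ridge coefficients shape:", theta_ridge.shape)
```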

Consider a simple example where the number of predictors is 2. The ellipses in red form the contour plot of the OLS cost, and the black spot at their center is the OLS solution, i.e. the minimum of that function. The other contour plot, in blue, is that of the regularization term, λ(θ1^2 + θ2^2), whose contours are circles. In ridge regression, the objective is to minimize the sum of these two terms, and the solution lies where the two sets of contours meet. The larger the penalty (λ), the narrower the blue contours become, and the two plots meet at a point closer to zero. The smaller the penalty (λ), the more the blue contours expand, and the intersection of the blue and red plots comes closer to the center of the red ellipses (the non-penalized solution).

Lasso Regression (L1)

One thing Ridge cannot be used for is variable selection, since it retains all the predictors. Lasso, on the other hand, overcomes this problem by forcing some of the coefficients to zero.

Lasso makes one small change to the cost equation: instead of the square of the norm, it takes the absolute value.

Fig 7. Cost equation for Lasso Regression & geometric interpretation for Lasso Regression

The lasso constraint region has “corners”, which in two dimensions correspond to a rhombus (a diamond). If the sum-of-squares contours “hit” one of these corners, then the coefficient corresponding to that axis is shrunk to zero. As the number of predictors increases, the multidimensional rhombus has an increasing number of corners, and so it is highly likely that some coefficients will be set exactly to zero. Hence, the lasso performs shrinkage and (effectively) subset selection. In contrast with subset selection, lasso performs soft thresholding: as the smoothing parameter (λ) is varied, the sample paths of the estimates move continuously towards zero.
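Here is a small scikit-learn sketch of that selection behaviour (synthetic data, arbitrarily chosen alpha): lasso zeroes out most of the irrelevant coefficients, while ridge merely shrinks them.

```python
# Sketch: lasso sets some coefficients exactly to zero (variable selection),
# unlike ridge. Synthetic data; only 2 of the 10 predictors matter.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(5)
X = rng.randn(100, 10)
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.randn(100)

lasso_coefs = Lasso(alpha=0.5).fit(X, y).coef_
ridge_coefs = Ridge(alpha=0.5).fit(X, y).coef_

print("lasso:", np.round(lasso_coefs, 3))   # most entries are exactly 0.0
print("ridge:", np.round(ridge_coefs, 3))   # small but non-zero entries
print("non-zero lasso coefficients:", np.count_nonzero(lasso_coefs))
```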

How to find the Right value of Lambda?

As always, cross-validation is used to estimate the appropriate value of lambda.
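For example, scikit-learn's built-in cross-validated estimators can search over a candidate grid of lambda (alpha) values; the grid and data below are made up for illustration.

```python
# Sketch: picking lambda (alpha) by cross-validation with scikit-learn.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.RandomState(6)
X = rng.randn(200, 10)
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.randn(200)

alphas = np.logspace(-3, 2, 30)             # candidate lambda values

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best ridge alpha:", ridge_cv.alpha_)
print("best lasso alpha:", lasso_cv.alpha_)
```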

To read about some examples of codes in Python & R, please visit the post on my blog at https://analyticsbot.ml/2017/01/l1-l2-regularization-why-neededwhat-it-doeshow-it-helps/

