L1, L2 Regularization – Why needed/What it does/How it helps?
Simple is better! That’s the whole notion behind regularization.
I recently wrote about Linear Regression and Bias Variance Tradeoff, so if those topics are not clear to you, I’d suggest you first visit those posts. In this post (also on my blog), I will be focusing on the concept of regularization, specifically in the context of linear regression. I must make it clear, though, that the concept is not limited to regression and can be applied to any learning algorithm. In the posts on linear regression and the bias variance tradeoff, I wrote about the problems associated with a learning algorithm – under-fitting and over-fitting. In real life, over-fitting is the problem that statisticians and data scientists/analysts mainly have to tackle. In the bias-variance tradeoff post, I wrote about how the “error” can be decomposed into reducible error (can be reduced further) and irreducible error (cannot be reduced further). The reducible error can be further broken down into “error due to squared bias” and “error due to variance”. I also wrote about the assumptions that linear regression (read: ordinary least squares) makes and the good practices to follow before running a linear regression.
Fig 1. Linear Regression (OLS) equation and solution
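The original figure is not reproduced here; written out in standard notation (my reconstruction of what such a figure typically shows), the OLS model and its closed-form solution are:

$$ y = X\theta + \varepsilon, \qquad \hat{\theta} = (X^{T}X)^{-1}X^{T}y $$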
Why Regularization?
In general, if the relationship between the response (Y) and the predictors (X) is approximately linear, the least squares estimates will have low bias. If the number of observations (N) is much larger than the number of predictors (often denoted p or P), then the least squares estimates tend to also have low variance, and hence will perform really well on the test set, with low error and high R2. On the other hand, if N is not much larger than P, then there can be a lot of variability in the least squares fit, resulting in over-fitting and therefore in poor predictions. And if P > N, then there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all. That is an important limitation of OLS. When OLS was developed, statisticians dealt with a small number of predictors, whereas today having thousands of predictors is not a big deal. Obviously, not all of these predictors are important in the model, and only a handful of them will have predictive power. One way to solve this problem is to shrink (or remove) the coefficients of predictors that are not so important in the model – this is called shrinkage. By shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible (depending on the regularization parameter value) increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.
OK, What is Over-fitting?
What is this problem of over-fitting and why does it arise? Let’s take a sample problem – assume that you want to predict the height of a person based on his/her age using a linear regression model. Your response variable (Y) is height and the predictor or independent variable (X) is age. How do you think the model will perform? Not so well, right? Well, it’s too simple.
Next – you have additional variables that you can add to the model: weight, sex, location. What you did here is add complexity to your data, which might have increased the prediction accuracy of the model. Now, you add even more variables to your model – height of parents, profession of parents, social background, number of children, number of books, preferred color, best meal, last holiday destination and so on and so forth. That’s too many variables – and most of them cannot explain someone’s height. Your model might do well on the training data, but it is probably over-fitting, i.e. it will probably have poor prediction and generalization power: it sticks too closely to the training data and has probably learned the background noise while being fit. When tried on unseen data, this model will perform poorly.
How to solve this Over-fitting Problem?
It is here where the regularization technique comes in handy. There are two main techniques used with linear regression: L1 (Lasso) and L2 (Ridge). In general, a regularization term is introduced into the loss/cost function.
What’s a cost function?
Whenever a model is trained on training data and is used to predict values on a testing set, there exists a difference between the true and predicted values. The closer the predicted values are to their corresponding real values, the better the model. In other words, a cost function measures how close the predicted values are to their corresponding real values. The function can be minimized or maximized, depending on the situation/problem. For example, in the case of ordinary least squares (OLS), the cost function would be:
Fig 2. Cost function for OLS
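The figure is not reproduced here; one common form of this cost function (my reconstruction, up to the choice of the constant factor) is:

$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^{2} $$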
where J denotes the cost function, m is the number of observations in the dataset, h(x) is the predicted value of the response and y is the true value of the response. In the case of OLS, the goal is to minimize this. There are two ways to solve OLS – one using the closed-form solution, which requires matrix inversion, multiplication, etc. (shown in fig 3), and the other using gradient descent on the cost function (fig 4).
Fig 3. Closed form solution
Fig 4. Cost Function convergence and gradient descent
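To make the two approaches concrete, here is a minimal NumPy sketch (my own illustration, not code from the original post) that fits the same OLS problem both ways:

```python
# Illustration only: fit OLS with the closed-form solution and with gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 observations, 3 predictors
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

# 1) Closed-form solution: theta = (X'X)^-1 X'y  (fig 3)
theta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# 2) Gradient descent on J(theta) = (1/2m) * sum((X theta - y)^2)  (fig 4)
m, lr = len(y), 0.1
theta_gd = np.zeros(3)
for _ in range(1000):
    grad = X.T @ (X @ theta_gd - y) / m       # gradient of the cost
    theta_gd -= lr * grad

print(theta_closed)                           # both should be close to [2.0, -1.0, 0.5]
print(theta_gd)
```

Both estimates should agree up to the tolerance of the gradient descent run.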
A cost function is also denoted by L(X,Y). In addition to the cost function in fig 2, an additional term is added, which varies depending on whether L1 or L2 regularization is used. In general, the cost function becomes
L(X,Y) + λ (regularization term)
This helps to avoid over-fitting and will, at the same time, perform feature selection for certain regularization norms (L1 only). What this lambda does is amazing – it manages the complexity of the model. As lambda is increased, variance is reduced and bias is added to the model, so getting the right value of lambda is essential. Cross-validation is generally used to estimate it.
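Concretely (my notation, using θ for the coefficients and p for the number of predictors), the two penalties discussed below are:

$$ \text{Ridge (L2):}\; L(X,Y) + \lambda \sum_{j=1}^{p} \theta_j^{2} \qquad\qquad \text{Lasso (L1):}\; L(X,Y) + \lambda \sum_{j=1}^{p} |\theta_j| $$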
Regularization in Layman’s Terms
Let’s take it to a simpler dimension and understand what’s going on. Growing up as a kid is an amazing time for all of us; however, parents don’t share the same opinion, I believe. Parents oftentimes have to make decisions about how much flexibility should be given to their children. Too much restriction may suppress imagination and character development. On the other hand, too much flexibility may spoil their future and make them careless. Parents have to optimize somewhere in the middle – which we can call “regularized flexibility”: giving enough flexibility, with regularization added.
Parents try to meet expectations with flexibility: watch TV, but only for a defined number of hours and no late nights (regularization); buy comic books/ice-cream/chocolate, but they have to be shared with siblings (regularization); buy expensive books and copies, but only if homework is finished on time (regularization).
The same applies to a learning algorithm. The model is often given more parameters than are strictly required to represent the problem, and regularization techniques are added along with those parameters as a remedy for over-fitting, which helps the model generalize well. In the problem discussed above, not all variables are required to predict the height of a person. What regularization does is give more importance to the truly important parameters and suppress the others, thus reducing the complexity of the model.
Ridge Regression (L2)
Fig 5. Cost function for Ridge Regression
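The figure is not reproduced here; one common way to write the ridge cost (my reconstruction, following the same notation as fig 2) is:

$$ J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^{2} + \lambda \sum_{j=1}^{p} \theta_j^{2}\right] $$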
In addition to the cost function we had in the case of OLS, there is an additional term added (shown in red in the figure), which is the regularization term. θ denotes the coefficients, and for ridge regression the added term is λ (the regularization parameter) times the squared norm of the coefficients. The regularization term penalizes big coefficients and shrinks them toward zero, although it does not make them exactly zero. This means that if the θ’s take on large values, the optimization function is penalized. We would prefer smaller θ’s, or θ’s that are close to zero, to keep the penalty term small. This is also called L2 regularization; geometrically, the penalty constrains the weights to lie within a sphere, so they are likely to end up fairly similar in size, since most of the volume of the sphere lies in regions where the weights are comparable.
Why does Ridge Regression provide better results?
It seeks to reduce the MSE by adding some bias and, at the same time, reducing the variance. Remember that high variance corresponds to an over-fitting model.
When to use Ridge Regression?
When there are many predictors (with some collinearity among them) in the dataset and not all of them have the same predictive power, L2 regularization can be used to estimate predictor importance and penalize predictors that are not important. One issue with collinearity is that the variance of the parameter estimates is huge. In cases where the number of features is greater than the number of observations, the matrix used in the OLS closed-form solution may not be invertible, but ridge regression makes this matrix invertible.
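The reason is the closed-form ridge solution, a standard result (not written out explicitly in the original post): adding λ times the identity matrix before inverting makes the matrix non-singular for any λ > 0.

$$ \hat{\theta}_{\text{ridge}} = (X^{T}X + \lambda I)^{-1} X^{T} y $$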
Consider a simple example where the number of predictors is 2. The ellipses in red form the contour plot of the OLS objective, and its solution is the black spot at the center of the red ellipses, which is also the minimum of that function. The other contour plot, in blue, is that of the regularization term, denoted by λ(θ1^2 + θ2^2). In the case of ridge regression, the objective is to minimize the sum of these two values, which is achieved where the two contours meet. The larger the penalty (λ), the narrower the blue contours become, and the plots meet at a point closer to zero. The smaller the penalty (λ), the more the blue contours expand, and the intersection of the blue and red plots comes closer to the center of the red ellipses (the non-penalized solution).
Lasso Regression (L1)
One thing Ridge cannot be used for is variable selection, since it retains all the predictors. Lasso, on the other hand, overcomes this problem by forcing some of the coefficients to zero.
Lasso makes one small change to the cost equation: instead of the square of the norm, it takes the absolute value.
Fig 7. Cost Equation for Lasso Regression & Geometric interpretation for Lasso Regression
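Again, the figure is not reproduced here; one common way to write the lasso cost (my reconstruction, in the same notation as above) is:

$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^{2} + \lambda \sum_{j=1}^{p} |\theta_j| $$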
The lasso performs L1 shrinkage, so that there are “corners” in the constraint region, which in two dimensions corresponds to a rhombus. If the sum-of-squares contours “hit” one of these corners, then the coefficient corresponding to that axis is shrunk to zero. As the number of predictors increases, the multidimensional rhombus has an increasing number of corners, and so it is highly likely that some coefficients will be set exactly to zero. Hence, the lasso performs shrinkage and (effectively) subset selection. In contrast with hard subset selection, lasso performs soft thresholding: as the smoothing parameter is varied, the sample path of the estimates moves continuously to zero.
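A small scikit-learn sketch (my own illustration, not from the original post; alpha is scikit-learn’s name for λ) makes this sparsity visible:

```python
# Illustration only: with irrelevant predictors, Lasso drives their coefficients
# exactly to zero, while Ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 predictors matter

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 3))      # small but non-zero everywhere
print(np.round(lasso.coef_, 3))      # irrelevant coefficients are exactly 0.0
```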
How to find the Right value of Lambda?
As always, cross-validation is used to estimate the appropriate value of lambda.
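For instance, here is a minimal sketch using scikit-learn’s built-in cross-validated estimators (again my own illustration, assuming scikit-learn is available; alpha is the library’s name for lambda):

```python
# Illustration only: RidgeCV/LassoCV pick lambda by cross-validation
# over a grid of candidate values.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=300)

ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5).fit(X, y)

print("best ridge lambda:", ridge_cv.alpha_)
print("best lasso lambda:", lasso_cv.alpha_)
```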
To read some code examples in Python & R, please visit the post on my blog at https://analyticsbot.ml/2017/01/l1-l2-regularization-why-neededwhat-it-doeshow-it-helps/