REGULARLY USE REGULARIZATION: REGULARIZATION IN MACHINE LEARNING
The word "regularize" means to make things regular or acceptable. Regularization refers to techniques used to calibrate machine learning models by adding a penalty term to the loss function, so that the model does not overfit the training data. There are three types of regularization in machine learning: L1 regularization, also called Least Absolute Shrinkage and Selection Operator (LASSO) regression; L2 regularization, also called Ridge regression; and Elastic Net regularization, which combines the two.

NOTE: Standardization or normalization of the features is required before any regularization.

Lambda (λ) is a hyperparameter known as the regularization constant, and it is non-negative. It shrinks the coefficients of the features toward zero.

Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions. Ridge regression is used to reduce the complexity of the model. The amount of bias added to the model is called the Ridge regression penalty.
RIDGE REGRESSION = SUM OF SQUARED RESIDUALS + λ * SLOPE²
When λ = 0, the penalty term has no effect: the loss function reduces to the residual sum of squares we started with, so we get the same coefficients as simple linear regression. When 0 < λ < ∞, for simple linear regression, the ridge coefficient lies somewhere between zero and the ordinary least squares coefficient, shrinking toward zero as λ grows. When λ → ∞, the ridge coefficient goes to zero, because the penalty dominates the loss function and minimizing the squared coefficients drives the parameter values to 0.
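To make this concrete, here is a minimal sketch (synthetic data; scikit-learn's Ridge, where λ is exposed as the alpha parameter) showing that alpha = 0 reproduces the simple linear regression slope and that the slope shrinks toward zero as alpha grows:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Plain least squares slope, for reference
print("OLS slope:", LinearRegression().fit(X, y).coef_[0])

# Ridge slope for increasing penalty strength
for alpha in [0.0, 1.0, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # alpha = 0 matches the OLS slope; larger alpha shrinks the slope toward 0
    print(f"alpha={alpha:>7}: slope={ridge.coef_[0]:.4f}")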
Ridge regression includes all the features present in the model. When the independent variables have high collinearity (multicollinearity) between them, general linear or polynomial regression will fail, so Ridge regression can be used to solve such problems. Ridge does not help with feature selection: it decreases the complexity of a model but does not reduce the number of independent variables, since it never drives a coefficient exactly to zero; it only shrinks it. Hence, this technique is not good for feature selection.
Its disadvantage is model interpretability, since it shrinks the coefficients of the least important predictors very close to zero but never makes them exactly zero. In other words, the final model will include all the independent variables, also known as predictors.
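Below is a hedged sketch, using made-up collinear data and scikit-learn, of both points at once: ridge handles two nearly identical predictors gracefully, but it keeps every coefficient non-zero (features are standardized first, as noted above):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)      # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.3, size=200)

# Standardize, then fit ridge; the penalty splits the effect across the collinear pair
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)         # both coefficients are shrunk, neither exactly zero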
Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function. NOTE: During regularization the output function (y_hat) does not change; the change is only in the loss function.
LASSO = SUM OF SQUARED RESIDUALS + λ * |SLOPE|
When λ = 0, we get the same coefficients as simple linear regression. When λ = ∞, the lasso coefficient will be zero. When 0 < λ < ∞, for simple linear regression, the lasso coefficient lies somewhere between zero and the simple linear regression coefficient.
Since it takes absolute values, it can shrink a slope all the way to 0. The lasso method therefore also performs feature selection and is said to yield sparse models. If the number of predictors is greater than the number of data points n, lasso will pick at most n predictors as non-zero, even if all predictors are relevant. If there are two or more highly collinear variables, LASSO regression selects one of them essentially at random, which is not good for the interpretation of the model.
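The sparsity is easy to see in a small sketch (synthetic data, scikit-learn's Lasso; the alpha value is illustrative): only the two truly informative features keep non-zero coefficients, the rest are set exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)   # only 2 of 10 features matter

X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_std, y)
print(lasso.coef_)                                         # most entries are exactly 0.0
print("selected features:", np.flatnonzero(lasso.coef_))   # indices of the surviving predictors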
Elastic Net regression: it is a combination of Ridge and Lasso regression.
ELASTIC NET = SUM OF SQUARED RESIDUALS + λ1 * (|variable1| + |variable2| + ...) + λ2 * (variable1² + variable2² + ...)
When many variables are present and we can't determine whether to use Ridge or Lasso regression, Elastic Net regression is a safe bet.
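Here is a minimal sketch with scikit-learn's ElasticNet. Note that instead of two separate λ values, scikit-learn uses a single alpha for the overall penalty strength and l1_ratio for the L1/L2 mix (1.0 is pure lasso, 0.0 is pure ridge); the data and parameter values are illustrative:

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_std, y)
print(enet.coef_)   # shrinkage from the L2 part plus some exact zeros from the L1 part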
For regression models, use L1/L2 regularization; for decision trees, use ensemble techniques such as bagging, boosting, and stacking to control overfitting.
#statistics #datascience #machinelearning Krish Naik Sunny Savita