PaperScore's activity

PaperScore reposted this

Daniel Armani, Ph.D.

Data Scientist | Scaffolder at PaperScore

Some data scientists like to use deep learning to tackle every problem! While Deep Neural Networks are powerful, they also have limitations. For instance, interpreting a trained model can be challenging, whereas linear regression models provide valuable theoretical insights. Often, a well-constructed linear regression model, incorporating interaction terms, transformations, polynomial terms, dummy variables, and lagged predictors, is sufficiently robust for many problems.

For example, when predicting Sales for a company, you can include a variety of terms (covariates): AdSpend (advertising costs), Price (treatment), Seasonality (time of year, encoded as dummy variables), LaggedSales (Sales in the previous year), AdSpend × ProductQuality, Price / CompetitorPrice, Price², Log(LaggedSales), etc.

However, this can lead to hundreds of terms, which can result in overfitting and multicollinearity. To avoid that, we can apply regularization techniques and select the most relevant terms. Consider the multiple linear regression model:

y = Xβ + ε

X: an n-by-m design matrix containing the training data for ALL the potential covariates, including all the predictors, their transformations, interactions, etc. It also includes a column of ones for the intercept.
y: an n-by-1 vector of observed outcomes in the training data
ε: an n-by-1 vector of errors
β: an m-by-1 vector of coefficients

OLS can give us the optimal coefficients β*, but the result is probably overfit. To find the optimal subset of terms in the model, let's define some variables:

z: an m-by-1 binary vector indicating which terms are included in the model. So m(z) = sum(z) is the number of terms in the final regression model. Since we always need an intercept, z[1] = 1.
X_z = X[ , z]: an n-by-m(z) design matrix with only the columns where z is 1.
β_z: an m(z)-by-1 vector of coefficients for the chosen terms based on z.

The new regression model becomes y = X_z β_z + ε. If we multiply both sides by (X_z'X_z)⁻¹X_z' and assume that the ε term is negligible, we obtain β_z* = (X_z'X_z)⁻¹X_z'y, which is the OLS result, minimizing the residual sum of squares (y − X_z β_z)'(y − X_z β_z).

Moreover:

T: the test dataset (complete matrix)
T_z: the subset of the test data containing only the selected columns based on the vector z
y': the vector of observed outcomes in the test data, with estimated outcomes ŷ' = T_z β_z*
e = y' − T_z β_z*: the vector of Out-of-Sample (OOS) residuals (for z)

To find the optimal subset of terms (z*), we minimize the OOS Mean Squared Error, e'e / n'. But since the size of the test set (n') is constant, this BIP (Binary Integer Programming) problem summarizes everything:

Minimize e'e over z ∈ {0, 1}^m, subject to z[1] = 1

This R code solves the problem using different methods: https://lnkd.in/eWJRNjmt

In this example, stepwise regression (AIC) reached an OOS MSE of 1.005 after 3 minutes. Lasso, a regularization method that penalizes the OLS objective for model complexity, reached an OOS MSE of 1.16 almost instantly. The Genetic Algorithm achieved the lowest OOS MSE of 0.99, but took about 12 minutes to compute.
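To make the setup concrete, here is a minimal sketch in R (not the linked script; the simulated data, the variable names, and the oos_mse() helper are all illustrative). It expands a toy Sales data set into a wide design matrix of engineered terms and scores any candidate inclusion vector z by its out-of-sample MSE.

# Minimal sketch of the setup described above; everything here is simulated.
set.seed(1)
n <- 600
raw <- data.frame(
  AdSpend         = runif(n, 1, 10),
  Price           = runif(n, 5, 15),
  CompetitorPrice = runif(n, 5, 15),
  ProductQuality  = runif(n, 1, 5),
  Season          = factor(sample(c("Q1", "Q2", "Q3", "Q4"), n, replace = TRUE)),
  LaggedSales     = runif(n, 50, 150)
)
raw$Sales <- 20 + 3 * raw$AdSpend - 2 * raw$Price + 0.5 * raw$LaggedSales + rnorm(n)

# Design matrix X: dummies, an interaction, a ratio, a square, a log, plus the intercept.
X <- model.matrix(
  ~ AdSpend + Price + Season + LaggedSales +
    AdSpend:ProductQuality + I(Price / CompetitorPrice) +
    I(Price^2) + log(LaggedSales),
  data = raw
)
y <- raw$Sales
m <- ncol(X)

# Train/test split: T_test is the complete test matrix T, y_test is y'.
test_idx <- sample(n, n / 3)
X_train <- X[-test_idx, ]; y_train <- y[-test_idx]
T_test  <- X[test_idx, ];  y_test  <- y[test_idx]

# OOS MSE for a binary inclusion vector z (z[1] = 1 keeps the intercept).
oos_mse <- function(z) {
  z[1] <- 1
  cols <- which(z == 1)
  beta <- qr.solve(X_train[, cols, drop = FALSE], y_train)   # OLS on the chosen terms
  e    <- y_test - T_test[, cols, drop = FALSE] %*% beta     # OOS residuals e
  mean(e^2)                                                  # e'e / n'
}

oos_mse(rep(1, m))                      # the full model
oos_mse(c(1, rbinom(m - 1, 1, 0.5)))    # a random subset of terms

Given such an objective, the comparison in the linked script could presumably be reproduced along the same lines, e.g. by passing -oos_mse(z) as the fitness function to GA::ga(type = "binary", nBits = m, ...), or by benchmarking against step() and glmnet.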

Daniel Armani, Ph.D.

Data Scientist | Scaffolder at PaperScore

2 months

... However, this can result in overfitting z* to the test data, especially when m is large. To prevent this, a more robust approach is to partition (cross-validate) the data into K folds and minimize the sum of the OOS SSE (Sum of Squared Errors) across all K training–test pairs:

Minimize e_1'e_1 + e_2'e_2 + ... + e_K'e_K

This would be the ultimate BIP problem, one that generalizes better to different OOS data sets.
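A rough sketch of this K-fold objective, reusing the X, y, and m from the sketch under the original post (the fold assignment and the cv_sse() helper name are illustrative):

# Sum of OOS SSE across K folds for a given inclusion vector z.
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(X)))

cv_sse <- function(z) {
  z[1] <- 1
  cols <- which(z == 1)
  sse  <- 0
  for (k in 1:K) {
    tr   <- folds != k                                     # training rows for fold k
    beta <- qr.solve(X[tr, cols, drop = FALSE], y[tr])     # OLS fit on the other folds
    e    <- y[!tr] - X[!tr, cols, drop = FALSE] %*% beta   # OOS residuals e_k
    sse  <- sse + sum(e^2)                                 # + e_k'e_k
  }
  sse   # minimize this over all binary z: the BIP objective described above
}

cv_sse(rep(1, m))   # the full model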

Daniel Armani, Ph.D.

Data Scientist | Scaffolder at PaperScore

2 months

Meanwhile, although Lasso (Least Absolute Shrinkage and Selection Operator) is fast, its results are not BLUE (Best Linear Unbiased Estimator). The penalty term, which shrinks some coefficients toward zero to prevent overfitting, introduces bias. On the other hand, the results from the GA and Stepwise Regression are BLUE, since their final coefficients come from an OLS fit on the selected terms. I am not sure why the Lasso technique remains so popular!
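The shrinkage point can be illustrated with a small sketch (it assumes the glmnet package and reuses the X and y from the first sketch; the refit step is a relaxed-lasso-style illustration, not part of the original analysis): fit the Lasso, then refit plain OLS on only the terms it selected, and compare the two sets of coefficients.

library(glmnet)

fit   <- cv.glmnet(X[, -1], y, alpha = 1)    # Lasso with a cross-validated penalty
b_las <- coef(fit, s = "lambda.min")         # penalized (shrunken) coefficients
nz    <- which(as.vector(b_las)[-1] != 0)    # indices of the non-zero slopes
keep  <- colnames(X)[-1][nz]                 # names of the Lasso-selected terms

# Refit ordinary OLS on only the selected terms: no penalty, hence no shrinkage.
ols <- lm(y ~ X[, keep, drop = FALSE])
cbind(lasso = as.vector(b_las)[c(1, nz + 1)], ols = coef(ols))

The Lasso column is typically pulled toward zero relative to the OLS refit, which is exactly the bias the penalty introduces.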
