3 1-Minute Hacks to Improve Your Models


1) If your sole purpose is prediction rather than statistical inference, you can speed up a logistic regression by numericizing the binary outcome to 0 and 1 and fitting a linear regression on it as if it were continuous. Logistic regression models the log-odds, log(p/(1-p)), to ensure that p, the predicted probability of the outcome, falls between 0 and 1. That link function is monotone, so it preserves the ordering of the predicted scores. Hence, if you swap the logit model for a plain linear one on the numeric 0/1 outcome, your area under the curve stays essentially the same and you'll arrive at the same sensitivity and specificity (or precision and recall, if you prefer) that you consider optimal.
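As a quick sanity check, here is a minimal sketch in Python (assuming scikit-learn and a simulated dataset, not any particular real data) that fits both a plain linear regression on the 0/1 outcome and a logistic regression, then compares the AUCs of the two score rankings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated binary-outcome data; swap in your own features and labels.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Treat the 0/1 outcome as continuous and fit ordinary least squares.
lin_scores = LinearRegression().fit(X_train, y_train).predict(X_test)

# Standard logistic regression for comparison.
log_scores = (
    LogisticRegression(max_iter=1000)
    .fit(X_train, y_train)
    .predict_proba(X_test)[:, 1]
)

# AUC depends only on how the scores rank the observations, so the two
# models can be compared directly without any probability calibration.
print("Linear regression AUC:", roc_auc_score(y_test, lin_scores))
print("Logistic regression AUC:", roc_auc_score(y_test, log_scores))
```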

This speed-up matters most when the model is computationally burdensome to begin with, such as a mixed effects model. Rather than fitting the logistic version, you can stick to the much faster linear mixed effects model, which lets you work through many more combinations of input variables in the same amount of time.
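For the mixed-effects case, a rough sketch with statsmodels might look like the following; the data here is simulated purely for illustration, and the column names y, x, and group are my own placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated clustered data: 50 groups, binary outcome numericized to 0/1.
rng = np.random.default_rng(0)
n_groups, n_per_group = 50, 40
group = np.repeat(np.arange(n_groups), n_per_group)
x = rng.normal(size=n_groups * n_per_group)
group_effect = rng.normal(scale=0.5, size=n_groups)[group]
p = 1 / (1 + np.exp(-(0.8 * x + group_effect)))
y = rng.binomial(1, p)
df = pd.DataFrame({"y": y, "x": x, "group": group})

# Linear mixed model fit directly on the 0/1 outcome: much faster to refit
# across many candidate predictor sets than a full logistic mixed model.
lmm = smf.mixedlm("y ~ x", data=df, groups=df["group"]).fit()
print(lmm.summary())
```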


2) In the same spirit, if you're predicting a binary outcome with any model and the outcome's raw form was continuous, you can use the regressor version of the model to predict it in that continuous form. I've often found that a RandomForestRegressor predicts better than a RandomForestClassifier, for example, which accords with the rule of thumb that categorization makes you 'lose information.' The threshold that yields your optimal sensitivity and specificity can then serve as your threshold for dichotomization. If you had a fixed threshold to begin with, you can still claim the approach works for the predetermined one because you are merely 'adding or subtracting bias,' just as the ROC curve exists precisely because you don't have to pick p=0.5 when running a logistic regression; compensating for that bias is implicitly accepted whenever people move along the curve to different thresholds. (The intuition of correcting bias is also similar to what gradient boosting allows for, but I won't elaborate on it here.)
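Here is a minimal sketch of that comparison, assuming scikit-learn and a simulated dataset in which the binary label comes from thresholding a continuous target; it is illustrative, not a guarantee that the regressor wins on your data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated data where the binary label is a thresholded continuous target,
# mimicking an outcome whose raw form was continuous.
X, y_cont = make_regression(n_samples=3000, n_features=20, noise=10.0, random_state=0)
y_bin = (y_cont > np.median(y_cont)).astype(int)

X_tr, X_te, yc_tr, yc_te, yb_tr, yb_te = train_test_split(
    X, y_cont, y_bin, random_state=0
)

# Regressor trained on the continuous target.
reg_scores = RandomForestRegressor(random_state=0).fit(X_tr, yc_tr).predict(X_te)

# Classifier trained directly on the dichotomized labels.
clf_scores = RandomForestClassifier(random_state=0).fit(X_tr, yb_tr).predict_proba(X_te)[:, 1]

print("Regressor-on-continuous AUC:", roc_auc_score(yb_te, reg_scores))
print("Classifier-on-binary AUC:  ", roc_auc_score(yb_te, clf_scores))
```

Since AUC only looks at how the scores rank the test cases, the regressor's raw continuous predictions can be evaluated against the binary labels without any rescaling.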


3) Most of the techniques commonly paired with naive Bayes classifiers, such as tf-idf weighting and variable selection (information gain, chi-square, cluster representation quality, etc.), also work well with other high-dimensional methods like k-nearest-neighbors and gradient boosting machines. There's no reason those methods wouldn't benefit from feature selection and from tf-idf, which downweights 'throwaway' features that do little to distinguish between the classification outcomes. The appeal of these naive Bayes-style techniques is that they can be written with simple mathematical operators, and sometimes in only a few lines. Being expressible in simple operators makes them easy to implement in very fast languages like Java, which already have canonical ML algorithms like KNN in place.
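A rough sketch of the idea, assuming scikit-learn and using a small two-category slice of the 20 newsgroups corpus purely as a stand-in for any high-dimensional problem:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Two-class text problem used only for illustration.
train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
test = fetch_20newsgroups(subset="test", categories=["sci.med", "sci.space"])

# tf-idf downweights uninformative 'throwaway' terms; chi-square keeps the
# features most associated with the class labels. Both are usually discussed
# alongside naive Bayes, but nothing stops them from feeding a KNN instead.
knn_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=2000),
    KNeighborsClassifier(n_neighbors=15),
)

knn_pipeline.fit(train.data, train.target)
print("KNN accuracy with tf-idf + chi2 selection:",
      knn_pipeline.score(test.data, test.target))

# A gradient boosting machine (e.g. GradientBoostingClassifier) could be
# swapped in as the final step in the same way.
```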
