3 1-Minute Hacks to Improve Your Models


1) If your sole purpose is prediction rather than statistical inference, you can speed up a logistic regression by numericizing the binary outcome to 0 and 1 and fitting a linear regression on it as if it were continuous. Logistic regression models the log-odds, log(p/(1-p)), to ensure that p, the predicted probability of the outcome, falls between 0 and 1. That link function is monotone, so it preserves the ordering of the predicted scores. Hence, if you swap the logit model for a plain linear one on the numeric 0/1 outcome, your area under the curve stays essentially the same and you'll arrive at the same sensitivity and specificity (or precision and recall, if you prefer) that you consider optimal.
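As a quick sanity check, here is a minimal sketch in Python (assuming scikit-learn and a simulated dataset, not any particular real data) that fits both a plain linear regression on the 0/1 outcome and a logistic regression, then compares the AUCs of the two score rankings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated binary-outcome data; swap in your own features and labels.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Treat the 0/1 outcome as continuous and fit ordinary least squares.
lin_scores = LinearRegression().fit(X_train, y_train).predict(X_test)

# Standard logistic regression for comparison.
log_scores = (
    LogisticRegression(max_iter=1000)
    .fit(X_train, y_train)
    .predict_proba(X_test)[:, 1]
)

# AUC depends only on how the scores rank the observations, so the two
# models can be compared directly without any probability calibration.
print("Linear regression AUC:", roc_auc_score(y_test, lin_scores))
print("Logistic regression AUC:", roc_auc_score(y_test, log_scores))
```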

This speed-up matters most when the model is computationally burdensome to begin with, such as a mixed effects model. Rather than fitting the logistic version, you can stick to the much faster linear mixed effects model, which lets you work through many more combinations of input variables in the same amount of time.
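For the mixed-effects case, a rough sketch with statsmodels might look like the following; the data here is simulated purely for illustration, and the column names y, x, and group are my own placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated clustered data: 50 groups, binary outcome numericized to 0/1.
rng = np.random.default_rng(0)
n_groups, n_per_group = 50, 40
group = np.repeat(np.arange(n_groups), n_per_group)
x = rng.normal(size=n_groups * n_per_group)
group_effect = rng.normal(scale=0.5, size=n_groups)[group]
p = 1 / (1 + np.exp(-(0.8 * x + group_effect)))
y = rng.binomial(1, p)
df = pd.DataFrame({"y": y, "x": x, "group": group})

# Linear mixed model fit directly on the 0/1 outcome: much faster to refit
# across many candidate predictor sets than a full logistic mixed model.
lmm = smf.mixedlm("y ~ x", data=df, groups=df["group"]).fit()
print(lmm.summary())
```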


2) In the same spirit, if you're predicting a binary outcome with any model and the outcome's raw form was continuous, you can use the regressor version of the model to predict it in that continuous form. I've often found that a RandomForestRegressor predicts better than a RandomForestClassifier, for example, which accords with the rule of thumb that categorization makes you 'lose information.' The threshold that yields your optimal sensitivity and specificity can then serve as your threshold for dichotomization. If you had a fixed threshold to begin with, you can still claim the approach works for the predetermined one because you are merely 'adding or subtracting bias,' just as the ROC curve exists precisely because you don't have to pick p=0.5 when running a logistic regression; compensating for that bias is implicitly accepted whenever people move along the curve to different thresholds. (The intuition of correcting bias is also similar to what gradient boosting allows for, but I won't elaborate on it here.)
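Here is a minimal sketch of that comparison, assuming scikit-learn and a simulated dataset in which the binary label comes from thresholding a continuous target; it is illustrative, not a guarantee that the regressor wins on your data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated data where the binary label is a thresholded continuous target,
# mimicking an outcome whose raw form was continuous.
X, y_cont = make_regression(n_samples=3000, n_features=20, noise=10.0, random_state=0)
y_bin = (y_cont > np.median(y_cont)).astype(int)

X_tr, X_te, yc_tr, yc_te, yb_tr, yb_te = train_test_split(
    X, y_cont, y_bin, random_state=0
)

# Regressor trained on the continuous target.
reg_scores = RandomForestRegressor(random_state=0).fit(X_tr, yc_tr).predict(X_te)

# Classifier trained directly on the dichotomized labels.
clf_scores = RandomForestClassifier(random_state=0).fit(X_tr, yb_tr).predict_proba(X_te)[:, 1]

print("Regressor-on-continuous AUC:", roc_auc_score(yb_te, reg_scores))
print("Classifier-on-binary AUC:  ", roc_auc_score(yb_te, clf_scores))
```

Since AUC only looks at how the scores rank the test cases, the regressor's raw continuous predictions can be evaluated against the binary labels without any rescaling.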


3) Most of the techniques commonly paired with naive Bayes classifiers, such as tf-idf weighting and variable selection (information gain, chi-square, cluster representation quality, etc.), also work well with other high-dimensional methods like k-nearest-neighbors and gradient boosting machines. There's no reason those methods wouldn't benefit from feature selection and from tf-idf, which downweights 'throwaway' features that do little to distinguish between the classification outcomes. The appeal of these naive Bayes-style techniques is that they can be written with simple mathematical operators, and sometimes in only a few lines. Being expressible in simple operators makes them easy to implement in very fast languages like Java, which already have canonical ML algorithms like KNN in place.
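A rough sketch of the idea, assuming scikit-learn and using a small two-category slice of the 20 newsgroups corpus purely as a stand-in for any high-dimensional problem:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Two-class text problem used only for illustration.
train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
test = fetch_20newsgroups(subset="test", categories=["sci.med", "sci.space"])

# tf-idf downweights uninformative 'throwaway' terms; chi-square keeps the
# features most associated with the class labels. Both are usually discussed
# alongside naive Bayes, but nothing stops them from feeding a KNN instead.
knn_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=2000),
    KNeighborsClassifier(n_neighbors=15),
)

knn_pipeline.fit(train.data, train.target)
print("KNN accuracy with tf-idf + chi2 selection:",
      knn_pipeline.score(test.data, test.target))

# A gradient boosting machine (e.g. GradientBoostingClassifier) could be
# swapped in as the final step in the same way.
```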
