Q1 answer

Imagine you have a binary classification problem in which you use logistic regression, or any other model that returns the probability of belonging to each class. As you may know, you decide whether a sample belongs to a class by comparing that probability against a threshold. Most implementations use 0.5 by default. So here is the question: how do you find the best threshold for your classification problem?
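To make the role of the threshold concrete, here is a minimal sketch in plain Python. The function name classify and the sample probabilities are made up for illustration, and treating a probability equal to the threshold as positive is an assumption; libraries may handle that boundary differently.

```python
def classify(probabilities, threshold=0.5):
    """Label a sample positive (1) when its predicted probability reaches the threshold.

    Assumption: a probability exactly equal to the threshold counts as positive.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities of the positive class for four samples.
probs = [0.15, 0.45, 0.55, 0.90]

print(classify(probs))       # default threshold 0.5 -> [0, 0, 1, 1]
print(classify(probs, 0.6))  # stricter threshold    -> [0, 0, 0, 1]
```

The same probabilities give different labels once the threshold moves, which is why the choice deserves attention.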


The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). Each point on the curve corresponds to one threshold, so moving along the curve is the same as changing the classification threshold. One option is to choose the threshold at the point where the curve starts to saturate, that is, where accepting a higher FPR no longer buys much extra TPR. Changing the threshold also changes recall and precision. Sometimes your problem comes with constraints. Think about automatically issued penalty notices (fines): there you need high precision rather than high recall. In such cases, plotting recall and precision against the threshold can help you find a better choice. So in your classification problems, do not rely on the default threshold.
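As a rough sketch of the precision-constrained case, the snippet below scans candidate thresholds and returns the lowest one that meets a required precision. Because recall can only fall as the threshold rises, the lowest qualifying threshold is also the one with the highest recall. The function names, the labels, and the probabilities are all hypothetical, and in practice the scan would be run on held-out validation data rather than on the training set.

```python
def precision_recall(y_true, probs, threshold):
    """Compute precision and recall when predicting positive for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_with_precision(y_true, probs, min_precision):
    """Return (threshold, precision, recall) for the lowest threshold meeting min_precision."""
    # The observed probabilities are the only thresholds at which predictions change.
    for t in sorted(set(probs)):
        precision, recall = precision_recall(y_true, probs, t)
        if precision >= min_precision:
            return t, precision, recall
    return None  # no threshold satisfies the constraint


# Hypothetical validation labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

print(lowest_threshold_with_precision(y_true, probs, 0.75))
```

Printing the precision and recall for every threshold in the loop gives the same numbers you would put on a precision and recall vs. threshold plot.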

Alireza Hashemi

PhD Student in Physics | Data Science & Machine Learning

4 years ago

An important thing to note is not to touch this threshold at any cost. You can't just pick the threshold from a PR curve plot that corresponds to your desired FP rate and use it! To fine-tune a logistic regression classifier, you should use class weights instead: plot the FP rate vs. class weight and choose the proper class weight.

Ehsan Yaghoubi

Mechanical/Energy engineering graduate

4 years ago

For sure there's going to be a trade-off between precision and recall (in other words, between minimizing false positives and minimizing false negatives), depending on the nature of the classification, especially for skewed classes, e.g. cancer diagnosis. Accordingly, there is no single prescription that is the absolute best solution. However, there is an interesting, practical approach mentioned by Prof. Andrew Ng in his online Machine Learning course offered by Stanford University: simply try a range of thresholds, and then pick the threshold that gives you the highest F1 score on your cross-validation set. The F1 score is commonly defined as follows:

F1 = 2 * (P * R) / (P + R)

where P and R stand for "precision" and "recall" respectively:

P = (Num. true pos.) / (Num. true pos. + Num. false pos.)
R = (Num. true pos.) / (Num. true pos. + Num. false neg.)

Hope it was useful for you :)
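The threshold sweep described in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the function names, the validation labels, the probabilities, and the candidate grid of 0.05 to 0.95 are all made up, and an F1 of 0 is returned when there are no true positives to avoid dividing by zero.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 = 2 * P * R / (P + R) when predicting positive for probs >= threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    if tp == 0:
        return 0.0  # assumption: define F1 as 0 when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))


# Hypothetical cross-validation labels and predicted probabilities.
y_val = [0, 0, 1, 0, 1, 1, 0, 1]
p_val = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

# Candidate thresholds 0.05, 0.10, ..., 0.95 (an arbitrary grid).
candidates = [i / 20 for i in range(1, 20)]

best = best_f1_threshold(y_val, p_val, candidates)
print(best, f1_at_threshold(y_val, p_val, best))
```

On this toy data the best F1 is reached below 0.5, which shows that the default threshold is not automatically the F1-optimal one.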

Kavan Alipanahi

Data Scientist | AI Specialist

4 years ago

That's great. Thanks for sharing this. Is there anywhere specific I can find these questions?

Hamed Pouriayevali

Fuel Cell Stack Engineer | Flexible Graphite Bipolar Plates | Research & Development

4 years ago

I assume a ROC curve would be helpful here: it lets us choose the right threshold value according to whether we prioritize eliminating false negative or false positive predictions, depending on the nature of the problem.

