Q1 answer
Ali Ghandi
Lead Data Scientist | Passionate about Solving Complex Problems| Machine Learning, Deep Learning, NLP, and Generative AI
Imagine you have a classification problem (with 2 classes) in which you use logistic regression or any model that returns the probability of belonging to each class. As you may know, you apply a threshold to decide whether a sample belongs to the positive class or not. By default, most implementations use 0.5 as the threshold. Now the question is: how do you find the best threshold for your classification problem?!
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR): for each value of FPR, it shows the corresponding TPR. Moving along the curve effectively changes the classification threshold, so you can choose a threshold near the point where the curve starts to saturate. By adjusting the threshold, you trade recall against precision. Sometimes your problem imposes constraints: think of automatically issued penalty notices, where you need high precision rather than high recall. So plotting recall and precision against the threshold can help you find a better choice. In your classification problems, do not rely on the default threshold.
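As a sketch of the idea above, here is a minimal way to compute precision and recall over a grid of thresholds so you can inspect the trade-off yourself. The toy labels and probabilities are made up for illustration only.

```python
import numpy as np

def precision_recall_at_thresholds(y_true, y_prob, thresholds):
    """Return (precision, recall) arrays, one entry per candidate threshold."""
    precisions, recalls = [], []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        # Convention: precision is 1.0 when nothing is predicted positive
        precisions.append(tp / (tp + fp) if tp + fp else 1.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return np.array(precisions), np.array(recalls)

# Toy validation scores (illustrative only)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])
thresholds = np.linspace(0.1, 0.9, 9)
prec, rec = precision_recall_at_thresholds(y_true, y_prob, thresholds)
```

Plotting `prec` and `rec` against `thresholds` then shows directly where precision becomes acceptable for a precision-critical application.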
PhD Student in Physics | Data Science & Machine Learning
4y An important thing to note is not to touch this threshold at any cost. You can't just pick a threshold from the precision-recall plot that corresponds to your desired FP rate and use it! To fine-tune a logistic regression classifier, you should instead use class weights: plot the FP rate against the class weight and choose the proper class weight.
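To illustrate the class-weight idea from this comment, here is a minimal weighted logistic regression fitted by gradient descent (a stand-in for a library's `class_weight` option, not any particular implementation), evaluating the false positive rate at a few positive-class weights. The data and weight grid are made up for illustration.

```python
import numpy as np

def fit_weighted_logreg(X, y, pos_weight, lr=0.1, n_iter=2000):
    """Logistic regression via gradient descent; errors on the positive
    class are scaled by pos_weight (a minimal class-weight mechanism)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    sample_w = np.where(y == 1, pos_weight, 1.0)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = sample_w * (p - y)          # weighted residuals
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def false_positive_rate(X, y, w, b):
    pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) >= 0.5  # default threshold
    return np.sum(pred & (y == 0)) / np.sum(y == 0)

# Synthetic 1-D data: negatives around -1, positives around +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 1)), rng.normal(1, 1, (50, 1))])
y = np.array([0] * 50 + [1] * 50)

# Sweep the positive-class weight and record the FP rate at each setting
fprs = {cw: false_positive_rate(X, y, *fit_weighted_logreg(X, y, cw))
        for cw in (0.5, 1.0, 2.0)}
```

Up-weighting the positive class shifts the learned decision boundary toward the negative class, so the FP rate tends to rise with the weight; plotting `fprs` over a finer grid lets you pick the weight matching your FP budget.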
Mechanical/Energy engineering graduate
4y For sure there's going to be a trade-off between precision and recall (in other words, between minimizing false negatives and false positives), depending on the nature of the classification task, especially for skewed classes, e.g. cancer diagnosis. Accordingly, there's no unique prescription for the absolute best solution. However, there is an interesting, practical approach mentioned by Prof. Andrew Ng in his online Machine Learning course offered by Stanford University: simply try a range of thresholds, and pick whichever threshold gives you the highest F1 score on your cross-validation set. The F1 score is commonly defined as:
F1 = 2 * (P * R) / (P + R)
where P and R stand for precision and recall respectively:
P = TP / (TP + FP)
R = TP / (TP + FN)
Hope it was useful for you :)
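The sweep described in this comment can be sketched as follows; the grid and the toy validation scores are made up for illustration.

```python
import numpy as np

def best_threshold_by_f1(y_true, y_prob, thresholds):
    """Return (best_threshold, best_f1) over a grid of candidate thresholds."""
    best_t, best_f1 = thresholds[0], -1.0
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        p = tp / (tp + fp) if tp + fp else 0.0   # precision
        r = tp / (tp + fn) if tp + fn else 0.0   # recall
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy cross-validation scores (illustrative only)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.05, 0.3, 0.45, 0.8, 0.7, 0.2, 0.9, 0.55, 0.6, 0.1])
t, f1 = best_threshold_by_f1(y_true, y_prob, np.arange(0.05, 1.0, 0.05))
```

In practice the sweep should run on a held-out validation set, exactly as the comment suggests, so the chosen threshold does not overfit the training data.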
Data Scientist | AI Specialist
4y That's great, thanks for sharing this. Is there anywhere specific I can find these questions?
Fuel Cell Stack Engineer | Flexible Graphite Bipolar Plates | Research & Development
4y I assume using a ROC curve would be helpful: choose the right threshold value depending on whether we prioritize eliminating false negatives or false positives, given the nature of the problem.