Why Calculate Both Accuracy and AUC in an ML Experiment?

"In the world of machine learning, accuracy is merely what the model tells you it can do. AUC reveals what it's truly capable of. Together, they tell the complete story of your model's performance—one that can mean the difference between a solution that merely works and one that transforms your business."

  1. First Code Snippet (Accuracy): Uses model.predict() to get the predicted labels (y_hat), then calculates accuracy by comparing those predictions with the actual test labels (y_test). Accuracy measures the proportion of correct predictions among all predictions. Output: "Accuracy: 0.774" means 77.4% of predictions match the actual values.
  2. Second Code Snippet (AUC): Uses model.predict_proba() to get prediction probabilities rather than just the predicted class, then calculates AUC (Area Under the ROC Curve) with the roc_auc_score function. AUC measures the model's ability to distinguish between classes across all possible classification thresholds. Output: "AUC: 0.8484392253321388" means the model has about an 85% probability of ranking a random positive sample higher than a random negative sample. (A sketch of both snippets follows this list.)
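
The original snippets are not reproduced on this page, so here is a minimal sketch of what they might look like, assuming a scikit-learn classifier named model and a held-out test split X_test / y_test (all placeholder names):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Snippet 1: accuracy from hard class predictions
y_hat = model.predict(X_test)                # predicted labels (0/1)
print("Accuracy:", accuracy_score(y_test, y_hat))

# Snippet 2: AUC from predicted probabilities of the positive class
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
print("AUC:", roc_auc_score(y_test, y_proba))
```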

Why calculate both?

These metrics provide different insights about model performance:

  • Accuracy tells you the overall correctness but can be misleading with imbalanced data
  • AUC evaluates the model's ability to discriminate between classes regardless of the threshold chosen, making it more robust for imbalanced datasets

For example, in a dataset where 95% of samples are negative, a model that always predicts "negative" would have 95% accuracy but an AUC of 0.5 (no better than random guessing). Using both metrics gives you a more complete understanding of model performance.
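
That claim is easy to verify with a quick, hypothetical sketch: 95 negative labels, 5 positive labels, and a "model" that always predicts negative with a constant score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the negative class
y_pred = np.zeros(100, dtype=int)  # hard predictions: all 0
y_score = np.zeros(100)            # identical score for every sample

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95
print("AUC:", roc_auc_score(y_true, y_score))       # 0.5 (no ranking ability)
```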

The fact that this model has both good accuracy and a good AUC suggests it is performing well at correctly classifying samples and at ranking positive samples higher than negative ones.

Going a little deeper with some sample data

Imagine you're a doctor trying to determine which patients have diabetes using a new screening test. You test 10 patients and record the following:

Actual Patient Status:

  • Patients 1-3: Have diabetes
  • Patients 4-10: Don't have diabetes

Test Results (Probability of Having Diabetes):

  • Patient 1: 0.85 (85%)
  • Patient 2: 0.70 (70%)
  • Patient 3: 0.55 (55%)
  • Patient 4: 0.60 (60%)
  • Patient 5: 0.40 (40%)
  • Patient 6: 0.30 (30%)
  • Patient 7: 0.25 (25%)
  • Patient 8: 0.20 (20%)
  • Patient 9: 0.15 (15%)
  • Patient 10: 0.10 (10%)

If you set your classification threshold at 0.50 (50%):

  • Predicted to have diabetes: Patients 1, 2, 3, 4
  • Predicted not to have diabetes: Patients 5, 6, 7, 8, 9, 10

Accuracy Calculation:

  • Correct predictions: Patients 1, 2, 3 (true positives) + Patients 5, 6, 7, 8, 9, 10 (true negatives) = 9 patients
  • Total patients: 10
  • Accuracy = 9/10 = 0.90 or 90%
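
A small sketch reproduces this calculation, with the ten patients encoded as plain Python lists and the threshold fixed at 0.50:

```python
from sklearn.metrics import accuracy_score

# Actual status: patients 1-3 have diabetes (1), patients 4-10 do not (0)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Test results: probability of having diabetes for patients 1-10
y_proba = [0.85, 0.70, 0.55, 0.60, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10]

# Apply the 0.50 threshold to get hard predictions
y_pred = [1 if p >= 0.50 else 0 for p in y_proba]

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.9
```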

AUC Understanding: AUC measures how well your test ranks patients with diabetes higher than patients without diabetes. A perfect test would give all diabetes patients higher probabilities than all non-diabetes patients.

In our example, the test ranked patients as: 1 > 2 > 4 > 3 > 5 > 6 > 7 > 8 > 9 > 10

Notice there's one error in ranking: Patient 4 (who doesn't have diabetes) got a higher probability (0.60) than Patient 3 (who has diabetes, 0.55).

The AUC would be less than 1.0 because of this error, but still high: of the 3 × 7 = 21 pairs of one diabetes patient and one non-diabetes patient, 20 are ranked correctly, giving an AUC of 20/21 ≈ 0.95.
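
Continuing the same sketch, the AUC can be checked directly from the probabilities, with no threshold involved:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_proba = [0.85, 0.70, 0.55, 0.60, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10]

# AUC = fraction of (diabetes, non-diabetes) pairs ranked correctly = 20/21
print("AUC:", roc_auc_score(y_true, y_proba))  # ~0.952
```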

This example shows why both metrics matter:

  • Accuracy (90%) tells you the proportion of correct diagnoses
  • AUC tells you how well the test distinguishes between patients with and without diabetes, regardless of the threshold you choose

What to do with this?

Threshold Optimization

  • Select Classification Threshold: If accuracy is 77.4% and AUC is 84.8%, a data scientist might experiment with different probability thresholds to find the optimal balance between sensitivity and specificity (a threshold-sweep sketch follows this list).
  • Business-Driven Decisions: Adjust thresholds based on the relative cost of false positives versus false negatives (e.g., in medical diagnosis, false negatives might be more costly than false positives).
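
As a hedged sketch of that threshold sweep, assuming the y_test and y_proba arrays from the earlier snippet, one common heuristic is to pick the ROC-curve point that maximizes Youden's J statistic (sensitivity + specificity - 1):

```python
import numpy as np
from sklearn.metrics import roc_curve

# fpr = 1 - specificity, tpr = sensitivity, one entry per candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Youden's J statistic: tpr - fpr
j_scores = tpr - fpr
best = np.argmax(j_scores)

print("Best threshold:", thresholds[best])
print("Sensitivity:", tpr[best], "Specificity:", 1 - fpr[best])
```

In a business-driven setting you would replace Youden's J with a cost function that weights false positives and false negatives according to their real-world impact.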

Model Refinement

  • Feature Engineering: If metrics aren't satisfactory, explore new features or transformations to improve discriminative power.
  • Hyperparameter Tuning: Adjust model parameters to improve performance on these metrics.
  • Address Class Imbalance: If accuracy is high but AUC is lower, this might indicate class imbalance issues requiring techniques like resampling or weighted classes (see the class-weighting sketch after this list).
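
For the weighted-classes option, here is a minimal sketch assuming a scikit-learn logistic regression and a training split named X_train / y_train (both placeholders, since the original model isn't specified):

```python
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' re-weights samples inversely to class frequency,
# so errors on the minority class count for more during training
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
```

After refitting, recompute both accuracy and AUC on the test set to see whether the trade-off moved in the right direction.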
