How good is your model?

Metrics for classification

The performance of a k-NN classifier is often measured by its accuracy. However, accuracy is not always an informative metric. We will dive deeper into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.

Classification metrics

● Measuring model performance with accuracy:

● Fraction of correctly classified samples

● Not always a useful metric

Class imbalance example: Emails

● Spam classification

● 99% of emails are real; 1% of emails are spam

● Could build a classifier that predicts ALL emails as real

● 99% accurate!

● But horrible at actually classifying spam

● Fails at its original purpose

● Need more nuanced metrics
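The spam example above can be sketched numerically. This is a toy illustration (the 1000 synthetic labels are an assumption, not real email data): a classifier that predicts every email as real scores 99% accuracy but has zero recall on the spam class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels mimicking the email example: 99% real (0), 1% spam (1)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1  # 10 spam emails out of 1000

# A "classifier" that predicts ALL emails as real
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(acc)  # 0.99 -- looks great
print(rec)  # 0.0  -- catches no spam at all
```

High accuracy here is purely an artifact of class imbalance, which is exactly why precision, recall, and the f1-score are needed.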

You may have noticed that the classification report consisted of three rows, and an additional support column. The support gives the number of samples of the true response that lie in that class - the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.
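To make the report's columns concrete, here is a minimal sketch using hypothetical party labels (these seven labels are invented for illustration, not the actual course data). The support column for each row is simply the count of true labels in that class:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (assumed data for illustration)
y_true = ['dem', 'dem', 'dem', 'rep', 'rep', 'rep', 'rep']
y_pred = ['dem', 'dem', 'rep', 'rep', 'rep', 'rep', 'dem']

cm = confusion_matrix(y_true, y_pred, labels=['dem', 'rep'])
print(cm)
# Rows are true classes, columns are predicted classes,
# so each row sums to that class's support (3 dems, 4 reps)
print(classification_report(y_true, y_pred))
```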

Here, you'll work with the PIMA Indians dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. The dataset has been pre-processed to deal with missing values.

The dataset has been loaded into a DataFrame df and the feature and target variable arrays X and y have been created for you. In addition, sklearn.model_selection.train_test_split and sklearn.neighbors.KNeighborsClassifier have already been imported.

We will train a k-NN classifier to the data and evaluate its performance by generating a confusion matrix and classification report.

Import classification_report and confusion_matrix from sklearn.metrics.

Create training & testing sets with 40% of data used for testing. Use a random state of 42.

Instantiate a k-NN classifier with 6 neighbors, fit it to the training data, and predict the labels of the test set.

Compute and print the confusion matrix and classification report using the confusion_matrix() and classification_report() functions.
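The steps above can be sketched end to end. Since the pre-processed DataFrame df is not available here, a synthetic stand-in for X and y is used (the 768x8 random feature matrix and derived target are assumptions purely so the sketch runs); in the exercise itself, X and y come from the loaded PIMA data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in for the pre-processed PIMA features and target (assumed data);
# in the exercise, X and y are already created from df
rng = np.random.default_rng(42)
X = rng.normal(size=(768, 8))             # 8 features: BMI, age, pregnancies, ...
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary target

# 40% of the data held out for testing, with a random state of 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Instantiate a k-NN classifier with 6 neighbors, fit, and predict
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Confusion matrix and classification report on the test set
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

Each row of the confusion matrix is a true class (no diabetes, diabetes) and each column a predicted class, so the report's support values match the row sums.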

By analyzing the confusion matrix and classification report together, you get a much better understanding of your classifier's performance than accuracy alone can provide.



