Day 2 - Classification
After writing an exam, predicting whether you will pass or fail is a classification problem, while predicting the marks you will score is a regression problem.
We can look at several examples -
In all these cases, we analyze the dataset and not only predict the class of each case, but also estimate the probability that a case belongs to a specific class.
The metrics used to judge the efficacy of the model depend on the problem: do you want to reduce the false positives or the false negatives? In short, which kind of error should be minimized?
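To make the false positive / false negative trade-off concrete, here is a minimal sketch using scikit-learn's metrics. The labels are made up for illustration and are not taken from the loan data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision drops as false positives rise; recall drops as false negatives rise
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"FP={fp}, FN={fn}, precision={precision:.2f}, recall={recall:.2f}")
```

If false negatives are the costly error, recall is the metric to watch; if false positives are costlier, watch precision.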
Diving into one of the simplest classification models - Logistic regression
Logistic regression predicts the probability that an event occurs. It assumes that the relationship between the features and the log-odds of the target variable is linear, so this model is preferred when that relationship holds.
This model can be used to predict whether a patient has a disease, whether a borrower will default on a loan, etc.
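As a rough illustration of how the probability comes about, the sketch below fits a logistic regression on synthetic data (an assumption, not the loan dataset) and rebuilds the predicted probability by hand from the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (for illustration only): one feature, and the positive
# class becomes more likely as the feature grows
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# The model is linear in the log-odds: z = intercept + coef * x
z = model.intercept_[0] + model.coef_[0, 0] * X[:, 0]
# The sigmoid maps the log-odds to a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class; column 1 is P(y = 1)
sklearn_proba = model.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```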
Decision trees
This model uses one of several metrics to find the "best" split at each node, and is used to classify images, predict employee churn, or diagnose diseases. Gini impurity, entropy and variance are popular metrics.
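For intuition, Gini impurity and entropy can be computed by hand for the labels that land in a node. This is a minimal sketch; the function names are illustrative and not part of scikit-learn. Variance is left out because it applies to numeric targets, as in regression trees.

```python
import numpy as np

def class_proportions(labels):
    # Share of each class among the labels in a node
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    # 1 - sum(p_k^2): zero for a pure node, larger for mixed nodes
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # -sum(p_k * log2(p_k)): zero for a pure node, 1 bit for a 50/50 binary node
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

pure_node = [1, 1, 1, 1]
mixed_node = [0, 0, 1, 1]

print(gini_impurity(pure_node), entropy(pure_node))
print(gini_impurity(mixed_node), entropy(mixed_node))
```

A split is considered better when the child nodes it produces are purer, that is, have lower impurity than the parent.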
This is an example of deciding which customer profile the bank should target for loans, or perhaps which customer profile will take up a personal loan.
import warnings
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit on the number of displayed columns and caps displayed rows at 200.
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)
# Libraries for statistical analysis
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# To get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    roc_auc_score,
    roc_curve,
    confusion_matrix,
    precision_recall_curve,
    f1_score,
)
from sklearn.model_selection import GridSearchCV
data = pd.read_csv("path to the file that has the raw data")
Visualize the first 10 rows
The binary category has five variables as below:
Some of the columns that can influence the acceptance of the loan are as follows:
Categorical Variables are:
The categorical variables are already in numerical format
Nominal variables are:
As we progress into model building, we will drop the ID column, as it holds no significance for the prediction.
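Dropping the ID and separating the features from the target might look like the sketch below. The stand-in DataFrame and every column name other than ID are hypothetical; the real file will have its own columns.

```python
import pandas as pd

# Small stand-in for the loan data; apart from ID, the column names here
# are hypothetical and will differ in the real file
data = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [49, 34, 112, 180],
    "Personal_Loan": [0, 0, 1, 1],
})

# ID only identifies the row, so it carries no signal for the model
data = data.drop(columns=["ID"])

# Features (X) and target (Y), named to match the later train_test_split call
X = data.drop(columns=["Personal_Loan"])
Y = data["Personal_Loan"]

print(X.columns.tolist())
```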
In addition, we can look at the shape of the data, duplicates, null values, unique values and the five-number summary.
# Shape of the data (Cols and Row information)
print("Shape {0}".format(data.shape))
print("\nInfo \n")
# Dupes, nulls, unique
# How many rows
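The checks listed above can be sketched with standard pandas calls. The small DataFrame below is a made-up stand-in so the snippet runs on its own; with the real data you would skip that step and use the `data` frame loaded earlier.

```python
import pandas as pd

# Hypothetical stand-in data with one duplicated row and one missing value
data = pd.DataFrame({
    "ID": [1, 2, 3, 4, 4],
    "Income": [49.0, 34.0, None, 180.0, 180.0],
    "Family": [4, 3, 1, 2, 2],
})

# Shape: (rows, columns)
n_rows, n_cols = data.shape
print(f"Rows: {n_rows}, columns: {n_cols}")

# Column types and non-null counts
data.info()

# Fully duplicated rows, missing values per column, distinct values per column
n_duplicates = data.duplicated().sum()
nulls_per_column = data.isnull().sum()
unique_per_column = data.nunique()
print(n_duplicates, nulls_per_column, unique_per_column, sep="\n")

# describe() includes min, quartiles and max (the five-number summary)
summary = data.describe().T
print(summary)
```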
Pertinent questions
Which case is more important?
In general both cases are important, but here we will try to address the second use case.
A few concepts worth remembering:
After completing EDA -
# Split train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
model = LogisticRegression(random_state=1)
lg = model.fit(X_train, y_train)
# creating confusion matrix
make_confusion_matrix(lg, X_test, y_test)
# checking the metric scores
scores_LR = get_metrics_score(lg, X_train, X_test, y_train, y_test, flag=True)
# How can we further optimize?
# as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = metrics.roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
scores_LR = get_metrics_score(lg,X_train,X_test,y_train,y_test,threshold=optimal_threshold_auc_roc,roc=True)
Accuracy on training set : 0.897428571428571
Accuracy on test set : 0.902
Recall on training set : 0.5912386706948641
Recall on test set : 0.5590604026845637
Precision on training set : 0.5773462783171521
Precision on test set : 0.6039370078740157
As you can see, the recall has improved considerably.
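The helper `get_metrics_score` is not defined in this post, so here is a self-contained sketch of the same idea: pick the cut-off that maximizes tpr - fpr and compare recall against the default 0.5 threshold. The data is synthetic (an assumption), so the numbers will not match the scores above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the loan dataset (an assumption)
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

lg = LogisticRegression(max_iter=1000, random_state=1).fit(X_train, y_train)
proba_test = lg.predict_proba(X_test)[:, 1]

# Cut-off where tpr is high and fpr is low, as in the snippet above
fpr, tpr, thresholds = roc_curve(y_test, proba_test)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

# Turn probabilities into classes at the default and at the tuned cut-off
pred_default = (proba_test >= 0.5).astype(int)
pred_tuned = (proba_test >= optimal_threshold).astype(int)

recall_default = recall_score(y_test, pred_default)
recall_tuned = recall_score(y_test, pred_tuned)
print(f"threshold={optimal_threshold:.3f}")
print(f"recall at 0.5={recall_default:.3f}, recall at tuned cut-off={recall_tuned:.3f}")
```

On imbalanced data the tuned cut-off often falls below 0.5, which tends to raise recall at the cost of some precision.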
What could be the optimal threshold?
scores_LR = get_metrics_score(lg,X_train,X_test,y_train,y_test,threshold=optimal_threshold_curve,roc=True)
Accuracy on training set : 0.938571428571428
Accuracy on test set : 0.9393333333333334
Recall on training set : 0.650392749244713
Recall on test set : 0.6216778523489933
Precision on training set : 0.6012081218274112
Precision on test set : 0.6046987951807228
The model performs very well at a threshold of 0.3, and much better than the stand-alone logistic regression.
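The code that produced `optimal_threshold_curve` is not shown here. One common way to choose such a threshold, which may or may not be what was used for these scores, is to look for the point where the precision and recall curves cross. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the loan dataset (an assumption)
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
lg = LogisticRegression(max_iter=1000, random_state=1).fit(X_train, y_train)

# Precision and recall at every candidate threshold, computed on the training set
proba_train = lg.predict_proba(X_train)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, proba_train)

# precision and recall have one more entry than thresholds, so drop the last point
gap = np.abs(precision[:-1] - recall[:-1])
crossover_threshold = thresholds[np.argmin(gap)]
print(f"precision and recall are closest at threshold {crossover_threshold:.3f}")
```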
Similarly, we can run the tests for the following changes to the model:
Review the recall on the training and test sets, and plot the ROC-AUC curve for both the train and test datasets. The model with the best scores is the obvious choice.
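Comparing train and test scores side by side could be sketched as below. The helper `recall_and_auc` is a hypothetical name, and the data is synthetic, so this only shows the shape of the comparison.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the loan dataset (an assumption)
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
lg = LogisticRegression(max_iter=1000, random_state=1).fit(X_train, y_train)

def recall_and_auc(model, X_part, y_part, threshold=0.5):
    # Hypothetical helper: recall at a given cut-off plus threshold-free ROC AUC
    proba = model.predict_proba(X_part)[:, 1]
    pred = (proba >= threshold).astype(int)
    return recall_score(y_part, pred), roc_auc_score(y_part, proba)

train_recall, train_auc = recall_and_auc(lg, X_train, y_train)
test_recall, test_auc = recall_and_auc(lg, X_test, y_test)

print(f"train: recall={train_recall:.3f}, ROC AUC={train_auc:.3f}")
print(f"test:  recall={test_recall:.3f}, ROC AUC={test_auc:.3f}")
```

A large gap between the train and test rows would suggest overfitting, which is worth checking before picking a model on scores alone.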
Day 3: Decision Trees and wrap-up on Classification