Day 2 - Classification

After writing an exam, predicting whether you will pass or fail is a classification problem, while predicting the marks you will score is a regression problem.

We can look at several examples -

  1. Predicting a person having a heart attack
  2. Predicting a student's passing or failing
  3. Predicting default of a home mortgage

In all these cases, we analyze the dataset and not only predict the class of each case but also estimate the probability that the case belongs to a specific class.
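As a rough illustration of predicting both a class and a probability, here is a minimal sketch using scikit-learn. The hours-studied data is made up for this example and is not part of the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data invented for illustration: hours studied -> passed (1) or failed (0)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

new_students = np.array([[2.5], [6.5]])
predicted_class = clf.predict(new_students)        # hard label: 0 or 1
predicted_proba = clf.predict_proba(new_students)  # one column per class

for h, c, p in zip(new_students.ravel(), predicted_class, predicted_proba[:, 1]):
    print(f"hours={h}: class={c}, P(pass)={p:.2f}")
```

The class label is simply the probability cut at 0.5, which is why the choice of threshold becomes important later on.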

The metrics that determine the efficacy of a classification model are completely different from those used for regression. The key question is which error matters more: should false positives or false negatives be minimized?

Let's dive into one of the simplest classification models: logistic regression. Do not be confused by the term "regression" here; this is a classification model. It is very similar to linear regression (a regression model) but takes a discrete target field instead of a numerical one.

Logistic regression predicts the probability of an event occurring. It assumes that the relationship between the features and the log-odds of the target variable is linear, so this model is preferred when that relationship is approximately linear.
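To make the linear assumption concrete, the sketch below passes a linear score through the logistic (sigmoid) function. The intercept and coefficient are hypothetical values chosen for illustration, not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model: log-odds = intercept + coef * income
intercept, coef = -4.0, 0.05
income = np.array([40.0, 80.0, 120.0])

log_odds = intercept + coef * income   # linear in the feature
prob = sigmoid(log_odds)               # non-linear in the feature

for x, z, p in zip(income, log_odds, prob):
    print(f"income={x:.0f}: log-odds={z:+.1f}, probability={p:.3f}")
```

The probability curve is S-shaped, but taking log(p / (1 - p)) recovers the straight line, which is the sense in which the model is linear.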

This model can be used to predict whether a patient has a disease, whether a loan will default, and so on.

Decision trees are non-linear models that support both regression and classification. They split the data into smaller and smaller subsets until each subset contains only members of one class. At each step, the algorithm chooses a variable and splits the dataset on it.

Figure: Sample decision tree


Decision trees use metrics such as Gini impurity, entropy, and variance to measure which split is "best". They are used to classify images, predict employee churn, or diagnose diseases.
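As a sketch of how these metrics score a split, the functions below compute Gini impurity and entropy for a set of class labels and compare two candidate splits. The function names and toy labels are illustrative assumptions; this is not how scikit-learn implements it internally.

```python
import numpy as np

def gini_impurity(labels):
    """1 - sum(p_k^2): 0 for a pure node, 0.5 for a 50/50 binary node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """-sum(p_k * log2(p_k)): 0 for a pure node, 1 for a 50/50 binary node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0  # "+ 0.0" avoids printing -0.0

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
split_a = (np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]))  # separates the classes
split_b = (np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))  # leaves both children mixed

print("parent gini:", gini_impurity(parent), "entropy:", entropy(parent))
print("split A gini:", weighted_impurity(*split_a, gini_impurity))
print("split B gini:", weighted_impurity(*split_b, gini_impurity))
```

A tree-building algorithm would prefer split A here because it reduces impurity the most.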

The example below asks which customer profile the bank should target for loans, or in other words which customer profile is likely to take a personal loan.

import warnings
warnings.filterwarnings("ignore")


# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np


# Library to split data
from sklearn.model_selection import train_test_split


# Libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree


# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks")

# Remove the limit on displayed columns and cap displayed rows at 200
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)

# Libraries for statistical analysis
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# To get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    roc_auc_score,
    roc_curve,
    confusion_matrix,
    precision_recall_curve,
    f1_score,
)
from sklearn.model_selection import GridSearchCV

# Load the raw data (replace the placeholder with the path to your file)
data = pd.read_csv("path to the file that has the raw data")

Visualize the first 10 rows

data.head(10)

The dataset has five binary variables:

  • Personal Loan - This is our target variable and denotes whether the customer accepted the loan
  • Securities Account - Does the customer have a securities account with the bank?
  • CD Account - Does the customer have a CD account with the bank?
  • Online - Does the customer use online banking?
  • Credit Card - Does the customer use a credit card issued by the same bank?

Some of the columns that can influence the acceptance of the loan are as follows:

  • Age - Age of the customer
  • Experience - Years of experience
  • Income - Annual income in dollars
  • CredCardAvg - Average credit card spending
  • Mortgage - Mortgage value outstanding

Categorical Variables are:

  • Family - Family size of the customer
  • Education - education level of the customer

The categorical variables are already in numerical format

Nominal variables are:

  • ID
  • zipcode

As we progress into model building, we will drop the ID column, as it holds no predictive significance.

In addition, we can look at the shape of the data, duplicates, null values, unique values, and the five-point summary.

# Shape of the data (rows and columns)
print("Shape {0}".format(data.shape))
print("\nInfo \n")
data.info()  # prints its own summary and returns None

# Duplicates, nulls, unique values
data[data.duplicated()].count()
data.isnull().values.any()
data.nunique()

# How many rows fall into each education level
data['Education'].value_counts()
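The five-point summary mentioned above is not shown in the snippet. One way to get it, sketched here on a small made-up frame that borrows two column names from the dataset (the values are invented), is to pick the relevant rows out of `describe()`:

```python
import pandas as pd

# Toy frame invented for illustration; column names follow the article's dataset
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55, 65],
    "Income": [40, 60, 80, 100, 200],
})

# Five-point summary: minimum, quartiles, and maximum of each numeric column
five_point = toy.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five_point)

# More direct counts of duplicate rows and missing values per column
n_duplicates = toy.duplicated().sum()
n_missing = toy.isnull().sum()
print("duplicates:", n_duplicates)
print(n_missing)
```

A large gap between the 75% value and the maximum, as in the toy Income column, is a quick hint that outliers may need attention.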

Pertinent questions

  1. What do you do with columns such as zipcode?
  2. How do you handle negatives in a column?
  3. How are outliers treated?
  4. What metric should be considered? (This should be driven by the business outcome.)
  5. What correlated features should be dropped?
  6. What model in classification should be selected?

Which case is more important?

  1. Predicting a customer will take the personal loan and the customer doesn't take the loan
  2. Predicting a customer will not take the personal loan and the customer takes the loan

In general both cases are important, but here we will focus on the second case.

  • Recall is the metric of choice, and model selection should favor higher recall values; in other words, we need to reduce false negatives.

A few concepts worth remembering:

  • True Positives:
      • Reality: The customer took the loan
      • Model predicted: The customer will take the loan
      • Outcome: The model is good
  • True Negatives:
      • Reality: The customer did NOT take the loan
      • Model predicted: The customer will not take the loan
      • Outcome: The business is unaffected
  • False Positives:
      • Reality: The customer did NOT take the loan
      • Model predicted: The customer will take the loan
      • Outcome: Need to investigate specifically why the customer did not take the loan
  • False Negatives:
      • Reality: The customer took the loan
      • Model predicted: The customer will not take the loan and should not be offered one
      • Outcome: Close to 7% of the customers fall into this category; this needs more investigation
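The four outcomes above map directly onto a confusion matrix. The sketch below uses made-up labels (1 = took the loan) to show how recall and precision fall out of those counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = customer took the loan, 0 = did not
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of actual loan takers the model caught
precision = tp / (tp + fp)   # share of predicted loan takers who really took it

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Reducing false negatives raises recall, which is why recall is the metric to watch when missing a likely borrower is the costlier error.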

After completing EDA -

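# Split train and test datasets
# (X holds the features and Y the target, both prepared during EDA)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

model = LogisticRegression(random_state=1)
lg = model.fit(X_train, y_train)

# Create the confusion matrix
# (make_confusion_matrix and get_metrics_score are helper functions not shown in this article)
make_confusion_matrix(lg, X_test, y_test)

# Check the metric scores
scores_LR = get_metrics_score(lg, X_train, X_test, y_train, y_test, flag=True)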
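The `get_metrics_score` helper is not defined in this article. A hypothetical stand-in, assuming it reports accuracy, recall, and precision on both splits at a given probability threshold, might look like the sketch below. The name `sketch_metrics_score` and the synthetic data are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def sketch_metrics_score(model, X_train, X_test, y_train, y_test, threshold=0.5):
    """Hypothetical stand-in for the article's helper: score both splits at a threshold."""
    scores = {}
    for name, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
        proba = model.predict_proba(X)[:, 1]          # P(class 1) for each row
        pred = (proba >= threshold).astype(int)       # apply the custom cut-off
        scores[name] = {
            "accuracy": accuracy_score(y, pred),
            "recall": recall_score(y, pred),
            "precision": precision_score(y, pred, zero_division=0),
        }
    return scores

# Synthetic, imbalanced data standing in for the loan dataset
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

default_scores = sketch_metrics_score(model, X_tr, X_te, y_tr, y_te)
low_scores = sketch_metrics_score(model, X_tr, X_te, y_tr, y_te, threshold=0.2)
print("threshold 0.5:", default_scores["test"])
print("threshold 0.2:", low_scores["test"])
```

Lowering the threshold can only add predicted positives, so recall never drops when the cut-off is reduced; the cost usually shows up in precision.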
# How can we further optimize? Pick the threshold from the ROC curve.
# The optimal cut-off is where tpr is high and fpr is low
fpr, tpr, thresholds = metrics.roc_curve(y_test, lg.predict_proba(X_test)[:,1])


optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
        

Output:

0.18131736558627232
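The `np.argmax(tpr - fpr)` step picks the threshold that maximizes the gap between the true positive rate and the false positive rate (often called Youden's J statistic). A minimal sketch on made-up scores shows the idea and how the chosen threshold is then applied:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and predicted probabilities for illustration
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, scores)

j_statistic = tpr - fpr                 # gap between TPR and FPR at each threshold
best_idx = np.argmax(j_statistic)
best_threshold = thresholds[best_idx]

# Apply the threshold: anything scoring at or above it is predicted positive
y_pred = (scores >= best_threshold).astype(int)
print("chosen threshold:", best_threshold)
print("predictions:", y_pred)
```

In this toy case one positive has a score below 0.5, so the default cut-off would miss it while the ROC-based threshold catches it at the price of one extra false positive.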
        


scores_LR = get_metrics_score(lg,X_train,X_test,y_train,y_test,threshold=optimal_threshold_auc_roc,roc=True)
        


Accuracy on training set :  0.897428571428571
Accuracy on test set :  0.902
Recall on training set :  0.5912386706948641
Recall on test set :  0.5590604026845637
Precision on training set :  0.5773462783171521
Precision on test set :  0.6039370078740157        

As you can see, recall has improved considerably compared with the earlier run.

What could be the optimal threshold?

Figure: The optimal threshold is around the point where the curves intersect, at about 0.3
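The plot itself is not reproduced here. Assuming it showed precision and recall against the decision threshold (an assumption based on the `precision_recall_curve` import), the crossing point could be located roughly as follows. The synthetic data and the choice to read the threshold from the training split are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the loan dataset
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = model.predict_proba(X_tr)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_tr, proba)

# precision and recall have one more entry than thresholds, so drop the last point
gap = np.abs(precision[:-1] - recall[:-1])
crossing_threshold = thresholds[np.argmin(gap)]
print("precision and recall are closest at threshold:", crossing_threshold)
```

The value found this way depends on the data, so it will not match the 0.3 read off the article's plot.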


# Threshold read off the plot above (about 0.3)
optimal_threshold_curve = 0.3
scores_LR = get_metrics_score(lg, X_train, X_test, y_train, y_test, threshold=optimal_threshold_curve, roc=True)
        


Accuracy on training set :  0.938571428571428
Accuracy on test set :  0.9393333333333334
Recall on training set :  0.650392749244713
Recall on test set :  0.6216778523489933
Precision on training set :  0.6012081218274112
Precision on test set :  0.6046987951807228        

At a threshold of 0.3, test recall rises to about 0.62 (from about 0.56 at the ROC-based threshold), and the model performs much better than stand-alone logistic regression.

Similarly, you can run the same tests for the following variations of the model:

  1. Logistic regression with scaled features (transform the independent variables using techniques such as standardization)
  2. Logistic regression with a Sequential Feature Selector (a greedy algorithm that adds or removes features based on the metric score)

Review the recall on the training and test sets, and plot the ROC curve for both. The model with the best scores is the one to choose.
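Both variations can be sketched with scikit-learn pipelines. The example below is an assumption about how they might be set up: it uses synthetic data and scikit-learn's own `SequentialFeatureSelector`, which may differ from the implementation used for the article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the loan dataset
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)

# 1. Logistic regression on standardized features
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

# 2. Greedy forward selection of 4 features, scored by cross-validated recall
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4, direction="forward", scoring="recall", cv=3,
)
sfs_lr = make_pipeline(StandardScaler(), selector, LogisticRegression(max_iter=1000))
sfs_lr.fit(X_tr, y_tr)

for name, m in (("scaled", scaled_lr), ("scaled + SFS", sfs_lr)):
    train_recall = recall_score(y_tr, m.predict(X_tr))
    test_recall = recall_score(y_te, m.predict(X_te))
    print(f"{name}: train recall={train_recall:.3f}, test recall={test_recall:.3f}")

selected = sfs_lr.named_steps["sequentialfeatureselector"].get_support()
print("selected feature mask:", selected)
```

Putting the scaler and selector inside the pipeline keeps them fitted on the training split only, so the test recall stays an honest estimate.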

Day 3: Decision Trees and wrap-up on Classification












