Day 2 - Classification

Kiran K.

Published: June 18, 2023

After writing an exam, predicting if you would pass or fail constitutes a classification problem while quantifying the marks scored is a regression problem.

We can look at several examples -

Predicting a person having a heart attack
Predicting a student's passing or failing
Predicting default of a home mortgage

In all these cases, we analyze the dataset and not only predict the class of each case, but also estimate the probability that a case belongs to a specific class.
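As a minimal sketch of the difference between a predicted class and a predicted probability, the snippet below fits scikit-learn's LogisticRegression on a tiny pass/fail dataset. The hours-studied numbers are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied -> passed (1) or failed (0)
hours = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

new_students = np.array([[2.5], [7.5]])
predicted_class = clf.predict(new_students)        # hard labels, 0 or 1
predicted_proba = clf.predict_proba(new_students)  # one column per class

print(predicted_class)
print(predicted_proba[:, 1])  # estimated probability of class 1 (pass)
```

predict returns only the label, while predict_proba exposes the probability behind it, which is what makes threshold tuning possible later on.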

The metrics used to determine the efficacy of a classification model are different from those for regression, and the choice depends on the goal: do you want to reduce false positives or false negatives? In short, which kind of error should be minimized?

Let's dive into one of the simplest classification models: logistic regression. Do not be confused by the term "regression" here. This is a classification model, very similar to linear regression (a regression model), but it takes a discrete target field instead of a numerical one.

Logistic regression predicts the probability of an event occurring. It assumes that the relationship between the features and the log-odds of the target is linear, so the model is highly preferred when that relationship holds.
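To make the probability idea concrete, here is a small sketch of how logistic regression turns a linear combination of features into a probability through the sigmoid function. The weights and intercept are hypothetical values chosen only for illustration, not fitted to any data.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, intercept):
    """Logistic regression: probability = sigmoid(intercept + w . x)."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)


# Hypothetical coefficients, chosen only for illustration
weights = [0.8, -0.5]
intercept = -1.0

p = predict_probability([2.0, 1.0], weights, intercept)
print(round(p, 3))
```

A log-odds of zero maps to a probability of exactly 0.5, which is why 0.5 is the usual default cut-off between the two classes.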

This model can be used to predict whether a patient has a disease, whether a loan will default, and so on.

Decision trees are non-linear models that support both regression and classification. They split the data into smaller and smaller subsets until each subset contains only members of one class (or a stopping criterion is reached). At each step, the algorithm chooses a variable and splits the dataset on it.

Figure: Sample decision tree

The algorithm uses a metric to measure the "best" split at each step. Gini impurity and entropy are popular metrics for classification, and variance for regression. Decision trees are used to classify images, predict employee churn, or diagnose diseases.
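As a rough illustration of how a split is scored, the sketch below computes Gini impurity and entropy for a small made-up set of labels and then compares the parent node with one candidate split. The function names are illustrative, not a library API.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity of a split: size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]  # one candidate split

print(gini_impurity(parent))
print(weighted_impurity(left, right, gini_impurity))
```

A tree builder would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.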

The example below looks at which customer profile a bank should target for loans, or in other words, which customer profile is likely to take up a personal loan.

import warnings
warnings.filterwarnings("ignore")


# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np


# Library to split data
from sklearn.model_selection import train_test_split


# Libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree


# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks")

# Remove the limit on the number of displayed columns and cap displayed rows at 200.
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)

# Libraries for statistical analysis
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# To get different metric scores
from sklearn import metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, roc_curve, confusion_matrix, precision_recall_curve, f1_score
from sklearn.model_selection import GridSearchCV

# Load the raw data (replace the placeholder with the path to your CSV file)
data = pd.read_csv("path to the file that has the raw data")

Visualize the first 10 rows

data.head(10)

There are five binary variables:

Personal Loan - This is our target variable and denotes whether the customer accepted the loan
Securities Account - Does the customer have a securities account with the bank?
CD Account - Does the customer have a CD account with the bank?
Online - Does the customer use online banking?
Credit Card - Does the customer use a credit card issued by the same bank?

Some of the columns that can influence the acceptance of the loan are as follows:

Age - Age of the customer
Experience - Years of experience
Income - Annual income in dollars
CredCardAvg - Average credit card spending
Mortgage - Mortgage value outstanding

Categorical Variables are:

Family - Family size of the customer
Education - education level of the customer

The categorical variables are already in numerical format

Nominal variables are:

ID
zipcode

As we progress into model building, we will drop the ID column, as it holds no predictive significance.
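A minimal sketch of dropping the identifier and separating features from the target is shown below. The tiny DataFrame is made up, and the column names follow the descriptions above, so they are assumptions about how the real CSV labels its columns.

```python
import pandas as pd

# Tiny made-up frame; column names follow the descriptions above and are
# assumptions about how the real CSV labels them
data = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age": [25, 45, 39, 35],
    "Income": [49, 34, 11, 100],
    "Personal Loan": [0, 0, 0, 1],
})

# ID is only a row identifier, so it carries no predictive signal
data = data.drop(columns=["ID"])

# Separate the features (X) from the target (Y)
X = data.drop(columns=["Personal Loan"])
Y = data["Personal Loan"]

print(X.columns.tolist())
print(Y.value_counts().to_dict())
```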

In addition, we can look at the shape of the data, duplicates, null values, unique value counts, and the five-number summary.

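# Shape of the data (row and column counts)
print("Shape {0}".format(data.shape))
print("\nInfo \n")
data.info()  # info() prints its report directly and returns None

# Duplicates, nulls, unique values
data[data.duplicated()].count()
data.isnull().values.any()
data.nunique()

# Five-number summary of the numeric columns
data.describe().T

# How many rows fall into each education level
data['Education'].value_counts()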

Pertinent questions

What do you do with columns such as zipcode?
How do you handle negatives in a column?
How are outliers treated?
Which metric should be considered? (This should be driven by the business outcome.)
Which correlated features should be dropped?
Which classification model should be selected?

Which case is more important?

Predicting a customer will take the personal loan and the customer doesn't take the loan
Predicting a customer will not take the personal loan and the customer takes the loan

In general both cases are important, but here we will focus on the second one: a customer who would have taken the loan but was predicted not to is a missed opportunity (a false negative).

Recall is the metric of choice, and model selection should be based on higher recall values. In other words, we need to reduce false negatives.

A few concepts worth remembering:

True Positives:
Reality: The customer took the loan
Model predicted: The customer will take the loan
Outcome: The model is good

True Negatives:
Reality: The customer did NOT take the loan
Model predicted: The customer will not take the loan
Outcome: The business is unaffected

False Positives:
Reality: The customer did NOT take the loan
Model predicted: The customer will take the loan
Outcome: Need to investigate specifically why the customer did not take the loan

False Negatives:
Reality: The customer took the loan
Model predicted: The customer will not take the loan
Outcome: Close to 7% of the customers fall into this category; this needs more investigation
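The four outcomes above can be counted directly with scikit-learn. The sketch below uses made-up labels (1 means the customer took the loan) and shows how recall and precision follow from the counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = took the loan, 0 = did not
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)

print(tn, fp, fn, tp)
print(recall, precision)
```

Recall is the share of actual loan takers the model caught, so every false negative lowers it, which is why it suits the use case chosen here.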

After completing EDA, split the data and fit the model:

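# Split train and test datasets (X holds the features, Y the target)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

model = LogisticRegression(random_state=1)
lg = model.fit(X_train, y_train)

# Creating the confusion matrix (make_confusion_matrix is a helper not shown in this article)
make_confusion_matrix(lg, X_test, y_test)

# Checking model performance (get_metrics_score is a helper not shown in this article)
scores_LR = get_metrics_score(lg, X_train, X_test, y_train, y_test, flag=True)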

# How can we further optimize?
# Optimal threshold as per the AUC-ROC curve:
# the optimal cut-off is where tpr is high and fpr is low
fpr, tpr, thresholds = metrics.roc_curve(y_test, lg.predict_proba(X_test)[:,1])


optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)

Output:

0.18131736558627232
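Picking the threshold that maximizes tpr - fpr is known as Youden's J statistic. Here is a self-contained sketch on made-up labels and scores (not the loan data) that shows how the chosen cut-off is then applied to the predicted probabilities.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J statistic: the threshold that maximizes tpr - fpr
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]

# Apply the cut-off: roc_curve treats score >= threshold as positive
y_pred = (y_score >= best_threshold).astype(int)

print(best_threshold, tpr[best_idx], fpr[best_idx])
```

Lowering the cut-off below the default of 0.5 labels more cases as positive, which trades some precision for higher recall.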

scores_LR = get_metrics_score(lg,X_train,X_test,y_train,y_test,threshold=optimal_threshold_auc_roc,roc=True)

Accuracy on training set :  0.897428571428571
Accuracy on test set :  0.902
Recall on training set :  0.5912386706948641
Recall on test set :  0.5590604026845637
Precision on training set :  0.5773462783171521
Precision on test set :  0.6039370078740157

As you can see, recall has improved considerably compared with the first run.
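The get_metrics_score helper called above is not defined in the article. A minimal sketch of what such a helper could look like is shown below. Its body, the default threshold of 0.5, and the synthetic data are all assumptions, and the roc argument from the original calls is left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def get_metrics_score(model, X_train, X_test, y_train, y_test,
                      threshold=0.5, flag=True):
    """Hypothetical helper: accuracy, recall and precision at a given threshold."""
    scores = []
    # Turn predicted probabilities into labels using the chosen cut-off
    pred_train = (model.predict_proba(X_train)[:, 1] > threshold).astype(int)
    pred_test = (model.predict_proba(X_test)[:, 1] > threshold).astype(int)
    for name, metric in [("Accuracy", accuracy_score),
                         ("Recall", recall_score),
                         ("Precision", precision_score)]:
        train_score = metric(y_train, pred_train)
        test_score = metric(y_test, pred_test)
        scores.extend([train_score, test_score])
        if flag:
            print(f"{name} on training set : ", train_score)
            print(f"{name} on test set : ", test_score)
    return scores


# Synthetic, imbalanced stand-in for the loan data
X, y = make_classification(n_samples=500, n_features=6, weights=[0.9, 0.1],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
lg = LogisticRegression(random_state=1).fit(X_train, y_train)

default_scores = get_metrics_score(lg, X_train, X_test, y_train, y_test, flag=False)
lowered_scores = get_metrics_score(lg, X_train, X_test, y_train, y_test, threshold=0.2)
```

Because a lower threshold can only add predicted positives, recall at 0.2 is never lower than recall at 0.5 for the same model, while precision may drop.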

What could be the optimal threshold?

scores_LR = get_metrics_score(lg,X_train,X_test,y_train,y_test,threshold=optimal_threshold_curve,roc=True)

Accuracy on training set :  0.938571428571428
Accuracy on test set :  0.9393333333333334
Recall on training set :  0.650392749244713
Recall on test set :  0.6216778523489933
Precision on training set :  0.6012081218274112
Precision on test set :  0.6046987951807228

The model performs well at a threshold of 0.3, and much better than logistic regression without threshold tuning.

Similarly, we can run the same tests for the following variations of the model:

Logistic regression on scaled features (transform the independent variables using techniques such as standardization)
Logistic regression with Sequential Feature Selector (a greedy algorithm that adds or removes features based on the metric score)
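Both variations can be sketched with scikit-learn pipelines. The example below assumes scikit-learn's own SequentialFeatureSelector (the article does not say which implementation was used), a synthetic dataset in place of the loan data, and an arbitrary choice of four features to keep.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: 8 numeric features)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Variation 1: standardize the features, then fit logistic regression
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
scaled_lr.fit(X_train, y_train)

# Variation 2: greedy forward selection of 4 features, scored by recall
selector = SequentialFeatureSelector(
    LogisticRegression(random_state=1), n_features_to_select=4,
    direction="forward", scoring="recall", cv=3)
sfs_lr = make_pipeline(StandardScaler(), selector,
                       LogisticRegression(random_state=1))
sfs_lr.fit(X_train, y_train)

for name, model in [("scaled", scaled_lr), ("scaled + SFS", sfs_lr)]:
    print(name, "test recall:", recall_score(y_test, model.predict(X_test)))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test set leaks into the model.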

Review recall on the training and test sets, and plot the ROC curve for both the train and test datasets. The model with the best scores is the natural choice.

Day 3: Decision Trees and wrap-up on Classification
