Development of Credit Scoring Models and Cut-off Score Setting in R
Tony Gitonga
Data Strategy | Finance Analytics | Digital Banking | Corporate Strategy | Change Management | Data Science | Machine Learning | Deep Learning | AI
A credit score is a numerical representation of an individual's creditworthiness, commonly used by credit lending institutions such as banks, telecoms and fintechs. These organizations employ credit models that consider various factors, ranging from customer demographic details and sources of income to credit commitments and past loan performance, to calculate a credit score.
Credit scoring models can broadly be categorized into:
A credit score indicates the likelihood of timely repayment if a loan or credit card is advanced to the person. Over time, different scorecards have been developed to address specific needs of the business, for example:
They are used by lenders to determine the interest rates and terms offered to borrowers. Each scorecard is designed to assess specific aspects of credit risk and aid in decision-making processes related to lending, collections, fraud detection, or pricing.
EDA - Exploratory Data Analysis
This is the initial exploration and analysis of the credit-related data to gain insights and understanding of the variables and their relationships. It helps to identify patterns, trends, outliers, and potential issues within the data. R has fantastic packages that automate this tedious task. Some of these libraries are:
There are many others, but these are my personal favorites.
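As a quick illustration (a sketch only, assuming the data frame data and the default_flag target used later in this article), SmartEDA can produce both tabular summaries and a one-shot HTML report:
# Overview of the data set: dimensions, variable types, missingness
SmartEDA::ExpData(data, type = 1)
# Variable-level summary
SmartEDA::ExpData(data, type = 2)
# Full HTML EDA report against the target variable
SmartEDA::ExpReport(data, Target = "default_flag", op_file = "eda_report.html")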
Step 1: Prepare the environment.
R software is open source and offers a rich collection of packages for credit scoring. Before you start your project remember to:
# Clear all objects including hidden objects
rm(list = ls(all.names = TRUE))
# Free up memory and report the memory usage.
gc()
Then you can load the desired packages:
library(ggplot2)
library(pROC)
library(reshape2)
library(dplyr)
library(scales)
library(tidyr)
library(kableExtra)
library(knitr)
library(data.table)
library(ggpubr)
library(gridExtra)
library(lubridate)
library(SmartEDA)
library(caret)
library(randomForest)
library(scorecard)
library(pacman)
library(patchwork)
library(woeBinning)
You need to load the data into R and perform initial ETL, which entails handling dates and other data types during the loading process. There are various ways to load the data into R depending on the source. For simplicity, I will assume my data is a CSV file.
data <- read.csv("hw_table.csv", header = TRUE, stringsAsFactors = FALSE)
#---------------- Format Dates
data$var41 <- as.POSIXct(data$var41, format = "%Y-%m-%d")
data$var38 <- as.POSIXct(data$var38, format = "%Y-%m-%d")
Next, we create a brief description of our data set just to understand what we are dealing with:
The code below creates the table above. I have used the pipe coding style, which allows one to achieve a task in a single block of code that is easy to comprehend.
data %>%
  SmartEDA::ExpData(type = 1) %>%
  kbl() %>%
  kable_minimal() %>%
  kable_paper("hover", full_width = FALSE)
Credit scoring models are developed using supervised machine learning models. The data used must therefore have a binary label column indicating good and bad, often coded as 0 or 1.
The code below will produce the plot shown above.
data %>%
  select(all_of(var_columns)) %>%
  summarise(
    Numeric = sum(sapply(., is.numeric)),
    Categorical = sum(sapply(., is.character)),
    Binary = sum(sapply(., function(x) is.logical(x) && length(unique(x)) <= 2)),
    Date = sum(sapply(., function(x) inherits(x, "Date") || inherits(x, "POSIXct")))
  ) %>%
  pivot_longer(everything(), names_to = "Data Type", values_to = "Count") %>%
  ggplot(aes(x = `Data Type`, y = Count, fill = `Data Type`)) +
  geom_bar(stat = "identity") +
  labs(x = "Data Type", y = "Count", title = "Data Type Counts for Variables Starting with 'var'") +
  scale_fill_manual(values = c("darkolivegreen3", "deepskyblue3", "darkseagreen", "darkslategray3", "bisque3")) +
  theme_minimal()
Credit risk models are developed based on performance windows. These windows of observation are defined based on booking/underwriting months. For example, if a customer goes 90 or more days past due (DPD, Days Past Due) during the performance window, the borrower is considered a 'bad' customer and labeled as the 'event' in the dependent variable; otherwise the borrower is 'good' and labeled as a 'non-event'.
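A minimal sketch of this labeling step is shown below; the max_dpd column is an assumption, so use whatever delinquency field your extract carries for the performance window.
# Flag accounts that reached 90+ DPD within the performance window as bad (event)
data <- data %>%
  mutate(default_flag = ifelse(max_dpd >= 90, 1, 0))  # 1 = bad/event, 0 = good/non-event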
Vintage Analysis
At this point it is important to perform a vintage analysis on the data extract. This will help unearth the following:
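A simple vintage view can be built by grouping bad rates by booking cohort and months on book. This is only a sketch; the months_on_book column is assumed and is not part of the original data dictionary.
vintage <- data %>%
  mutate(booking_month = format(var41, "%Y-%m")) %>%
  group_by(booking_month, months_on_book) %>%
  summarise(bad_rate = mean(default_flag), .groups = "drop")

ggplot(vintage, aes(x = months_on_book, y = bad_rate,
                    colour = booking_month, group = booking_month)) +
  geom_line(size = 1) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Months on Book", y = "Bad Rate", title = "Vintage Curves by Booking Month") +
  theme_minimal()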
Roll Rate Analysis
The other analysis to conduct is a roll rate analysis. This is commonly used in loss forecasting and in the determination of 'bad' customers (defaulters). It is used to define the threshold of delinquency that identifies 'bad' customers. You can also use it to determine whether to fully or partially charge off a loan. Finally, it can be used to assess cure rates and their timing.
Roll Rate Analysis can provide valuable inputs for the calculation of expected credit losses (ECL) under IFRS 9. By analyzing historical roll rates, financial institutions can gain insights into the probability of transitions between different credit states (e.g., from current to 30 days past due, from 30 days past due to charge-off) and use this information to estimate future credit losses.
Roll Rate Analysis can be used to develop probability of default (PD) models, which are one of the components of IFRS 9 impairment calculations. PD models estimate the likelihood of a borrower defaulting within a given time period.
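A roll-rate matrix can be sketched from two consecutive month-end delinquency buckets. The perf_data frame and the bucket_prev/bucket_curr columns below are hypothetical; the idea is simply to tabulate where accounts in each bucket roll to in the following month.
roll_rates <- perf_data %>%
  count(bucket_prev, bucket_curr) %>%
  group_by(bucket_prev) %>%
  mutate(roll_rate = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = bucket_curr, values_from = roll_rate, values_fill = 0)
roll_rates  # each row shows the share of accounts rolling from one bucket to the next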
Analysis by Origination Month
More loans were issued in September than in the other three months. It is important to take this into account when sampling.
While the graph above is useful because it shows the number of loans by booking month, an analysis of the bad rate by month of origination is also necessary.
You can plot the above graphs using the following code.
#--------------- Inspect loan issuance by booking month
data %>%
  mutate(var41 = ymd(var41)) %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  mutate(month_year = format(var41, "%m-%Y")) %>%
  count(month_year, good_bad) %>%
  ggplot(aes(x = month_year, y = n, fill = good_bad)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), position = position_stack(vjust = 0.5), color = "white", size = 4, fontface = "bold") +
  labs(x = "Month-Year", y = "Count", fill = "Bad/Good") +
  ggtitle("Stacked Bar Chart of Goods and Bads by Month-Year") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = comma_format()) +
  scale_fill_manual(values = c("#E7B800", "#00AFBB")) +
  theme_classic()
#--------------- Inspect Bad rates over time by booking month
data %>%
  mutate(var41 = ymd(var41)) %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  mutate(month_year = format(var41, "%m-%Y")) %>%
  count(month_year, good_bad) %>%
  group_by(month_year) %>%
  mutate(percentage = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = month_year, y = percentage, fill = good_bad)) +
  geom_bar(position = "fill", stat = "identity") +
  labs(x = "Month-Year", y = "Percentage", fill = "Bad/Good") +
  ggtitle("Monthly Analysis of Goods and Bads by Booking Month (Hypothetically)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_classic() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_fill_manual(values = c("coral1", "cyan2")) +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
            position = position_fill(vjust = 0.5),
            color = "black", size = 4, fontface = "bold")
In some instances you will find that loans are classified by product codes or other numeric codes. Such codes could be misinterpreted as numerical variables when they are in fact categorical. An analysis of the number of unique values per variable is therefore important.
This helps the data scientist identify categorical variables more easily.
Below is a check of variable completeness, i.e. an analysis of missing values by variable. Different tasks may call for different thresholds of missing values; however, an understanding of each variable informs how many missing values to expect and how to treat them. Missing values can, for example, be imputed using the mean or median, as sketched below.
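A minimal imputation sketch, replacing missing values in the numeric 'var' columns with the column median (whether this is appropriate depends on the variable, as noted above):
data <- data %>%
  mutate(across(where(is.numeric) & starts_with("var"),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))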
#---------- Count the number of unique values in each variable
data %>%
  summarise(across(starts_with("var"), ~ length(unique(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Count") %>%
  mutate(Column = factor(Column, levels = var_columns)) %>%
  ggplot(aes(x = Column, y = Count)) +
  geom_bar(stat = "identity", fill = "darkgoldenrod2") +
  geom_text(aes(label = Count, y = Count), vjust = 0.5, color = "black", size = 3) +
  labs(x = "Column", y = "Count") +
  ggtitle("Count of Unique Values in Each Column (Starting with 'var')") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  theme_classic()
#---------------------- Check for completeness of the data
#---------- The data doesn't have incomplete records except for Variable 38
data %>%
  select(starts_with("var")) %>%
  summarise_all(~ sum(!is.na(.)) / n()) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Completeness") %>%
  arrange(Completeness) %>%
  kbl() %>%
  kable_paper("hover", full_width = F)
Definition of the Good/Bad Label
As explained in the section above, the days past due (DPD) on a facility help determine the good/bad flag. This is also informed by the structure of the product.
data %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  group_by(good_bad) %>%
  summarise(percent = n() / nrow(data)) %>%
  ggplot(aes(x = good_bad, y = percent, fill = good_bad)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(percent * 100, 1), "%")),
            position = position_stack(vjust = 0.5),
            vjust = -0.5, color = "black", fontface = "bold") +
  labs(x = "Category", y = "Percentage", title = "Distribution of Good vs. Bad") +
  scale_y_continuous(labels = percent_format()) +
  theme_minimal()
Relationship between Regressors and the Response Variable
It is good practice to subject the explanatory variables, both continuous and discrete, to a distribution analysis in light of the target variable Y.
These plots help spot instances where the proportion of bads exceeds that of goods and vice versa.
Plots of a batch for continuous variables.
Plots of a batch for discrete variables.
These plots can be generated using the code below.
#----------------------------------------------------------------------------------
#----------------------------- Analysis of Discrete variables
#-----------------------------------------------------------------------------------
variables <- c(paste0("var", 2:13), "var42")
data[, variables] <- lapply(data[, variables], factor)
# Create a list to store the plots
plots <- list()
# Create the stack bar plots for each variable
for (i in 1:length(variables)) {
  variable <- variables[i]

  plot <- data %>%
    mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
    count(!!as.name(variable), good_bad) %>%
    group_by(!!as.name(variable)) %>%
    mutate(percentage = n / sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x = !!as.name(variable), y = percentage, fill = good_bad)) +
    geom_bar(position = "fill", stat = "identity") +
    labs(x = variable, y = "Percentage", fill = "Category") +
    ggtitle(paste("Stacked Bar Plot of", variable, "by Good/Bad")) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme_classic() +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_fill_manual(values = c("#E7B800", "#00AFBB")) +
    geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
              position = position_fill(vjust = 0.5),
              color = "black", size = 3.5, fontface = "bold")

  plots[[variable]] <- plot

  print(variable)

  # Check if 4 plots have been created or it's the last variable
  if (i %% 4 == 0 || i == length(variables)) {
    # Arrange the plots in a grid
    grid.arrange(grobs = plots, ncol = 2)
    # Clear the list for the next set of plots
    plots <- list()
  }
}
Feature Engineering
Feature engineering enhances the effectiveness of machine learning models by carefully selecting and preparing the relevant features. This process involves choosing the most appropriate features for the model and transforming them in a manner that aligns with the requirements of the machine learning algorithm. By performing feature engineering, the performance and accuracy of the model can be improved, leading to better predictions and outcomes. The variables that undergo feature engineering are also known as independent variables, features, predictors, attributes, model factors, covariates, regressors, or characteristics.
This task can take many forms.
Feature Reduction:
Correlation
Below are some of the highly correlated variables in the data. If these variables are left in the model, they can cause problems in interpreting and drawing reliable conclusions from regression models. It is important to assess and address multicollinearity to ensure the validity of the results and the reliability of the model. Various techniques, such as variance inflation factor (VIF) analysis, can help detect and mitigate multicollinearity in regression analysis, as sketched below.
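A sketch of the VIF check; this assumes the car package (not in the library list above) and restricts the check to the numeric 'var' columns:
library(car)
# Numeric predictor columns starting with "var"
num_vars <- names(data)[startsWith(names(data), "var") & sapply(data, is.numeric)]
vif_model <- glm(reformulate(num_vars, response = "default_flag"),
                 data = data, family = binomial)
car::vif(vif_model)  # values well above 5-10 usually flag problematic collinearity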
Some of the drawbacks of multicollinearity are:
Other methods of Feature Selection
A missing-value threshold can also be used as a feature selection method: features whose share of missing values exceeds a certain threshold are dropped, as sketched below. The next step is to bin the variables.
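A sketch of the missing-value threshold rule (the 30% cut-off is illustrative):
missing_share <- sapply(data[, startsWith(names(data), "var")], function(x) mean(is.na(x)))
drop_vars <- names(missing_share)[missing_share > 0.30]
data <- data %>% select(-all_of(drop_vars))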
Binning, also known as discretization, is a data preprocessing technique that involves dividing a continuous variable into a set of predefined bins or intervals. Instead of using the original continuous values, the data points are assigned to the corresponding bin based on their values. Binning can be useful in several scenarios:
However, it's important to consider the potential limitations and drawbacks of binning:
Binning can be a useful technique in certain situations, such as handling non-linear relationships or simplifying complex models. However, it is essential to carefully consider the potential drawbacks and evaluate the impact of binning on the analysis before applying it to the data.
Fine and Coarse classing
Fine and coarse classing are two approaches to binning or discretizing continuous variables. These approaches differ in the level of granularity or detail with which the continuous variable is divided into bins or categories.
Monotonic binning is a technique commonly used in credit scoring to discretize continuous variables while preserving their monotonic relationship with the target variable (e.g., default or non-default). The objective is to create bins that exhibit a consistent trend or monotonic pattern with the target variable across the range of values.
Bin evaluation: Calculate the default rate (or any other relevant metric) within each bin. Evaluate the monotonicity of the bins by analyzing the trend of the default rate across the bins. Ideally, the default rate should consistently increase or decrease monotonically as the values of the continuous variable increase. This leads to the computation of WoE and IV.
Information Value (IV) comes from information theory; it measures the predictive power of independent variables, which is useful in feature selection.
We then transform all the independent variables using the weight of evidence (WoE) method. Based on the proportion of good applicants to bad applicants at each group level, this method measures the “strength” of grouping for differentiating good and bad risk, and attempts to find a monotonic relationship between the independent variables and the target variable.
According to Siddiqi (2006), by convention, the values of the IV statistic in credit scoring can be interpreted as follows: below 0.02 the variable is not predictive; 0.02 to 0.1 indicates weak predictive power; 0.1 to 0.3 is medium; 0.3 to 0.5 is strong; and values above 0.5 are suspicious and should be checked, as the variable may be too good to be true.
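As a quick sketch of an IV screen using the scorecard package loaded above (dropping the id and date columns is my assumption about what should be excluded):
# Bin all candidate predictors and sum each variable's bin-level IV
bins_iv <- woebin(subset(data, select = -c(loanid, var38, var41)), y = "default_flag")
iv_summary <- sapply(bins_iv, function(b) sum(b$bin_iv))
sort(iv_summary, decreasing = TRUE)  # variables ranked by information value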
The final step is to create two data sets: one for the bin transformation and another for the WoE transformation.
# Convert factor variables to numeric representation
for (col in names(data)[sapply(data, is.factor)]) {
  data[[col]] <- as.integer(data[[col]])
}
# Correlation
cor_matrix <- cor(data[, startsWith(names(data), "var")], method = "pearson")
# Find attributes that are highly correlated
highlyCorrelated <- findCorrelation(cor_matrix, cutoff = 0.7)
print(colnames(data[, startsWith(names(data), "var")])[highlyCorrelated])
#- Plot
corrplot::corrplot(cor(data[, colnames(data[, startsWith(names(data), "var")])[highlyCorrelated]], method = "pearson"),
                   method = "color",
                   type = "lower",
                   tl.col = "black",
                   tl.srt = 45)
#----------------- Uncorrelated Data for Iv computation
uncor_data <- data[, -highlyCorrelated]
#---------------------------------------------------------------------------------------------------------------------------------------------
#--------------- Variable selection by Iv
#--------------------------------------------------------------------------------------------------------------------------------------------
variables <- variables[variables %in% colnames(uncor_data)]
uncor_data[, variables] <- lapply(uncor_data[, variables], factor)
# Bin the data
binning <- woe.tree.binning(subset(uncor_data[, c(2:ncol(uncor_data), 1)], select = -loanid), 'default_flag', uncor_data, event.class = 0)
# woe.binning.plot(binning, multiple.plots = FALSE)
uncor_data_binned <- woe.binning.deploy(uncor_data, binning, min.iv.total = 0.1, add.woe.or.dum.var='woe')
uncor_data_binned <- uncor_data_binned %>%
? mutate_at(vars(starts_with("woe")), ~ . / 100)
# Tabulate the binned variables
tabulate_binning <- woe.binning.table(binning)
tabulate_binning
model_data <- subset(uncor_data_binned, select = grepl("binned|default_flag", names(uncor_data_binned)))
colnames(model_data) <- gsub("\\.", "_", colnames(model_data))
You could also create the bins manually and compute the WoE using the script below.
#-----------------------------------------------------------------------------------------------------------------------------------
#--------------------- Function to compute Woe and Iv manually
#------------------------------------------------------------------------------------------------------------------------------------
tbl_fun <- function(x, y) {
  options(scipen = 999)
  y_recode <- ifelse(y == 0, "Good", "Bad")
  Mat <- as.matrix(table(as.factor(x), as.factor(y_recode)))
  Total <- Mat[, 1] + Mat[, 2]
  Total_Pct <- percent(Total / sum(Mat), accuracy = 0.01)
  Bad_Pct <- percent((Mat[, 1] / sum(Mat[, 1])), accuracy = 0.01)
  Good_Pct <- percent((Mat[, 2] / sum(Mat[, 2])), accuracy = 0.01)
  Bad_Rate <- percent(round((Mat[, 1] / (Mat[, 1] + Mat[, 2])), 2), accuracy = 0.01)
  WOE <- round(log((Mat[, 2] / sum(Mat[, 2])) / (Mat[, 1] / sum(Mat[, 1]))), 4)
  g_b_comp <- ifelse(Mat[, 1] == Mat[, 2], 0, 1)
  IV <- round(ifelse(g_b_comp == 0, 0, ((Mat[, 2] / sum(Mat[, 2])) - (Mat[, 1] / sum(Mat[, 1]))) * (WOE / 10)), 4)
  Total_IV <- sum(IV)
  tbl <- as.data.frame(cbind(Bins = rownames(Mat), Total, Total_Pct, Good = Mat[, 2], Good_Pct, Bad = Mat[, 1], Bad_Pct, Bad_Rate, WOE, IV, Total_IV))
  rownames(tbl) <- NULL
  return(tbl)
}
tbl_fun(select(model_data, starts_with("var"), default_flag)$var8_binned,
        select(model_data, starts_with("var"), default_flag)$default_flag)
tbl_fun(select(model_data, starts_with("var"), default_flag)$var43_binned,
        select(model_data, starts_with("var"), default_flag)$default_flag)
This is the second batch of the code.
plot1 <- model_data %>%
  mutate(default_flag = ifelse(default_flag == 0, "Good", "Bad"),
         var19_binned = as.character(var19_binned)) %>%
  na.omit() %>%
  group_by(var19_binned, default_flag) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = var19_binned, y = count, fill = default_flag)) +
  geom_col(position = "stack") +
  labs(x = "Credit History", y = "Count", fill = "Good-Bad") +
  ggtitle("Credit History for var43 After Binning") +
  theme_minimal()

plot2 <- model_data %>%
  mutate(default_flag = ifelse(default_flag == 0, "Good", "Bad"),
         Bins = as.character(var19_binned)) %>%
  na.omit() %>%
  group_by(Bins, default_flag) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = default_flag, values_from = count) %>%
  mutate(Bad_Rate = Bad / (Bad + Good)) %>%
  ggplot(aes(x = Bins, y = Bad_Rate, group = 1)) +
  geom_line(size = 1.5, color = "blue") +
  geom_text(aes(label = scales::percent(Bad_Rate, accuracy = 0.01)),
            vjust = -1.2, size = 3.5, fontface = "bold") +
  labs(x = "Bins", y = "Bad Rate") +
  ggtitle("Bad Rate by Bins for Var43") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent)

combined_plot <- plot1 + plot2
# Arrange the plots side by side
combined_plot <- combined_plot +
  plot_layout(ncol = 2)
# Display the combined plot
combined_plot
Below are some of the plots generated by the scripts above.
Count of observations per bin and the corresponding bad rates per bin.
Below is the WoE plot by bins
This is the script for the plot above.
# woe_plot is assumed to be a WoE summary table (e.g. the output of tbl_fun above)
# with Bins and WOE columns
woe_plot %>%
  slice(-n()) %>%
  mutate(WOE = as.numeric(WOE)) %>%
  ggplot(aes(x = Bins, y = WOE, fill = factor(sign(WOE)))) +
  geom_col(position = "stack") +
  labs(x = "Bins", y = "WoE", fill = "WoE") +
  ggtitle("WoE plot for the Bins") +
  theme_minimal() +
  scale_fill_manual(values = c("#E7B800", "#00AFBB"),
                    labels = c("Positive", "Negative"),
                    guide = guide_legend(reverse = TRUE))
Monotonic binning means the relationship between the variable being binned and the target variable follows a consistent pattern: as the values of the independent variable increase or decrease, the bad rate consistently increases or decreases, respectively. This is achieved through coarse classing.
The wpa package can help achieve this through the script below.
IV_report(
  data = uncor_data[, -which(names(uncor_data) == "loanid")],
  predictors = NULL,
  outcome = "default_flag",
  bins = 5,
  max_var = 9,
  path = "IV report",
  timestamp = TRUE
)
The wpa package gives a WoE plot that is monotonic, as shown below. The final part would then be to combine bins 440 through 525.
Split the data into Training and Test
For modelling purposes the data is split into training and test sets: one is used for training and the other for validation. You can also explore the k-fold cross-validation technique, which does not require a single fixed split.
#-------------------------------------------------------------------------------------------------------------------------------------
#------------------ Split the data into Training and Test
# Set the seed for reproducibility
raw_data <- uncor_data_binned[, c("default_flag", sub("_(.*)", "", colnames(select(model_data, starts_with("var")))))]
bins <- woebin(raw_data,
               y = "default_flag",
               positive = "bad|1",
               method = "tree"); bins[1]
raw_data <- raw_data[,c("default_flag",names(bins))]
#--------------------------------------------------------------------------------------------------------------------------------------
set.seed(77)
# Perform stratified sampling
train_indices <- createDataPartition(subset(model_data, select = grepl("woe|default_flag", names(model_data)))$default_flag,
                                     p = 0.8, list = FALSE)
# Split the data into train and test sets based on the indices
train_data <- subset(model_data, select = grepl("woe|default_flag", names(model_data)))[train_indices, ]
test_data <- subset(model_data, select = grepl("woe|default_flag", names(model_data)))[-train_indices, ]
train_raw <- raw_data[train_indices, ]
test_raw <- raw_data[-train_indices, ]
#--------------------------------------------------------------------------------------------------------------------------------------
#-------------------------- Fit a Logistic Regression Model to WOE values
#--------------------------------------------------------------------------------------------------------------------------------------
#--------------- Model 1
colnames(train_data) <- c("default_flag", paste0(sub("^[^_]*_([^_]+)_.*$", "\\1", colnames(train_data))[-1],"_woe"))
formula <- as.formula(paste("I(default_flag == 1) ~", paste(colnames(select(train_data, ends_with("woe"))), collapse = " + ")))
formula
model_1 <- glm(formula, data = train_data, family = "binomial")
logit_model <- step(model_1, direction = "both", trace = FALSE)
summary(logit_model)
#-------------- Model 2
train_woe <- woebin_ply(train_raw, bins)
formula <- as.formula(paste("I(default_flag == 1) ~", paste(colnames(select(train_woe, ends_with("woe"))), collapse = " + ")))
formula
null <- glm(default_flag ~ 1, data = train_woe, family = binomial)
full <- glm(formula, data = train_woe, family = binomial)
bicglm <- step(null, scope=formula(full), direction = "both", k = log(nrow(train_woe)))
model <- glm(formula, family = binomial(), data = train_woe)
logistic_step <- step(model, direction = "both", trace = FALSE)
summary(logistic_step)
pred = predict(logistic_step)
y_pred_prob = predict(logistic_step, type = 'response')
After fitting the data to the model, it is important to check the following: all p-values are less than 0.05, the coefficients are positive, and there are no signs of multicollinearity. If multicollinearity is detected, you can consider dropping or combining the affected variables.
The final results look like this:
The variables used in the final model are not highly correlated.
Gini and AUC
What counts as acceptable AUC and Gini performance depends on the specific context, the nature of the problem, and the performance standards set by the domain or industry.
In general, an AUC value of 0.66 is considered reasonable and suggests that the model has some discriminatory power to distinguish between the positive and negative classes. An AUC of 0.5 represents a model with no discrimination (similar to random guessing), while an AUC of 1.0 indicates a perfect classification model. A Gini coefficient of 0.32 suggests a moderate level of discrimination by the model but indicates room for improvement.
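A sketch of how the AUC and Gini above can be computed with pROC (loaded earlier), using the fitted model's predicted probabilities on the training WoE data:
roc_obj <- pROC::roc(train_woe$default_flag, y_pred_prob)
auc_value <- as.numeric(pROC::auc(roc_obj))
gini_value <- 2 * auc_value - 1   # Gini = 2 * AUC - 1
auc_value; gini_value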
Generating a Scorecard
Convert the model coefficients into a simplified scoring system, known as the scorecard. The scorecard assigns points to each predictor variable based on their coefficients, reflecting their relative importance in predicting creditworthiness. The scaling process involves determining the range and granularity of the scorecard, such as assigning points to specific ranges of values or categories.
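Note that the scorecard package can perform this scaling directly from the bins and the fitted model. The sketch below uses illustrative scaling parameters intended to mirror the manual version that follows; check the package's odds convention against your own before relying on the exact point values.
score_card <- scorecard(bins, logistic_step,
                        points0 = 600, odds0 = 1/50, pdo = 20)
score_card$basepoints                                   # base points of the card
train_scores <- scorecard_ply(train_raw, score_card)    # total score per account
head(train_scores)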
#-------------------------------------------------------------------------------------------------------------------------------
# Manual scorecard Generation
#--------------------------------------------------------------------------------------------------------------------------------
#=---------------- Get the coefficients from the model
coefficients <- coef(logistic_step);coefficients
#=------------------ Define parameters for the scorecard
points0 <- 600
pdo <- 20
odds0 <- 50
n <- length(model$coefficients[-1]);n
factor <- pdo / log(2)
offset <- points0 - factor * log(odds0)
alpha <- coefficients[1]
BaseScore <- (offset - (factor * coefficients[1]));round(BaseScore,0)
The score card is generated using the script below.
# Create an empty data frame for the scorecard
card <- data.frame(Variable = character(),
                   Bin = character(),
                   Woe = numeric(),
                   Iv = numeric(),
                   Coef = numeric(),
                   stringsAsFactors = FALSE)
#------------- Calculate the score for each bin of each variable
suppressWarnings({
for (var in sub("^(.*?)_.*$", "\\1", colnames(train_data)[-1])) {
  var_tbl <- bins[[var]]          # binning table for this variable
  variable <- var_tbl$variable
  bin <- var_tbl$bin
  woe <- var_tbl$woe
  iv <- var_tbl$bin_iv
  coefficient <- coefficients[paste0(var, "_woe")]
  card <- rbind(card, data.frame(Variable = variable, Bin = bin, Woe = woe, Iv = iv, Coef = coefficient))
  print(var)
}
})
card$Neutral <- round(-(alpha/n)*factor+offset/n,0)
card$ScoreCntr <- round(card$Coef * card$Woe,4)
card$Score <- round((card$ScoreCntr + (alpha/n))*factor + offset/n,0)
head(card,4)
From the scorecard above, the total score per customer decreases as the probability of default increases.
When scaling the model into a scorecard, logistic regression coefficients from model fitting as well as the transformed WoE values are required. Subsequently you need to convert the score from the log-odds unit to a points system.
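A quick worked example of the scaling constants used above (points0 = 600, pdo = 20, odds0 = 50):
factor <- 20 / log(2)              # approx. 28.85 points per doubling of the odds
offset <- 600 - factor * log(50)   # approx. 487.1
# score = offset - factor * log(odds of default), so at bad:good odds of 1:50
# the score is exactly 600 points
score_at_1_in_50 <- offset - factor * log(1/50); score_at_1_in_50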
From the script above, the base score was set at 529:
BaseScore <- (offset - (factor * coefficients[1]))
This is the resulting score card.
Determination of Cutoff Score
The choice of cut-off score depends on the specific requirements, risk appetite, and business objectives of the organization implementing the credit scoring model. Different methods may be used in combination, and adjustments to the cut-off score may be made over time based on performance monitoring and feedback.
Ultimately, the determination of the cut-off score involves finding a balance between minimizing credit risk and maximizing business opportunities while considering various factors, including model performance, business objectives, and regulatory considerations.
You can use the gains (lift) table to set the cut-off score. The gains table is constructed by dividing the population into deciles or groups based on the predicted probabilities or scores from the credit scoring model, as sketched below.
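A sketch of such a gains table, built from the res tibble of scores and actual labels created further below (dplyr::ntile() splits the scored population into ten equal groups):
gain_table <- res %>%
  mutate(decile = ntile(score, 10)) %>%   # decile 1 = lowest (riskiest) scores
  group_by(decile) %>%
  summarise(accounts = n(),
            bads = sum(y_actual),
            bad_rate = mean(y_actual),
            .groups = "drop") %>%
  mutate(cum_bad_capture = cumsum(bads) / sum(bads))
gain_table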
The optimal cut-off score on the score scale can be obtained by examining the maximum vertical distance on the KS curve (i.e., the largest gap) between the cumulative good percentage and the cumulative bad percentage. This point corresponds to the optimal cut-off score. Observations with scores above the cut-off will be classified as "good," while those below it will be classified as "bad."
Choosing the cut-off score involves a trade-off between the false positive rate (classifying a "good" observation as "bad") and the false negative rate (classifying a "bad" observation as "good"). The specific choice will depend on the desired balance between these two types of errors, which may vary depending on the application and the costs associated with misclassifications.
y = train_raw$default_flag
pred = predict(logistic_step)
y_pred_prob = predict(logistic_step, type = 'response')
resp = predict(logistic_step, type = 'response')
res = tibble(logit = pred
             , y_pred_prob = y_pred_prob
             , y_actual = y
             , odds = exp(pred)
             , prob = odds / (odds + 1)
             , prob_ctrl = resp )
sc = scorecard_ply(train_raw, score_card)
res$score = sc[[1]]
res$score_ctrl = offset - factor * res$logit
res
ks.test(res$y_pred_prob, res$y_actual, alternative='greater')
df_summary <- res %>%
  arrange(score) %>%
  mutate(cum_bad = cumsum(y_actual) / sum(y_actual),
         cum_good = cumsum(1 - y_actual) / sum(1 - y_actual))

max_ks_score <- df_summary$score[which.max(abs(df_summary$cum_bad - df_summary$cum_good))]

ggplot(df_summary, aes(x = score)) +
  geom_line(aes(y = cum_bad), color = "red", size = 1, linetype = "solid") +
  geom_line(aes(y = cum_good), color = "blue", size = 1, linetype = "solid") +
  geom_vline(aes(xintercept = max_ks_score), color = "orange",
             linetype = "dashed", size = 1.5) +
  annotate("text", x = max_ks_score,
           y = max(df_summary$cum_bad, df_summary$cum_good) + 0.05,
           label = paste0("KS =", round(max(abs(df_summary$cum_bad - df_summary$cum_good)), 3)), color = "black") +
  annotate("text", x = max_ks_score,
           y = -Inf,
           label = paste0("x =", round(max_ks_score, 3)),
           vjust = -1, color = "black") +
  labs(x = "Score Cutoff", y = "Cumulative Rate", title = "KS Plot") +
  theme_minimal()
Additionally, it's important to validate the chosen cut-off score on an independent dataset or through cross-validation to ensure its generalization and performance.
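A sketch of that out-of-sample check, using the objects created earlier (bins, logistic_step, score_card and the hold-out test_raw); perf_eva() and perf_psi() are from the scorecard package loaded above:
test_woe <- woebin_ply(test_raw, bins)
test_pred <- predict(logistic_step, newdata = test_woe, type = "response")

# Compare discrimination (KS, ROC) between the training and test samples
perf_eva(pred = list(train = y_pred_prob, test = test_pred),
         label = list(train = train_woe$default_flag, test = test_woe$default_flag),
         show_plot = c("ks", "roc"))

# Population stability of the scores between the two samples
perf_psi(score = list(train = scorecard_ply(train_raw, score_card),
                      test = scorecard_ply(test_raw, score_card)),
         label = list(train = train_raw$default_flag, test = test_raw$default_flag))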