Development of Credit Scoring Models and Cut-off Score Setting in R
Tony Gitonga
Data Strategy | Finance Analytics | Digital Banking | Corporate Strategy | Change Management | Data Science | Machine Learning | Deep Learning | AI
A credit score is a numerical representation of an individual's creditworthiness, commonly used by credit lending institutions such as banks, telecoms and fintechs. These organizations employ credit models that consider various factors, ranging from customer demographic details and sources of income to credit commitments and past loan performance, to calculate a credit score.
Credit scoring models can broadly be categorized into:
A credit score indicates the likelihood of timely repayment if a loan or credit card is advanced to the person. Over time, different scorecards have been developed to address specific needs of the business, for example:
They are used by lenders to determine the interest rates and terms offered to borrowers. Each scorecard is designed to assess specific aspects of credit risk and aid in decision-making processes related to lending, collections, fraud detection, or pricing.
EDA - Exploratory Data Analysis
This is the initial exploration and analysis of the credit-related data to gain insights and understanding of the variables and their relationships. It helps to identify patterns, trends, outliers, and potential issues within the data. R has fantastic packages that automate this tedious task. Some of these libraries are:
There are many others, but these are my personal favorites.
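As a quick illustration (a sketch only, assuming the data frame data and the default_flag target used later in this article), SmartEDA can produce both tabular summaries and a one-shot HTML report:
# Overview of the data set: dimensions, variable types, missingness
SmartEDA::ExpData(data, type = 1)
# Variable-level summary
SmartEDA::ExpData(data, type = 2)
# Full HTML EDA report against the target variable
SmartEDA::ExpReport(data, Target = "default_flag", op_file = "eda_report.html")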
Step 1: Prepare the environment.
R software is open source and offers a rich collection of packages for credit scoring. Before you start your project remember to:
# Clear all objects including hidden objects
rm(list = ls(all.names = TRUE))
# Free up memory and report the memory usage.
gc()
Then you can load the desired packages:
library(ggplot2)
library(pROC)
library(reshape2)
library(dplyr)
library(scales)
library(tidyr)
library(kableExtra)
library(knitr)
library(data.table)
library(ggpubr)
library(gridExtra)
library(lubridate)
library(SmartEDA)
library(caret)
library(randomForest)
library(scorecard)
library(pacman)
library(patchwork)
library(woeBinning)
You need to load the data into R and perform initial ETL, which entails handling dates and other data types during the loading process. There are various ways to load the data into R depending on the source. For simplicity, I will assume my data is a CSV file.
data <- read.csv("hw_table.csv", header = TRUE, stringsAsFactors = FALSE)
#---------------- Format Dates
data$var41 <- as.POSIXct(data$var41, format = "%Y-%m-%d")
data$var38 <- as.POSIXct(data$var38, format = "%Y-%m-%d")
Next, we create a brief description of our data set just to understand what we are dealing with:
The code below creates the table above. I have used the pipe coding style, which allows one to achieve a task in a single block of code that is easy to comprehend.
data %>%
  SmartEDA::ExpData(type = 1) %>%
  kbl() %>%
  kable_minimal() %>%
  kable_paper("hover", full_width = FALSE)
Credit scoring models are developed using supervised machine learning models. The data used must therefore have a binary label column indicating good and bad, often coded as 0 or 1.
The code below will produce the plot shown above.
data %>%
  select(all_of(var_columns)) %>%
  summarise(
    Numeric = sum(sapply(., is.numeric)),
    Categorical = sum(sapply(., is.character)),
    Binary = sum(sapply(., function(x) is.logical(x) && length(unique(x)) <= 2)),
    Date = sum(sapply(., function(x) inherits(x, "Date") || inherits(x, "POSIXct")))
  ) %>%
  pivot_longer(everything(), names_to = "Data Type", values_to = "Count") %>%
  ggplot(aes(x = `Data Type`, y = Count, fill = `Data Type`)) +
  geom_bar(stat = "identity") +
  labs(x = "Data Type", y = "Count", title = "Data Type Counts for Variables Starting with 'var'") +
  scale_fill_manual(values = c("darkolivegreen3", "deepskyblue3", "darkseagreen", "darkslategray3", "bisque3")) +
  theme_minimal()
Credit risk models are developed based on performance windows. These windows of observation are defined based on booking/underwriting months. For example, if a customer goes 90 or more days past due (DPD, Days Past Due) during the performance window, the borrower is considered a 'bad' customer and labeled as the 'event' in the dependent variable; otherwise the borrower is 'good' and labeled as a 'non-event'.
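A minimal sketch of this labeling step is shown below; the max_dpd column is an assumption, so use whatever delinquency field your extract carries for the performance window.
# Flag accounts that reached 90+ DPD within the performance window as bad (event)
data <- data %>%
  mutate(default_flag = ifelse(max_dpd >= 90, 1, 0))  # 1 = bad/event, 0 = good/non-event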
Vintage Analysis
At this point it is important to perform a vintage analysis on the data extract. This will help unearth the following:
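A simple vintage view can be built by grouping bad rates by booking cohort and months on book. This is only a sketch; the months_on_book column is assumed and is not part of the original data dictionary.
vintage <- data %>%
  mutate(booking_month = format(var41, "%Y-%m")) %>%
  group_by(booking_month, months_on_book) %>%
  summarise(bad_rate = mean(default_flag), .groups = "drop")

ggplot(vintage, aes(x = months_on_book, y = bad_rate,
                    colour = booking_month, group = booking_month)) +
  geom_line(size = 1) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Months on Book", y = "Bad Rate", title = "Vintage Curves by Booking Month") +
  theme_minimal()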
Roll Rate Analysis
The other analysis to conduct is a roll rate analysis. This is commonly used in loss forecasting and in the determination of 'bad' customers (defaulters). It is used to define the threshold of delinquency that identifies 'bad' customers. You can also use it to determine whether to fully or partially charge off a loan. Finally, it can be used to assess cure rates and their timing.
Roll Rate Analysis can provide valuable inputs for the calculation of expected credit losses (ECL) under IFRS 9. By analyzing historical roll rates, financial institutions can gain insights into the probability of transitions between different credit states (e.g., from current to 30 days past due, from 30 days past due to charge-off) and use this information to estimate future credit losses.
Roll Rate Analysis can be used to develop probability of default (PD) models, which are one of the components of IFRS 9 impairment calculations. PD models estimate the likelihood of a borrower defaulting within a given time period.
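A roll-rate matrix can be sketched from two consecutive month-end delinquency buckets. The perf_data frame and the bucket_prev/bucket_curr columns below are hypothetical; the idea is simply to tabulate where accounts in each bucket roll to in the following month.
roll_rates <- perf_data %>%
  count(bucket_prev, bucket_curr) %>%
  group_by(bucket_prev) %>%
  mutate(roll_rate = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = bucket_curr, values_from = roll_rate, values_fill = 0)
roll_rates  # each row shows the share of accounts rolling from one bucket to the next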
Analysis by Origination Month
More loans were issued in September than in the other three months. It is important to take this into account when sampling.
While the graph above is useful because it shows the number of loans by booking month, an analysis of the bad rate by month of origination is also necessary.
You can plot the above graphs using the following code.
#--------------- Inspect loan issuance by booking month
data %>%
  mutate(var41 = ymd(var41)) %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  mutate(month_year = format(var41, "%m-%Y")) %>%
  count(month_year, good_bad) %>%
  ggplot(aes(x = month_year, y = n, fill = good_bad)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), position = position_stack(vjust = 0.5), color = "white", size = 4, fontface = "bold") +
  labs(x = "Month-Year", y = "Count", fill = "Bad/Good") +
  ggtitle("Stacked Bar Chart of Goods and Bads by Month-Year") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = comma_format()) +
  scale_fill_manual(values = c("#E7B800", "#00AFBB")) +
  theme_classic()
#--------------- Inspect Bad rates over time by booking month
data %>%
  mutate(var41 = ymd(var41)) %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  mutate(month_year = format(var41, "%m-%Y")) %>%
  count(month_year, good_bad) %>%
  group_by(month_year) %>%
  mutate(percentage = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = month_year, y = percentage, fill = good_bad)) +
  geom_bar(position = "fill", stat = "identity") +
  labs(x = "Month-Year", y = "Percentage", fill = "Bad/Good") +
  ggtitle("Monthly Analysis of Goods and Bads by Booking Month (Hypothetically)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_classic() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_fill_manual(values = c("coral1", "cyan2")) +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
            position = position_fill(vjust = 0.5),
            color = "black", size = 4, fontface = "bold")
In some instances you will find that loans are classified by product codes or other numeric codes. Such codes could be misinterpreted as numerical variables when they are in fact categorical. An analysis of the number of unique values per variable is therefore important.
This helps the data scientist identify categorical variables more easily.
Below is a check of variable completeness, i.e. an analysis of missing values by variable. Different tasks may call for different thresholds of missing values; however, an understanding of each variable informs how many missing values to expect and how to treat them. Missing values can, for example, be imputed using the mean or median, as sketched below.
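A minimal imputation sketch, replacing missing values in the numeric 'var' columns with the column median (whether this is appropriate depends on the variable, as noted above):
data <- data %>%
  mutate(across(where(is.numeric) & starts_with("var"),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))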
#---------- Count the number of unique values in each variable
data %>%
  summarise(across(starts_with("var"), ~ length(unique(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Count") %>%
  mutate(Column = factor(Column, levels = var_columns)) %>%
  ggplot(aes(x = Column, y = Count)) +
  geom_bar(stat = "identity", fill = "darkgoldenrod2") +
  geom_text(aes(label = Count, y = Count), vjust = 0.5, color = "black", size = 3) +
  labs(x = "Column", y = "Count") +
  ggtitle("Count of Unique Values in Each Column (Starting with 'var')") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  theme_classic()
#---------------------- Check for completeness of the data
#---------- The data doesn't have incomplete records except for Variable 38
data %>%
  select(starts_with("var")) %>%
  summarise_all(~ sum(!is.na(.)) / n()) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Completeness") %>%
  arrange(Completeness) %>%
  kbl() %>%
  kable_paper("hover", full_width = F)
Definition of the Good/Bad Label
As explained in the section above, the days past due (DPD) on a facility help determine the good/bad flag. This is also informed by the structure of the product.
data %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  group_by(good_bad) %>%
  summarise(percent = n() / nrow(data)) %>%
  ggplot(aes(x = good_bad, y = percent, fill = good_bad)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(percent * 100, 1), "%")),
            position = position_stack(vjust = 0.5),
            vjust = -0.5, color = "black", fontface = "bold") +
  labs(x = "Category", y = "Percentage", title = "Distribution of Good vs. Bad") +
  scale_y_continuous(labels = percent_format()) +
  theme_minimal()
Relationship between Regressors and the Response Variable
It is good practice to subject the explanatory variables, both continuous and discrete, to a distribution analysis in light of the target variable Y.
These plots help spot instances where the proportion of bads exceeds that of goods and vice versa.
Plots of a batch for continuous variables.
Plots of a batch for discrete variables.
These plots can be generated using the code below.
#----------------------------------------------------------------------------------
#----------------------------- Analysis of Discrete variables
#-----------------------------------------------------------------------------------
variables <- c(paste0("var", 2:13), "var42")
data[, variables] <- lapply(data[, variables], factor)
# Create a list to store the plots
plots <- list()
# Create the stack bar plots for each variable
for (i in 1:length(variables)) {
  variable <- variables[i]

  plot <- data %>%
    mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
    count(!!as.name(variable), good_bad) %>%
    group_by(!!as.name(variable)) %>%
    mutate(percentage = n / sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x = !!as.name(variable), y = percentage, fill = good_bad)) +
    geom_bar(position = "fill", stat = "identity") +
    labs(x = variable, y = "Percentage", fill = "Category") +
    ggtitle(paste("Stacked Bar Plot of", variable, "by Good/Bad")) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme_classic() +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_fill_manual(values = c("#E7B800", "#00AFBB")) +
    geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
              position = position_fill(vjust = 0.5),
              color = "black", size = 3.5, fontface = "bold")

  plots[[variable]] <- plot

  print(variable)

  # Check if 4 plots have been created or it's the last variable
  if (i %% 4 == 0 || i == length(variables)) {
    # Arrange the plots in a grid
    grid.arrange(grobs = plots, ncol = 2)
    # Clear the list for the next set of plots
    plots <- list()
  }
}
Feature Engineering
Feature engineering enhances the effectiveness of machine learning models by carefully selecting and preparing the relevant features. This process involves choosing the most appropriate features for the model and transforming them in a manner that aligns with the requirements of the machine learning algorithm. By performing feature engineering, the performance and accuracy of the model can be improved, leading to better predictions and outcomes. The variables that undergo feature engineering are also known as independent variables, features, predictors, attributes, model factors, covariates, regressors, or characteristics.
This task can take many forms.
Feature Reduction:
Correlation
Below are some of the highly correlated variables in the data. If these variables are left in the model, they can cause problems in interpreting and drawing reliable conclusions from regression models. It is important to assess and address multicollinearity to ensure the validity of the results and the reliability of the model. Various techniques, such as variance inflation factor (VIF) analysis, can help detect and mitigate multicollinearity in regression analysis, as sketched below.
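A sketch of the VIF check; this assumes the car package (not in the library list above) and restricts the check to the numeric 'var' columns:
library(car)
# Numeric predictor columns starting with "var"
num_vars <- names(data)[startsWith(names(data), "var") & sapply(data, is.numeric)]
vif_model <- glm(reformulate(num_vars, response = "default_flag"),
                 data = data, family = binomial)
car::vif(vif_model)  # values well above 5-10 usually flag problematic collinearity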
Some of the drawbacks of multicollinearity are:
Other methods of Feature Selection
A missing-value threshold can also be used as a feature selection method: features whose share of missing values exceeds a certain threshold are dropped, as sketched below. The next step is to bin the variables.
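A sketch of the missing-value threshold rule (the 30% cut-off is illustrative):
missing_share <- sapply(data[, startsWith(names(data), "var")], function(x) mean(is.na(x)))
drop_vars <- names(missing_share)[missing_share > 0.30]
data <- data %>% select(-all_of(drop_vars))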
Binning, also known as discretization, is a data preprocessing technique that involves dividing a continuous variable into a set of predefined bins or intervals. Instead of using the original continuous values, the data points are assigned to the corresponding bin based on their values. Binning can be useful in several scenarios:
However, it's important to consider the potential limitations and drawbacks of binning:
Binning can be a useful technique in certain situations, such as handling non-linear relationships or simplifying complex models. However, it is essential to carefully consider the potential drawbacks and evaluate the impact of binning on the analysis before applying it to the data.
Fine and Coarse classing
Fine and coarse classing are two approaches to binning or discretizing continuous variables. These approaches differ in the level of granularity or detail with which the continuous variable is divided into bins or categories.
Monotonic binning is a technique commonly used in credit scoring to discretize continuous variables while preserving their monotonic relationship with the target variable (e.g., default or non-default). The objective is to create bins that exhibit a consistent trend or monotonic pattern with the target variable across the range of values.
Bin evaluation: Calculate the default rate (or any other relevant metric) within each bin. Evaluate the monotonicity of the bins by analyzing the trend of the default rate across the bins. Ideally, the default rate should consistently increase or decrease monotonically as the values of the continuous variable increase. This leads to the computation of WoE and IV.
Information Value (IV) comes from information theory; it measures the predictive power of independent variables, which is useful in feature selection.
We then transform all the independent variables using the weight of evidence (WoE) method. Based on the proportion of good applicants to bad applicants at each group level, this method measures the “strength” of grouping for differentiating good and bad risk, and attempts to find a monotonic relationship between the independent variables and the target variable.
According to Siddiqi (2006), by convention, the values of the IV statistic in credit scoring can be interpreted as follows: below 0.02 the variable is not predictive; 0.02 to 0.1 indicates weak predictive power; 0.1 to 0.3 is medium; 0.3 to 0.5 is strong; and values above 0.5 are suspicious and should be checked, as the variable may be too good to be true.
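As a quick sketch of an IV screen using the scorecard package loaded above (dropping the id and date columns is my assumption about what should be excluded):
# Bin all candidate predictors and sum each variable's bin-level IV
bins_iv <- woebin(subset(data, select = -c(loanid, var38, var41)), y = "default_flag")
iv_summary <- sapply(bins_iv, function(b) sum(b$bin_iv))
sort(iv_summary, decreasing = TRUE)  # variables ranked by information value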
The final step is to create two data sets: one for the bin transformation and another for the WoE transformation.
# Convert factor variables to numeric representation
for (col in names(data)[sapply(data, is.factor)]) {
  data[[col]] <- as.integer(data[[col]])
}
# Correlation
cor_matrix <- cor(data[, startsWith(names(data), "var")], method = "pearson")
# Find attributes that are highly correlated
highlyCorrelated <- findCorrelation(cor_matrix, cutoff = 0.7)
print(colnames(data[, startsWith(names(data), "var")])[highlyCorrelated])
#- Plot
corrplot::corrplot(cor(data[, colnames(data[, startsWith(names(data), "var")])[highlyCorrelated]], method = "pearson"),
                   method = "color",
                   type = "lower",
                   tl.col = "black",
                   tl.srt = 45)
#----------------- Uncorrelated Data for Iv computation
uncor_data <- data[, -highlyCorrelated]
#---------------------------------------------------------------------------------------------------------------------------------------------
#--------------- Variable selection by Iv
#--------------------------------------------------------------------------------------------------------------------------------------------
variables <- variables[variables %in% colnames(uncor_data)]
uncor_data[, variables] <- lapply(uncor_data[, variables], factor)
# Bin the data
binning <- woe.tree.binning(subset(uncor_data[, c(2:ncol(uncor_data), 1)], select = -loanid), 'default_flag', uncor_data, event.class = 0)
# woe.binning.plot(binning, multiple.plots = FALSE)
uncor_data_binned <- woe.binning.deploy(uncor_data, binning, min.iv.total = 0.1, add.woe.or.dum.var='woe')
uncor_data_binned <- uncor_data_binned %>%
? mutate_at(vars(starts_with("woe")), ~ . / 100)
# Tabulate the binned variables
tabulate_binning <- woe.binning.table(binning)
tabulate_binning
model_data <- subset(uncor_data_binned, select = grepl("binned|default_flag", names(uncor_data_binned)))
colnames(model_data) <- gsub("\\.", "_", colnames(model_data))
You could also create the bins manually and compute the WoE using the script below.
#-----------------------------------------------------------------------------------------------------------------------------------
#--------------------- Function to compute Woe and Iv manually
#------------------------------------------------------------------------------------------------------------------------------------
tbl_fun <- function(x, y) {
  options(scipen = 999)
  y_recode <- ifelse(y == 0, "Good", "Bad")
  Mat <- as.matrix(table(as.factor(x), as.factor(y_recode)))
  Total <- Mat[, 1] + Mat[, 2]
  Total_Pct <- percent(Total / sum(Mat), accuracy = 0.01)
  Bad_Pct <- percent((Mat[, 1] / sum(Mat[, 1])), accuracy = 0.01)
  Good_Pct <- percent((Mat[, 2] / sum(Mat[, 2])), accuracy = 0.01)
  Bad_Rate <- percent(round((Mat[, 1] / (Mat[, 1] + Mat[, 2])), 2), accuracy = 0.01)
  WOE <- round(log((Mat[, 2] / sum(Mat[, 2])) / (Mat[, 1] / sum(Mat[, 1]))), 4)
  g_b_comp <- ifelse(Mat[, 1] == Mat[, 2], 0, 1)
  IV <- round(ifelse(g_b_comp == 0, 0, ((Mat[, 2] / sum(Mat[, 2])) - (Mat[, 1] / sum(Mat[, 1]))) * (WOE / 10)), 4)
  Total_IV <- sum(IV)
  tbl <- as.data.frame(cbind(Bins = rownames(Mat), Total, Total_Pct, Good = Mat[, 2], Good_Pct, Bad = Mat[, 1], Bad_Pct, Bad_Rate, WOE, IV, Total_IV))
  rownames(tbl) <- NULL
  return(tbl)
}
tbl_fun(select(model_data, starts_with("var"), default_flag)$var8_binned,
        select(model_data, starts_with("var"), default_flag)$default_flag)
tbl_fun(select(model_data, starts_with("var"), default_flag)$var43_binned,
        select(model_data, starts_with("var"), default_flag)$default_flag)
This is the second batch of the code.
plot1 <- model_data %>%
  mutate(default_flag = ifelse(default_flag == 0, "Good", "Bad"),
         var19_binned = as.character(var19_binned)) %>%
  na.omit() %>%
  group_by(var19_binned, default_flag) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = var19_binned, y = count, fill = default_flag)) +
  geom_col(position = "stack") +
  labs(x = "Credit History", y = "Count", fill = "Good-Bad") +
  ggtitle("Credit History for var43 After Binning") +
  theme_minimal()

plot2 <- model_data %>%
  mutate(default_flag = ifelse(default_flag == 0, "Good", "Bad"),
         Bins = as.character(var19_binned)) %>%
  na.omit() %>%
  group_by(Bins, default_flag) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = default_flag, values_from = count) %>%
  mutate(Bad_Rate = Bad / (Bad + Good)) %>%
  ggplot(aes(x = Bins, y = Bad_Rate, group = 1)) +
  geom_line(size = 1.5, color = "blue") +
  geom_text(aes(label = scales::percent(Bad_Rate, accuracy = 0.01)),
            vjust = -1.2, size = 3.5, fontface = "bold") +
  labs(x = "Bins", y = "Bad Rate") +
  ggtitle("Bad Rate by Bins for Var43") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent)

combined_plot <- plot1 + plot2
# Arrange the plots side by side
combined_plot <- combined_plot +
  plot_layout(ncol = 2)
# Display the combined plot
combined_plot
Below are some of the plots generated by the scripts above.
Count of observations per bin and the corresponding bad rates per bin.
Below is the WoE plot by bins
This is the script for the plot above.
# woe_plot is assumed to be a WoE summary table (e.g. the output of tbl_fun above)
# with Bins and WOE columns
woe_plot %>%
  slice(-n()) %>%
  mutate(WOE = as.numeric(WOE)) %>%
  ggplot(aes(x = Bins, y = WOE, fill = factor(sign(WOE)))) +
  geom_col(position = "stack") +
  labs(x = "Bins", y = "WoE", fill = "WoE") +
  ggtitle("WoE plot for the Bins") +
  theme_minimal() +
  scale_fill_manual(values = c("#E7B800", "#00AFBB"),
                    labels = c("Positive", "Negative"),
                    guide = guide_legend(reverse = TRUE))
Monotonic binning means the relationship between the variable being binned and the target variable follows a consistent pattern: as the values of the independent variable increase or decrease, the bad rate consistently increases or decreases, respectively. This is achieved through coarse classing.
The wpa package can help achieve this through the script below.
IV_report(
  data = uncor_data[, -which(names(uncor_data) == "loanid")],
  predictors = NULL,
  outcome = "default_flag",
  bins = 5,
  max_var = 9,
  path = "IV report",
  timestamp = TRUE
)
The wpa package gives a WoE plot that is monotonic, as shown below. The final part would then be to combine bins 440 through 525.
Split the data into Training and Test
For modelling purposes the data is split into training and test sets: one is used for training and the other for validation. You can also explore the k-fold cross-validation technique, which does not require a single fixed split.
#-------------------------------------------------------------------------------------------------------------------------------------
#------------------ Split the data into Training and Test
# Set the seed for reproducibility
raw_data <- uncor_data_binned[, c("default_flag", sub("_(.*)", "", colnames(select(model_data, starts_with("var")))))]
bins <- woebin(raw_data,
               y = "default_flag",
               positive = "bad|1",
               method = "tree"); bins[1]
raw_data <- raw_data[,c("default_flag",names(bins))]
#--------------------------------------------------------------------------------------------------------------------------------------
set.seed(77)
# Perform stratified sampling
train_indices <- createDataPartition(subset(model_data, select = grepl("woe|default_flag", names(model_data)))$default_flag,
                                     p = 0.8, list = FALSE)
# Split the data into train and test sets based on the indices
train_data <- subset(model_data, select = grepl("woe|default_flag", names(model_data)))[train_indices, ]
test_data <- subset(model_data, select = grepl("woe|default_flag", names(model_data)))[-train_indices, ]
train_raw <- raw_data[train_indices, ]
test_raw <- raw_data[-train_indices, ]
#--------------------------------------------------------------------------------------------------------------------------------------
#-------------------------- Fit a Logistic Regression Model to WOE values
#--------------------------------------------------------------------------------------------------------------------------------------
#--------------- Model 1
colnames(train_data) <- c("default_flag", paste0(sub("^[^_]*_([^_]+)_.*$", "\\1", colnames(train_data))[-1],"_woe"))
formula <- as.formula(paste("I(default_flag == 1) ~", paste(colnames(select(train_data, ends_with("woe"))), collapse = " + ")))
formula
model_1 <- glm(formula, data = train_data, family = "binomial")
logit_model <- step(model_1, direction = "both", trace = FALSE)
summary(logit_model)
#-------------- Model 2
train_woe <- woebin_ply(train_raw, bins)
formula <- as.formula(paste("I(default_flag == 1) ~", paste(colnames(select(train_woe, ends_with("woe"))), collapse = " + ")))
formula
null <- glm(default_flag ~ 1, data = train_woe, family = binomial)
full <- glm(formula, data = train_woe, family = binomial)
bicglm <- step(null, scope=formula(full), direction = "both", k = log(nrow(train_woe)))
model <- glm(formula, family = binomial(), data = train_woe)
logistic_step <- step(model, direction = "both", trace = FALSE)
summary(logistic_step)
pred = predict(logistic_step)
y_pred_prob = predict(logistic_step, type = 'response')
After fitting the data to the model, it is important to check the following: all p-values are less than 0.05, the coefficients are positive, and there are no signs of multicollinearity. If multicollinearity is detected, you can consider dropping or combining the affected variables.
The final results look like this:
The variables used in the final model are not highly correlated.
Gini and AUC
What counts as acceptable AUC and Gini performance depends on the specific context, the nature of the problem, and the performance standards set by the domain or industry.
In general, an AUC value of 0.66 is considered reasonable and suggests that the model has some discriminatory power to distinguish between the positive and negative classes. An AUC of 0.5 represents a model with no discrimination (similar to random guessing), while an AUC of 1.0 indicates a perfect classification model. A Gini coefficient of 0.32 suggests a moderate level of discrimination by the model but indicates room for improvement.
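A sketch of how the AUC and Gini above can be computed with pROC (loaded earlier), using the fitted model's predicted probabilities on the training WoE data:
roc_obj <- pROC::roc(train_woe$default_flag, y_pred_prob)
auc_value <- as.numeric(pROC::auc(roc_obj))
gini_value <- 2 * auc_value - 1   # Gini = 2 * AUC - 1
auc_value; gini_value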
Generating a Scorecard
Convert the model coefficients into a simplified scoring system, known as the scorecard. The scorecard assigns points to each predictor variable based on their coefficients, reflecting their relative importance in predicting creditworthiness. The scaling process involves determining the range and granularity of the scorecard, such as assigning points to specific ranges of values or categories.
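Note that the scorecard package can perform this scaling directly from the bins and the fitted model. The sketch below uses illustrative scaling parameters intended to mirror the manual version that follows; check the package's odds convention against your own before relying on the exact point values.
score_card <- scorecard(bins, logistic_step,
                        points0 = 600, odds0 = 1/50, pdo = 20)
score_card$basepoints                                   # base points of the card
train_scores <- scorecard_ply(train_raw, score_card)    # total score per account
head(train_scores)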
#-------------------------------------------------------------------------------------------------------------------------------
# Manual scorecard Generation
#--------------------------------------------------------------------------------------------------------------------------------
#=---------------- Get the coefficients from the model
coefficients <- coef(logistic_step);coefficients
#=------------------ Define parameters for the scorecard
points0 <- 600
pdo <- 20
odds0 <- 50
n <- length(model$coefficients[-1]);n
factor <- pdo / log(2)
offset <- points0 - factor * log(odds0)
alpha <- coefficients[1]
BaseScore <- (offset - (factor * coefficients[1]));round(BaseScore,0)
The score card is generated using the script below.
# Create an empty data frame for the scorecard
card <- data.frame(Variable = character(),
                   Bin = character(),
                   Woe = numeric(),
                   Iv = numeric(),
                   Coef = numeric(),
                   stringsAsFactors = FALSE)
#------------- Calculate the score for each bin of each variable
suppressWarnings({
for (var in sub("^(.*?)_.*$", "\\1", colnames(train_data)[-1])) {
  var_tbl <- bins[[var]]          # binning table for this variable
  variable <- var_tbl$variable
  bin <- var_tbl$bin
  woe <- var_tbl$woe
  iv <- var_tbl$bin_iv
  coefficient <- coefficients[paste0(var, "_woe")]
  card <- rbind(card, data.frame(Variable = variable, Bin = bin, Woe = woe, Iv = iv, Coef = coefficient))
  print(var)
}
})
card$Neutral <- round(-(alpha/n)*factor+offset/n,0)
card$ScoreCntr <- round(card$Coef * card$Woe,4)
card$Score <- round((card$ScoreCntr + (alpha/n))*factor + offset/n,0)
head(card,4)
From the scorecard above, the total score per customer decreases as the probability of default increases.
When scaling the model into a scorecard, logistic regression coefficients from model fitting as well as the transformed WoE values are required. Subsequently you need to convert the score from the log-odds unit to a points system.
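A quick worked example of the scaling constants used above (points0 = 600, pdo = 20, odds0 = 50):
factor <- 20 / log(2)              # approx. 28.85 points per doubling of the odds
offset <- 600 - factor * log(50)   # approx. 487.1
# score = offset - factor * log(odds of default), so at bad:good odds of 1:50
# the score is exactly 600 points
score_at_1_in_50 <- offset - factor * log(1/50); score_at_1_in_50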
From the script above, the base score was set at 529:
BaseScore <- (offset - (factor * coefficients[1]))
This is the resulting score card.
Determination of Cutoff Score
The choice of cut-off score depends on the specific requirements, risk appetite, and business objectives of the organization implementing the credit scoring model. Different methods may be used in combination, and adjustments to the cut-off score may be made over time based on performance monitoring and feedback.
Ultimately, the determination of the cut-off score involves finding a balance between minimizing credit risk and maximizing business opportunities while considering various factors, including model performance, business objectives, and regulatory considerations.
You can use the gains (lift) table to set the cut-off score. The gains table is constructed by dividing the population into deciles or groups based on the predicted probabilities or scores from the credit scoring model, as sketched below.
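A sketch of such a gains table, built from the res tibble of scores and actual labels created further below (dplyr::ntile() splits the scored population into ten equal groups):
gain_table <- res %>%
  mutate(decile = ntile(score, 10)) %>%   # decile 1 = lowest (riskiest) scores
  group_by(decile) %>%
  summarise(accounts = n(),
            bads = sum(y_actual),
            bad_rate = mean(y_actual),
            .groups = "drop") %>%
  mutate(cum_bad_capture = cumsum(bads) / sum(bads))
gain_table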
The optimal cut-off score on the score scale can be obtained by examining the maximum vertical distance on the KS curve (i.e., the largest gap) between the cumulative good percentage and the cumulative bad percentage. This point corresponds to the optimal cut-off score. Observations with scores above the cut-off will be classified as "good," while those below it will be classified as "bad."
Choosing the cut-off score involves a trade-off between the false positive rate (classifying a "good" observation as "bad") and the false negative rate (classifying a "bad" observation as "good"). The specific choice will depend on the desired balance between these two types of errors, which may vary depending on the application and the costs associated with misclassifications.
y = train_raw$default_flag
pred = predict(logistic_step)
y_pred_prob = predict(logistic_step, type = 'response')
resp = predict(logistic_step, type = 'response')
res = tibble(logit = pred
             , y_pred_prob = y_pred_prob
             , y_actual = y
             , odds = exp(pred)
             , prob = odds / (odds + 1)
             , prob_ctrl = resp )
sc = scorecard_ply(train_raw, score_card)
res$score = sc[[1]]
res$score_ctrl = offset - factor * res$logit
res
ks.test(res$y_pred_prob, res$y_actual, alternative='greater')
df_summary <- res %>%
  arrange(score) %>%
  mutate(cum_bad = cumsum(y_actual) / sum(y_actual),
         cum_good = cumsum(1 - y_actual) / sum(1 - y_actual))

max_ks_score <- df_summary$score[which.max(abs(df_summary$cum_bad - df_summary$cum_good))]

ggplot(df_summary, aes(x = score)) +
  geom_line(aes(y = cum_bad), color = "red", size = 1, linetype = "solid") +
  geom_line(aes(y = cum_good), color = "blue", size = 1, linetype = "solid") +
  geom_vline(aes(xintercept = max_ks_score), color = "orange",
             linetype = "dashed", size = 1.5) +
  annotate("text", x = max_ks_score,
           y = max(df_summary$cum_bad, df_summary$cum_good) + 0.05,
           label = paste0("KS =", round(max(abs(df_summary$cum_bad - df_summary$cum_good)), 3)), color = "black") +
  annotate("text", x = max_ks_score,
           y = -Inf,
           label = paste0("x =", round(max_ks_score, 3)),
           vjust = -1, color = "black") +
  labs(x = "Score Cutoff", y = "Cumulative Rate", title = "KS Plot") +
  theme_minimal()
Additionally, it's important to validate the chosen cut-off score on an independent dataset or through cross-validation to ensure its generalization and performance.
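A sketch of that out-of-sample check, using the objects created earlier (bins, logistic_step, score_card and the hold-out test_raw); perf_eva() and perf_psi() are from the scorecard package loaded above:
test_woe <- woebin_ply(test_raw, bins)
test_pred <- predict(logistic_step, newdata = test_woe, type = "response")

# Compare discrimination (KS, ROC) between the training and test samples
perf_eva(pred = list(train = y_pred_prob, test = test_pred),
         label = list(train = train_woe$default_flag, test = test_woe$default_flag),
         show_plot = c("ks", "roc"))

# Population stability of the scores between the two samples
perf_psi(score = list(train = scorecard_ply(train_raw, score_card),
                      test = scorecard_ply(test_raw, score_card)),
         label = list(train = train_raw$default_flag, test = test_raw$default_flag))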