6. Credit Risk Modelling for MSMEs - feature engineering & model development
Vivek Chaturvedi
Leader in Advanced Analytics and Data Science | Retail, SME, Corporate, Treasury, Transaction Banking & Wealth Management | India & MENA | Thought leader | Public speaker | IIM Bangalore
In the last article we saw the various types of financial variables that are used to assess credit risk. In the past few years technology has evolved, and better computing power allows us to try out more combinations of variables before we finalize the ones we want to use in a probability of default model. The process of creating variables and shortlisting the useful ones is called feature engineering (the independent variables being referred to as features). In this article we will use the words feature and variable interchangeably.
Feature creation
As we have seen, variables such as gross profit margin, fixed asset turnover ratio, inventory turnover ratio etc. can be used to analyze the creditworthiness of a borrower. We go one step further and create more features (variables) from these. This step requires imagination on the part of the model developer, who has to imagine the features that should have an effect on the outcome. One example is dividing the current ratio of this year by the current ratio of last year, thereby checking whether the liquidity profile of the borrower has improved or deteriorated. I have listed a few features below. This list is not exhaustive, as the purpose is only to illustrate the process.
a. Cash and bank balance / Adjusted tangible net worth
b. Cash and bank balance / Net sales
c. Total outside liabilities / Total assets
Going on like this, we can create 300 to 500 features from financial statements.
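The feature creation described above can be sketched in a few lines of Python. This is only an illustration: the field names (cash_and_bank, adjusted_tangible_net_worth and so on) and the helper functions are hypothetical, and a real dataset will have its own naming and many more ratios.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def build_features(current, previous):
    """Build derived features from two years of raw financial-statement items.

    `current` and `previous` are dicts keyed by (hypothetical) item names.
    """
    features = {
        "cash_to_atnw": safe_ratio(
            current.get("cash_and_bank"),
            current.get("adjusted_tangible_net_worth")),
        "cash_to_sales": safe_ratio(
            current.get("cash_and_bank"), current.get("net_sales")),
        "tol_to_total_assets": safe_ratio(
            current.get("total_outside_liabilities"),
            current.get("total_assets")),
    }
    # Year-on-year change in liquidity: current ratio this year / last year.
    cr_now = safe_ratio(current.get("current_assets"),
                        current.get("current_liabilities"))
    cr_prev = safe_ratio(previous.get("current_assets"),
                         previous.get("current_liabilities"))
    features["current_ratio_yoy"] = safe_ratio(cr_now, cr_prev)
    return features
```

Returning None for an undefined ratio (missing item or zero denominator) keeps the gap visible, so that it can be counted when fill rates are measured later.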
Feature selection
The next step is to reduce the number of features by selecting a few and rejecting the others. Feature selection should be a stepwise process, allowing the model developer to analyze and evaluate features gradually.
Step 1 – Fill Rate & Univariate Gini
In this step we use two criteria to shortlist features: fill rate and univariate Gini. These are the simplest ways to screen features.
Fill rate
The first filter is fill rate. We want to keep only those features which are available for a sufficiently large proportion of the population. If a variable is a great predictor of default but is not available for many of the candidates, then it is of little use. We measure fill rate as the percentage of records for which the variable is available out of the total records, and we reject all variables whose fill rate is below a threshold. Selecting the threshold is the call of the model developer. We have required a fill rate of at least 90% for a feature to be retained.
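A minimal sketch of the fill rate filter follows, assuming the features are held as a dict that maps each feature name to its list of values, with None or NaN marking a missing value. The function names and the data layout are illustrative assumptions; the 90% default mirrors the threshold mentioned above.

```python
def fill_rate(values):
    """Share of records for which the feature is populated (not None / NaN)."""
    if not values:
        return 0.0
    # `v == v` is False only for NaN, so this also treats NaN as missing.
    filled = sum(1 for v in values if v is not None and v == v)
    return filled / len(values)


def filter_by_fill_rate(feature_table, threshold=0.90):
    """Keep only the features whose fill rate is at least `threshold`."""
    return {name: values for name, values in feature_table.items()
            if fill_rate(values) >= threshold}
```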
Univariate Gini
Gini is a measure of the predictive power of a model[1]. Univariate Gini tells us how good a variable is at predicting default all by itself. A good cut-off can be 10% or 20%, depending on the number of features created. One may use packages or libraries in R/Python to calculate Gini.
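To make the calculation concrete, here is a dependency-free sketch that uses the standard relationship Gini = 2 × AUC − 1, where AUC is the probability that a randomly chosen defaulter has a higher feature value than a randomly chosen non-defaulter. Taking the absolute value, so that features with an inverse relationship to default also score well, is an assumption made for screening purposes. The pairwise loop is chosen for readability; a library routine would be preferable on large samples.

```python
def univariate_gini(feature_values, default_flags):
    """Gini of a single feature used as a score: |2 * AUC - 1|.

    default_flags: 1 = default ("bad"), 0 = no default ("good").
    Records where the feature is missing (None) are ignored.
    """
    pairs = [(v, d) for v, d in zip(feature_values, default_flags)
             if v is not None]
    bads = [v for v, d in pairs if d == 1]
    goods = [v for v, d in pairs if d == 0]
    if not bads or not goods:
        return 0.0
    concordant = 0.0
    for b in bads:
        for g in goods:
            if b > g:
                concordant += 1.0
            elif b == g:
                concordant += 0.5  # ties count as half
    auc = concordant / (len(bads) * len(goods))
    return abs(2.0 * auc - 1.0)
```

A feature would then be retained if its Gini is above the chosen cut-off, for example `univariate_gini(values, flags) >= 0.10`.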
Step 2 – Principal Component Analysis
In the next step we conduct Principal Component Analysis (PCA) to identify features which might have a bearing on the dependent variable.
Principal Component Analysis
Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed[2]. PCA arranges the existing features in groups. Group 1 should have the highest number of features, and it indicates that those features are better predictors of default. Group 2 has fewer features, and so on. We should select features in such a way that we take the maximum from the top-ranked groups and also get features from the various categories (turnover, profitability etc.; please refer to the previous articles).
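One possible reading of the grouping described above is to assign each feature to the principal component on which it has the largest absolute loading. The sketch below does that with a hand-rolled power iteration on the correlation matrix, purely to stay dependency-free; in practice a library routine would normally be used. The function names and the assignment rule are assumptions for illustration, not necessarily the procedure used by the author. Since PCA looks only at how the features move together, the default flag does not enter this calculation.

```python
import math


def correlation_matrix(columns):
    """columns: one list of values per feature, all of equal length.
    Assumes no missing values and no constant features."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        standardized.append([(v - mean) / std for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
            for ci in standardized]


def leading_component(matrix, iterations=1000):
    """Power iteration: (eigenvalue, unit eigenvector) of a symmetric matrix."""
    k = len(matrix)
    vec = [float(i + 1) for i in range(k)]  # arbitrary non-uniform start
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(k))
                   for i in range(k)]
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return eigenvalue, vec


def group_by_component(names, columns, n_components=2):
    """Assign each feature to the component with its largest |loading|."""
    matrix = correlation_matrix(columns)
    k = len(names)
    loadings = []
    for _ in range(n_components):
        eigenvalue, vec = leading_component(matrix)
        loadings.append(vec)
        # Deflation: remove the component just found before the next pass.
        matrix = [[matrix[i][j] - eigenvalue * vec[i] * vec[j]
                   for j in range(k)] for i in range(k)]
    groups = {c: [] for c in range(n_components)}
    for idx, name in enumerate(names):
        best = max(range(n_components), key=lambda c: abs(loadings[c][idx]))
        groups[best].append(name)
    return groups
```

Group 0 corresponds to the first principal component, group 1 to the second, and so on. Features can then be picked from each group while keeping a spread across categories such as turnover and profitability.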
Step 3 – Binning and Variable Transformation
In this step we perform two exercises. We bin the variables, thus creating categorical variables from numerical ones, and we transform the variables to be used for modelling.
Binning
We put the variables in bins or class intervals and observe the “bad rate” in those bins. For a moment, let us take a diversion and use an analogy to understand the importance of binning. Imagine that there are two kids aged five and four, and you are asked whether you think they are significantly different in their ability to talk and communicate. Your answer will most likely be no. But if there are two kids aged three and two and you are asked the same question, your answer might change.
When it comes to behavior, the absolute difference in value might not matter as much as the threshold, and that is what binning captures. Whether the interest coverage ratio of a company is 1.2 or 1.3 might not be very significant information, but whether it is 1 or 0.9 might make a lot of difference to its creditworthiness.
We bin the variables based on the bad rates in those bins. We start with 20 bins and gradually merge them until we get monotonicity[3] in the bad rate. Please see the images below.
Figure 1 - Original bins
Figure 2 - Modified bins
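The merging of bins can be sketched as follows. The code starts from equal-frequency bins (20 by default) and merges neighbouring bins whenever their bad rates break the required order. This particular merging rule, and the choice of keeping whichever direction (rising or falling bad rate) leaves more bins, are assumptions for illustration; in practice the merging is often guided by judgment as well. Ties in feature values that straddle a bin boundary are not handled specially here.

```python
def initial_bins(values, flags, n_bins=20):
    """Equal-frequency bins; each bin is [low, high, n_total, n_bad]."""
    ordered = sorted(zip(values, flags))
    n = len(ordered)
    n_bins = min(n_bins, n)
    bins = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        bins.append([chunk[0][0], chunk[-1][0],
                     len(chunk), sum(flag for _, flag in chunk)])
    return bins


def bad_rate(bin_):
    return bin_[3] / bin_[2]


def merge_to_monotonic(bins, increasing=True):
    """Merge adjacent bins until bad rates are strictly monotonic."""
    merged = []
    for b in bins:
        merged.append(list(b))
        # Merge backwards while the last two bins break the required order.
        while len(merged) > 1 and (
                (increasing and bad_rate(merged[-2]) >= bad_rate(merged[-1])) or
                (not increasing and bad_rate(merged[-2]) <= bad_rate(merged[-1]))):
            last = merged.pop()
            merged[-1][1] = last[1]    # extend the upper bound
            merged[-1][2] += last[2]   # add the record counts
            merged[-1][3] += last[3]   # add the bad counts
    return merged


def monotonic_bins(values, flags, n_bins=20):
    """Try both directions and keep the one that preserves more bins."""
    start = initial_bins(values, flags, n_bins)
    rising = merge_to_monotonic(start, increasing=True)
    falling = merge_to_monotonic(start, increasing=False)
    return rising if len(rising) >= len(falling) else falling
```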
Variable Transformation
In my previous articles I have argued that we should use derived variables instead of raw variables for model development. The simple logic is that it does not matter how much current assets you have; what matters is the proportion of current assets to current liabilities. I then extended that argument to the creation of CHAID variables. I now take that argument one step further and say that what matters for a feature to be useful is the share of “bads” contained in the bin relative to the share of “goods”. The logarithm of this ratio is referred to as Weight of Evidence (WOE), and we use it for the model development.
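As a rough sketch, WOE for a bin can be computed as the natural logarithm of (share of all bads that fall in the bin) divided by (share of all goods that fall in the bin). The orientation here, bads over goods, follows the description above; the opposite sign convention is also common in practice. The `smoothing` parameter is an assumption added to avoid taking the logarithm of zero when a bin has no goods or no bads.

```python
import math


def weight_of_evidence(bins, smoothing=0.5):
    """Return one WOE value per bin.

    bins: list of (n_good, n_bad) counts, one tuple per bin.
    WOE = ln(share of bads in the bin / share of goods in the bin).
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    k = len(bins)
    woe = []
    for good, bad in bins:
        bad_share = (bad + smoothing) / (total_bad + smoothing * k)
        good_share = (good + smoothing) / (total_good + smoothing * k)
        woe.append(math.log(bad_share / good_share))
    return woe
```

Each record is then assigned the WOE of the bin it falls into, and that value replaces the raw ratio as the model input. With this orientation, a positive WOE marks a bin that is riskier than the portfolio average.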
Model development
The new variables thus developed can be used as independent variables in a logistic regression model. This is a very standard process, and therefore I will not describe it in detail here.
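For readers who would still like a concrete picture, here is a bare-bones sketch of a logistic regression fitted by batch gradient descent on WOE-transformed inputs. It is only meant to show the mechanics; the function names, learning rate and number of epochs are illustrative assumptions, and in practice a statistical package would be used.

```python
import math


def predict_pd(weights, x):
    """Probability of default for one record; weights[0] is the intercept."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit [intercept, coef_1, ...] by batch gradient descent.

    rows: list of feature lists (e.g. WOE values); labels: 1 = default.
    """
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        gradient = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            error = predict_pd(weights, x) - y
            gradient[0] += error
            for j, xj in enumerate(x):
                gradient[j + 1] += error * xj
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights
```

With WOE defined as bads over goods, a higher WOE should translate into a higher predicted probability of default, so the fitted coefficients are expected to be positive.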
[1] https://www.crisil.com/content/dam/crisil/our-analysis/publications/default-study/crisil-ratings-annual-default-and-ratings-transition-study-fy-2022.pdf
[2] https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186
[3] https://www.dhirubhai.net/pulse/understanding-financial-statements-2-vivek-chaturvedi/