Credit Risk Modeling

Hope you are all doing well and staying safe. There are many ways to build a model, and no single approach or technique is mandatory. What follows is a step-wise summary intended as an overview for beginners: I have tried to focus on the broad steps of building a model and to keep each point brief. The code snippets included are only small illustrative sketches in Python; the exact implementation would depend on your choice of programming language and data.

What follows is what I have gathered about the topic through my academic endeavors. I am not an expert modeler, albeit an evolving one, so any suggestion, criticism or guidance on the topic is appreciated and encouraged.

Following is a generic process to go about the task (assuming a basic understanding of the ML process):

PD Model

As mentioned earlier, there are many modeling methods, but under the Basel framework (and for most national regulators) the logistic regression model is the standard choice for estimating PD. This is because of the simplicity of the model and the control and visibility it offers in the process, as compared to ensemble methods. Steps:

  1. Understanding the raw data/variables and identifying the target variable states. Different loans in the portfolio can be in various states, from 'Performing / Current' to '30 DPD', '60 DPD' and so on. It should be clearly established which states are considered 'Default'. Generally, loans having any DPD (days past due) status are considered 'Default', but this depends on the institution's preferences and modeling assumptions.

[Image: loan status distribution and default % calculation]

2. Check the default % in the portfolio, i.e. defaulters to total observations (calculated in the image above). To balance the dataset (moving towards comparable numbers of defaulters and non-defaulters, for model efficiency), we would have to drop a considerable number of records, so that the proportion changes from 11% to about 40-50%. For example, records having missing entries for key variables like income, credit history etc. can be considered for dropping.

Dropped records can later be combined with the defaulter dataset again to create a separate model, and these different models can then be combined (a topic for another article). Another way is to create enough synthetic records closely resembling the current defaulter dataset, to balance the number of defaulters vs non-defaulters. A minimal undersampling sketch follows below.
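For illustration, here is a minimal Python sketch of balancing by undersampling the non-defaulters. It assumes a pandas DataFrame called `loans` with a binary `default_flag` column; the names and the 45% target share are assumptions for the example, not taken from the article's data.

```python
import pandas as pd

def undersample(loans: pd.DataFrame, target_default_share: float = 0.45,
                random_state: int = 42) -> pd.DataFrame:
    """Undersample non-defaulters so defaulters make up ~target_default_share."""
    defaulters = loans[loans["default_flag"] == 1]
    non_defaulters = loans[loans["default_flag"] == 0]

    # Number of non-defaulters to keep so that defaulters form the desired share.
    n_keep = int(len(defaulters) * (1 - target_default_share) / target_default_share)
    kept = non_defaulters.sample(n=min(n_keep, len(non_defaulters)),
                                 random_state=random_state)

    # In practice, records with missing key fields (income, credit history, etc.)
    # would be dropped first; random sampling only tops up the reduction if needed.
    return pd.concat([defaulters, kept]).sample(frac=1, random_state=random_state)
```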


3. Pre-processing the data for modeling and creating the training and test sets. This involves checking for missing values, converting dates into a 'days since origination' variable, creating dummy variables for categorical variables, converting string variables to numerical ones, etc. During this process, we can identify additional records to drop to balance the dataset (from point 2). The target variable should be kept as a separate dataframe/series.

4. Creating dummy variables for the categorical as well as the continuous variables. Continuous variables first have to be divided into 5-10 intervals, as in the sketch below.
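A small sketch of fine classing and dummy creation on synthetic data; the column names (`annual_income`, `grade`) and the choice of 10 quantile bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data standing in for the loan portfolio (names are assumptions).
rng = np.random.default_rng(0)
loans = pd.DataFrame({
    "annual_income": rng.normal(60_000, 20_000, 1_000).clip(10_000),
    "grade": rng.choice(list("ABCDEFG"), 1_000),
})

# Fine classing: ~10 equal-population intervals for the continuous variable.
loans["income_bin"] = pd.qcut(loans["annual_income"], q=10, duplicates="drop")

# Dummy variables for the categorical feature and the binned continuous one.
X = pd.get_dummies(loans[["grade", "income_bin"]], prefix=["grade", "income"])
```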

5. Exploring the significance of variables by checking the Weight of Evidence (WoE) and Information Value (IV). This is, in itself, a long process, which includes building and automating functions for visualizing whether the classes (categorical variables) or intervals (continuous variables) are able to distinguish between defaulters and non-defaulters.

WoE = log [ (good customers in a category / total good customers) / (bad customers in that category / total bad or defaulted customers) ]

IV = sum over all categories of { (Propn_good in that category – Propn_bad in that category) * WoE of that category }
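A sketch of a reusable WoE/IV helper, assuming a DataFrame with a binary target column (here called `default_flag`, with 1 = bad); the small smoothing constant is an assumption to avoid division by zero for empty classes.

```python
import numpy as np
import pandas as pd

def woe_iv(df: pd.DataFrame, feature: str, target: str = "default_flag"):
    """WoE/IV table for one categorical (or binned) feature; target: 1 = bad, 0 = good."""
    grouped = df.groupby(feature, observed=True)[target].agg(n="count", n_bad="sum")
    grouped["n_good"] = grouped["n"] - grouped["n_bad"]
    grouped["prop_good"] = grouped["n_good"] / grouped["n_good"].sum()
    grouped["prop_bad"] = grouped["n_bad"] / grouped["n_bad"].sum()

    # Small constant avoids log(0) / division by zero for classes with no goods or bads.
    grouped["woe"] = np.log((grouped["prop_good"] + 1e-6) / (grouped["prop_bad"] + 1e-6))
    grouped["iv_part"] = (grouped["prop_good"] - grouped["prop_bad"]) * grouped["woe"]
    return grouped.sort_values("woe"), grouped["iv_part"].sum()

# Usage (assuming `loans` carries the feature and the target):
# woe_table, iv = woe_iv(loans, "grade")
```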

[Image: sample WoE table for a feature - Grades]

Based on the WoE and IV, select the important variables which have a clear deterministic effect on the target variable. Based on the initial run of WoE, adjustment of the initial intervals (continuous variables) and combining of classes (categorical variables) might be required, to enhance the prediction power of similar classes and, at the same time, to reduce the number of categories and intervals to a manageable number. For example, in images 2 to 4 below, the tail and front end of the graphs are progressively removed to enhance the differentiating power of address zones that are similar in terms of WoE.

[Images 2 to 4: WoE by address zone, with outlier zones progressively excluded]

Excluding outliers for analysis - only after excluding (temporarily, for analysis purposes) the NE, IA and ID zones (3rd image in this step), and then WY and DC (4th image in this step), could we actually visualize the underlying differences among the rest of the regions in terms of WoE.

Now we can combine NM through MO, LA through CA, etc., based on where we see inflections in the WoE curve - and thereby reduce the number of categories and improve prediction power. The outliers can form one category of their own if they account for only a minuscule % of the population compared to the rest (ref: the WoE table above for % population).

Similarly, the intervals for continuous variables have to be readjusted from the initial ranges to ones with similar characteristics. The only difference here is that there is a particular order (lower to higher) in the numbers, so we can't club together distant intervals. For example, defaults in a lower income category (say 20,000-40,000) can't be combined with a higher category (>150,000), as the other respective variables/identifiers might differ: these are clearly two different groups. In the following image, the 3rd interval has a similar WoE to the 15th interval, but there is a significant gap between their income groups, so they would not be clubbed together. Here, clubbing has to follow the order - we might want to keep the first 4 categories together, or perhaps extend up to the 6th category if the % population in these 4-6 categories is not individually significant (say below 5%). Then club the 7th to 12th, after which we see an inflection point, and so on. A small coarse-classing sketch follows the image below.

[Image: WoE by income interval]
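A small sketch of coarse classing for an ordered variable. The fine-to-coarse mapping below is purely illustrative (the bin edges and class names are assumptions); the point is simply that only adjacent intervals are merged, respecting the low-to-high order and stopping a class at each inflection in the WoE curve.

```python
import pandas as pd

# Fine-to-coarse mapping chosen by eye from the WoE plot (illustrative labels).
coarse_map = {
    "(20000, 30000]": "inc_20k_50k", "(30000, 40000]": "inc_20k_50k", "(40000, 50000]": "inc_20k_50k",
    "(50000, 65000]": "inc_50k_90k", "(65000, 90000]": "inc_50k_90k",
    "(90000, 150000]": "inc_90k_plus", "(150000, inf]": "inc_90k_plus",
}

fine_bins = pd.Series(["(20000, 30000]", "(65000, 90000]", "(150000, inf]"])
coarse_bins = fine_bins.map(coarse_map)   # -> inc_20k_50k, inc_50k_90k, inc_90k_plus
```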

6. Train the logistic model on the training set (keeping the test set aside for evaluation) --> create the summary table of the variables' coefficients along with the intercept. (After some refining through the next steps, the remaining variables in this table will serve as the basis for the scorecard.) A minimal training sketch follows.
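A minimal training sketch with scikit-learn; the dummy-coded feature matrix here is synthetic, standing in for the WoE-selected variables, and the 80/20 split and solver settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative dummy-coded features standing in for the WoE-selected variables.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.integers(0, 2, size=(2_000, 5)),
                 columns=[f"var_{i}" for i in range(5)])
y = (rng.random(2_000) < 0.4).astype(int)          # ~40% defaulters after balancing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)

# Summary table of intercept + coefficients: the starting point for the scorecard.
summary = pd.DataFrame({
    "feature": ["intercept"] + list(X.columns),
    "coefficient": np.concatenate([model.intercept_, model.coef_.ravel()]),
})
print(summary)
```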


7. Scrutinizing the list of variables further, if required. This can be done by many methods, including:

  • p-values: variables with high p-values (statistically insignificant) can be dropped, while variables with low p-values are retained (see the sketch below)
  • dropping variables whose significance is inexplicable / not relevant to the business problem
  • step-wise regression, etc.
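One way to scrutinize variables by p-value is to refit the model with statsmodels, which reports p-values directly. The synthetic `X_train` / `y_train` and the 5% cut-off are assumptions for the sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative data; in practice X_train / y_train come from the earlier split.
rng = np.random.default_rng(2)
X_train = pd.DataFrame(rng.integers(0, 2, size=(2_000, 5)),
                       columns=[f"var_{i}" for i in range(5)])
y_train = (rng.random(2_000) < 0.4).astype(int)

logit = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=False)
print(logit.summary())                       # coefficients with their p-values

# Keep statistically significant variables (low p-values); drop the rest.
pvals = logit.pvalues.drop("const")
keep = pvals[pvals < 0.05].index.tolist()    # variables to retain for re-training
```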

8. Train the model again using the reduced set of variables

9. Predict the target for the test set using the trained model


10. Decide on the cut-off probability which differentiates the defaulter vs non-defaulter classification. This depends on the risk tolerance guidelines of the business and has a direct bearing on the provision numbers and the riskiness of the portfolio.

11. Based on the cut-off, classify the predictions into 0 and 1 for the target (defaulters vs non-defaulters). Observe that some of the previous predictions get reclassified in accordance with the threshold (the cut-off probability mentioned in the point above).


12. Validation / testing of the model: how good is the model at predicting the classification correctly? (A metrics sketch follows the images below.)

  • Confusion matrix, accuracy, KS, Gini score, AUC / AUROC, etc.
  • Gini curve – the goal is to capture a high % of defaulters early on (within the first few deciles of the cumulative population)
  • KS curve – the model's estimates should be able to differentiate between the good and the bad customers; the farther apart the two curves on the KS plot, the better the predictive power.

[Images: Gini curve and KS curve]
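As an illustration, here is a minimal Python sketch of the cut-off classification and the headline validation metrics (confusion matrix, accuracy, AUROC, Gini, KS). The `y_test` / `pd_hat` arrays are synthetic stand-ins for the test labels and predicted PDs, and the 0.5 cut-off is an assumption; in practice these come from the trained model and the business threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

# Illustrative arrays; in practice y_test and the model's predicted PDs are used.
rng = np.random.default_rng(3)
y_test = rng.integers(0, 2, 5_000)
pd_hat = np.clip(0.3 * y_test + 0.7 * rng.random(5_000), 0, 1)   # noisy but informative scores

cutoff = 0.5                                   # business-chosen threshold (assumption)
y_pred = (pd_hat >= cutoff).astype(int)

print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))

auc = roc_auc_score(y_test, pd_hat)
gini = 2 * auc - 1                             # Gini / accuracy ratio from the AUROC
fpr, tpr, thresholds = roc_curve(y_test, pd_hat)
ks = np.max(tpr - fpr)                         # KS statistic: max gap between the two curves
print(f"AUROC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")
```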

13. Update the summary table with the coefficients of the new model (ref: points 6 and 8) and map it to a scorecard (sketch below).

  • Include the categories that were dropped while creating dummy variables, with their p-values marked 'N/A'
  • Map the coefficients in the summary table to a range of scores, say 300-850, by interpolation --> the scorecard
  • Max score, min score = 850, 300
  • Score for each variable = coefficient * (max_score - min_score) / (max_sum_coef - min_sum_coef), where max_sum_coef and min_sum_coef are the highest and lowest possible sums of coefficients for any applicant (the intercept plus the best / worst coefficient of each feature)
  • Intercept in the scorecard = min_score + (intercept_coef – min_sum_coef) * (max_score - min_score) / (max_sum_coef - min_sum_coef)
  • The intercept has been treated this way so as to give a base minimum score to the worst applicant – on top of which scores for progressively better candidates are built by adding their (better) scores for each feature
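Below is a hedged sketch of the 300-850 mapping described above. The summary table, its column names and the coefficient values are made up for illustration; only the interpolation formulas follow the article.

```python
import pandas as pd

# Illustrative summary table: one row per dummy variable plus the intercept.
# The reference (dropped) dummies carry coefficient 0 and should also appear here.
summary = pd.DataFrame({
    "feature": ["intercept", "grade_A", "grade_D", "income_low", "income_high"],
    "original_feature": ["intercept", "grade", "grade", "income", "income"],
    "coefficient": [-1.2, 0.8, -0.5, -0.4, 0.6],
})

min_score, max_score = 300, 850
intercept = summary.loc[summary["feature"] == "intercept", "coefficient"].iloc[0]
per_feature = summary[summary["feature"] != "intercept"].groupby("original_feature")["coefficient"]

# Best / worst achievable sums of log-odds across all features (including the intercept).
max_sum_coef = intercept + per_feature.max().sum()
min_sum_coef = intercept + per_feature.min().sum()
scale = (max_score - min_score) / (max_sum_coef - min_sum_coef)

summary["score"] = summary["coefficient"] * scale
# The intercept row carries the base score given to the worst possible applicant.
summary.loc[summary["feature"] == "intercept", "score"] = (
    min_score + (intercept - min_sum_coef) * scale
)
summary["score"] = summary["score"].round()
print(summary)
```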


[Image: snapshot of the resultant scorecard (truncated)]

14. Calculate the approval / rejection rate by creating a function which runs through the FPR, TPR and thresholds obtained while checking the ROC curve (validation, point 12); a sketch follows the list below.

  • Combine the above three parameters as columns in a dataframe
  • Map each threshold to a score (again by interpolation), where the maximum threshold corresponds to the maximum score and the other thresholds follow the spread between the (max - min) score and the (max - min) sum of coefficients across features (the best candidate having the best coefficients, and vice versa). The calculation is:
  • Threshold score = { log [ threshold / (1 – threshold) ] - min_sum_coef } * [ (max_score - min_score) / (max_sum_coef - min_sum_coef, i.e. the sum of the highest coefficients in each category minus the sum of the lowest) ] + min_score
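A sketch of the threshold-to-score mapping and the approval counts, using scikit-learn's roc_curve. All inputs (`y_test`, `pd_hat`, `min_sum_coef`, `max_sum_coef`) are illustrative assumptions; in practice they come from the PD model and the scorecard step above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Illustrative inputs (assumptions for the sketch).
rng = np.random.default_rng(4)
y_test = rng.integers(0, 2, 5_000)
pd_hat = np.clip(0.3 * y_test + 0.7 * rng.random(5_000), 0, 1)
min_score, max_score = 300, 850
min_sum_coef, max_sum_coef = -4.0, 3.0

fpr, tpr, thresholds = roc_curve(y_test, pd_hat)
df = pd.DataFrame({"threshold": thresholds, "fpr": fpr, "tpr": tpr})

# roc_curve's first threshold is a sentinel above 1; keep thresholds inside (0, 1)
# before taking log-odds.
df["threshold"] = df["threshold"].clip(lower=1e-9, upper=1 - 1e-9)

# Map each threshold PD back to the score scale via its log-odds.
log_odds = np.log(df["threshold"] / (1 - df["threshold"]))
df["score"] = (log_odds - min_sum_coef) * (max_score - min_score) / (max_sum_coef - min_sum_coef) + min_score

# Approval rate if every applicant with predicted PD below the threshold is approved.
df["n_approved"] = [(pd_hat < t).sum() for t in df["threshold"]]
df["approval_rate"] = df["n_approved"] / len(pd_hat)
print(df.head())
```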

[Image: dataframe of thresholds, mapped scores and approval counts]

For each threshold probability, we can progressively see how many applicants would be approved for a loan based on their predicted PD. For example, if we keep the threshold probability at only a 1% chance of default, approximately only 37 loans would be approved from our dataset (the figure of 37 loans can be seen in the 4th entry of the image above). This is similar to how we reclassified predictions in point 9 vs point 11.

We have already settled on a PD cut-off for classifying default vs non-default in point 11. We would, or at least could, want to see the approval numbers with respect to PD before finalizing that basis. There is also a consideration of how this number affects profitability.

Note: Model performance can be improved by considering:

  • whether any change in the threshold probability can be accommodated
  • including more data for modeling
  • including more relevant variables, or excluding some which do not seem to have a direct impact on the target variable (false significance)

LGD Model

1. The LGD model uses a combination of logistic and linear models to predict LGD / the recovery rate (a sketch follows the points below):

  • First, a logistic model is used to predict/separate loans with a zero recovery rate from those with a non-zero recovery rate.
  • Second, for the loans predicted to have a non-zero recovery rate, a linear model is used to predict the recovery rate itself.
  • Combine the results from the logistic and linear models to obtain recovery rate predictions in one column, with a floor at 0 and a ceiling at 1.
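A minimal sketch of the two-stage LGD approach on synthetic data; the feature names, the simulated recovery rates and the choice of plain LogisticRegression / LinearRegression are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# Illustrative defaulted-loan data (names and values are assumptions).
rng = np.random.default_rng(5)
X = pd.DataFrame(rng.normal(size=(3_000, 4)), columns=[f"var_{i}" for i in range(4)])
recovery_rate = np.clip(rng.beta(0.5, 2, 3_000) * (rng.random(3_000) > 0.4), 0, 1)

has_recovery = (recovery_rate > 0).astype(int)

# Stage 1: separate zero from non-zero recoveries.
stage1 = LogisticRegression(max_iter=1_000).fit(X, has_recovery)
# Stage 2: size the recovery rate on the non-zero subset only.
stage2 = LinearRegression().fit(X[has_recovery == 1], recovery_rate[has_recovery == 1])

# Combined prediction: stage-1 flag (0/1) times stage-2 recovery rate, floored and capped.
rr_hat = np.clip(stage1.predict(X) * stage2.predict(X), 0, 1)
lgd_hat = 1 - rr_hat
```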

2. The target variable, recovery rate, can be calculated as:

  • 1 - LGD; or
  • recoveries after the default / funded amount

3. The dataset is the same, except that only loans in defaulted/delinquent status should be considered for modelling LGD. Start by considering only the variables which were used in building the PD model.

  • All the outcome-related variables like CCF, recovery rate and default status should be dropped from the training input set

4. Validate the LGD model by:

  • 1st model (classification model): AUC, ROC, confusion matrix
  • 2nd model (linear regression): correlation, MSE, distribution plot (distplot) of the residuals (actuals vs predictions)

[Images: distribution plots for LGD model validation]

We have considered only defaulters in the data, so RR = 0 has the highest concentration, indicating that most defaulters have RR = 0, while progressively fewer defaulters have higher RRs. Negative values can occur because of timing differences in recording various parameters, like utilization vs limit, or outstanding vs repayment, etc. Limit negative values to zero, and values > 1 to 1.

[Image: snippet of predictions by the first LGD model]

[Image: snippet of predictions by the second LGD model]

Note that RR = 0 where the first LGD model predicted no recovery. We have forced the second model's RR predictions for these observations to zero by multiplying the predictions of the 1st model by those of the 2nd model. This is because the linear model would predict some value based on the inputs, but these records have a higher chance of zero recovery, as differentiated in the first model. From RR we also arrive at LGD - our target parameter.

LGD modelling follows the same steps as PD, except for the choice of models – logistic and linear, used consecutively. To increase the accuracy of the model, we may want to consider including more variables from the original dataset (and not only the ones used to model PD). This can be based on the p-values generated afresh for the LGD model.

EAD model

1. EAD can be modelled using linear regression, as it is a direct estimation of exposures (a sketch follows point 5 below).

  • All the records (not only defaults/delinquents) are used in the modelling
  • All the outcome-related variables like CCF, recovery rate and default status should be dropped from the training set
  • Start by considering only the variables which were used in building the PD model

2. The target variable should be CCF (credit conversion factor), which is basically the expected amount to be utilized in future by a particular customer (unutilized amount as on date * probability of utilization).

  • EAD = funded amount * CCF

3. Train the model and predict values

4. Validate the model by:

  • Checking the correlation between the predicted and actual test-set values
  • Visualizing and observing the distribution of the residuals (predictions – test set) --> this should be centred close to zero

5. Check for anomalies in the descriptive stats of the predictions, such as negative values.

  • Limit the negative values to zero
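A minimal sketch of the EAD/CCF regression and its checks on synthetic data; the feature names and the simulated CCF values are assumptions, and EAD would then be the funded amount times the predicted CCF.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data (names and the CCF values are assumptions for the sketch).
rng = np.random.default_rng(6)
X = pd.DataFrame(rng.normal(size=(3_000, 4)), columns=[f"var_{i}" for i in range(4)])
ccf = np.clip(0.4 + 0.1 * X["var_0"] + 0.05 * rng.normal(size=3_000), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(X, ccf, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

ccf_hat = np.clip(model.predict(X_test), 0, None)   # floor negative predictions at zero
residuals = y_test - ccf_hat

print("correlation:", np.corrcoef(y_test, ccf_hat)[0, 1])
print("residual mean:", residuals.mean())           # should sit close to zero
# ead_hat = funded_amount_test * ccf_hat            # EAD from the predicted CCF (illustrative)
```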


Tying it all together – Estimating Expected Loss

1. Use the PD, LGD and EAD models on the whole original dataset to obtain estimated columns for each parameter.

2. Create the EL (expected loss) column = PD column * LGD column * EAD column.

3. Calculate EL% = total EL amount divided by the total funded amount of the dataset (portfolio). A short sketch follows.
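A small sketch of tying the three estimates together; the tiny portfolio and the column names are assumptions for illustration.

```python
import pandas as pd

# Illustrative portfolio with fitted PD, LGD and CCF estimates (values assumed).
portfolio = pd.DataFrame({
    "funded_amount": [10_000, 25_000, 5_000],
    "pd_hat":        [0.08,   0.03,   0.15],
    "lgd_hat":       [0.65,   0.40,   0.80],
    "ccf_hat":       [0.70,   0.55,   0.90],
})

portfolio["ead_hat"] = portfolio["funded_amount"] * portfolio["ccf_hat"]
portfolio["el"] = portfolio["pd_hat"] * portfolio["lgd_hat"] * portfolio["ead_hat"]

el_pct = portfolio["el"].sum() / portfolio["funded_amount"].sum()
print(f"Expected loss as % of portfolio: {el_pct:.2%}")
```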



EL is used as the provision against the loan assets and represents the portion of the portfolio on the balance sheet which may become delinquent. In my understanding, the observed ratio of expected loss (EL / funded amount of the portfolio) is typically 2% to 7%. This number also impacts profitability directly through the P&L, and thus becomes a key figure to consider while forming risk policies. EL% is used, in addition to other factors, to adopt a stance on credit policy – aggressive (riskier portfolio and better margins) or risk-averse (safer loans and lower margins).

I hope the above generic summary / guide helped you understand the process. Please let me know if any point is unclear or incorrect, or drop a comment on how you liked the post and any suggestions for improvement.

Check out my next article on Model Monitoring for Credit Card / Retail Exposures (using PSI).

More on other stages and methods in credit risk management, model monitoring, etc. in later posts. Ciao.

PS: Be safe and practice precautions. COVID is not over yet.
