Credit Risk Modeling
Hope you are all doing well and staying safe. There are many ways to go about building a model, and no single approach or technique is mandatory. What follows is a step-wise summary meant to give beginners an overview of the task. I have tried to focus on the broad steps of building a model and to keep the points brief. Any code snippets below are only illustrative sketches, since the specifics depend on the choice of programming language and dataset.
I have gathered what I know about the topic through my academic endeavours. I am not an expert modeler, albeit an evolving one, so any suggestion, criticism or guidance on the topic is appreciated and encouraged.
Following is a generic process to go about the task (assuming a basic understanding of the ML process):
PD Model
As mentioned earlier, there are many modeling methods, but Basel (and other national regulators) prescribe the logistic model for estimating PD. This is because of the simplicity of the model and the control and visibility it offers in the process, as compared to ensemble methods. Steps:
2. Check the default % in the portfolio, i.e. defaulters to total observations (calculated in the image above). Now, to balance the dataset (to have a comparable number of defaulters and non-defaulters, for model efficiency), we would have to drop a considerable number of records, so that the proportion changes from 11% to about 40-50%. For e.g., records having missing entries for key variables like income, credit history etc. can be considered for dropping.
Dropped records can later be combined with the defaulter dataset again to create a separate model, and these different models can then be combined (a topic for another article). Another way is to create enough dummy records closely resembling the current defaulter dataset, to balance the number of defaulters vs non-defaulters in the dataset.
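As a minimal sketch (not the author's exact code), the snippet below checks the default rate and rebalances by simple random undersampling of non-defaulters. The file name, the 'target' column and its 1 = defaulter encoding are assumptions.

```python
import pandas as pd

df = pd.read_csv("loans.csv")               # hypothetical input file

# Default rate of the portfolio (assumes target: 1 = defaulter, 0 = non-defaulter)
default_rate = df["target"].mean()
print(f"Default rate: {default_rate:.1%}")

# Undersample non-defaulters so defaulters form roughly 45% of the sample
defaulters = df[df["target"] == 1]
non_defaulters = df[df["target"] == 0].sample(
    n=int(len(defaulters) * 1.2), random_state=42
)
balanced = pd.concat([defaulters, non_defaulters]).sample(frac=1, random_state=42)
print(f"Balanced default rate: {balanced['target'].mean():.1%}")
```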
3. Pre-processing the data for modeling and creating the training and test sets. This involves handling missing values, converting dates to a 'days since origination' variable, creating dummy variables for categorical features, converting string variables to numerical ones, etc. During this process, we can identify additional records to drop to balance the dataset (from point 2). The target variable should be kept as a separate dataframe.
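A minimal sketch of this preprocessing step, continuing from the balanced sample above; the column names ('issue_d', 'grade', 'home_ownership') and the date format are assumptions, not taken from the article.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

snapshot_date = pd.Timestamp("2020-12-01")           # assumed reference date

# Convert the origination date into a numeric 'days since origination' feature
balanced["issue_d"] = pd.to_datetime(balanced["issue_d"], format="%b-%y")
balanced["days_since_origination"] = (snapshot_date - balanced["issue_d"]).dt.days

# Dummy variables for the categorical features; the target is kept separately
X = pd.get_dummies(
    balanced.drop(columns=["target", "issue_d"]),
    columns=["grade", "home_ownership"],
    drop_first=True,
)
y = balanced["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```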
4. Creating dummy variables for categorical as well as continuous variables. Continuous variables would first have to be divided into 5-10 intervals.
5. Exploring the significance of variables by checking the Weight of Evidence (WoE) and Information Value (IV). This, in itself, is a long process, which includes building and automating a function for visualizing whether the classes (categorical variables) or intervals (continuous variables) are able to distinguish between defaulters and non-defaulters.
WoE = ln( proportion of good customers in a category, out of total good customers / proportion of bad customers in that category, out of total bad (defaulted) customers )
IV = sum over all categories of { (Propn_good in a category - Propn_bad in that category) * WoE of that category }
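A minimal sketch of computing a WoE / IV table for one feature, assuming a dataframe with the feature column and a binary 'target' (1 = defaulter); the grouping helper below is illustrative, not the author's automated function.

```python
import numpy as np
import pandas as pd

def woe_iv_table(df: pd.DataFrame, feature: str, target: str = "target") -> pd.DataFrame:
    # Per-category counts of good (non-default) and bad (default) borrowers
    grouped = df.groupby(feature)[target].agg(n_obs="count", n_bad="sum")
    grouped["n_good"] = grouped["n_obs"] - grouped["n_bad"]

    # Proportion of each category within all goods / all bads
    grouped["prop_good"] = grouped["n_good"] / grouped["n_good"].sum()
    grouped["prop_bad"] = grouped["n_bad"] / grouped["n_bad"].sum()

    # WoE = ln(prop_good / prop_bad); IV sums the weighted WoE contributions
    grouped["woe"] = np.log(grouped["prop_good"] / grouped["prop_bad"])
    grouped["iv_contrib"] = (grouped["prop_good"] - grouped["prop_bad"]) * grouped["woe"]
    grouped["pct_population"] = grouped["n_obs"] / grouped["n_obs"].sum()
    return grouped.sort_values("woe")

# Example usage (assumed 'grade' column):
# table = woe_iv_table(balanced, "grade")
# print("IV:", table["iv_contrib"].sum())
```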
Sample WoE table for a feature - Grades
Based on the WoE and IV, select the important variables which have a clear discriminating effect on the target variable. After the initial run of WoE, adjustment of the initial intervals (continuous variables) and combining of classes (categorical variables) might be required to enhance the prediction power of similar classes and, at the same time, to reduce the number of categories and intervals to a manageable number. For e.g., in the following images 2 to 4, the tail and front end of the graphs are progressively removed to enhance the differentiating power of address zones that are similar in terms of WoE.
Excluding outliers for analysis - only after excluding (temporarily, for analysis purposes) the NE, IA and ID zones (3rd image in this step) and then WY and DC (4th image in this step) could we actually visualize the underlying differences among the rest of the regions in terms of WoE.
Now we can combine NM through MO, LA through CA etc., based on where we see inflections in the WoE curve, and thereby reduce the number of categories and improve prediction power. Outliers can be kept as one category if they hold only a minuscule % of the population compared to the rest (Ref: WoE table above for % population).
Similarly, intervals for continuous variables have to be readjusted from the initial ranges to ones with similar characteristics. The only difference here is that the numbers have a particular order (lower to higher), so we can't club non-adjacent intervals. For e.g., defaults in a lower income category (say 20,000-40,000) can't be combined with a higher category (>150,000), as the other respective variables/identifiers might differ; clearly, these are two different groups. In the following image, the 3rd interval has a similar WoE to the 15th interval, but there is a significant gap in their income groups, so they would not be clubbed together. Here, clubbing has to follow the order: we might keep the first 4 categories together, or perhaps extend up to the 6th category, if the % population in these 4/6 categories is not individually significant (say 5%). Then club the 7th to the 12th, after which we see an inflection point, and so on.
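A minimal sketch of this fine / coarse classing for a continuous variable, reusing the WoE helper above; the column name 'annual_inc' and the coarse cut points are purely illustrative assumptions.

```python
import pandas as pd

# Fine classing: split an assumed continuous column into 10 equal-width intervals
balanced["annual_inc_bin"] = pd.cut(balanced["annual_inc"], bins=10)
woe_income = woe_iv_table(balanced, "annual_inc_bin").sort_index()  # keep income order
print(woe_income[["pct_population", "woe"]])

# Coarse classing: merge adjacent intervals between WoE inflection points.
# These edges are hypothetical, chosen only to show the idea.
coarse_edges = [0, 40_000, 80_000, 150_000, float("inf")]
balanced["annual_inc_coarse"] = pd.cut(balanced["annual_inc"], bins=coarse_edges)
```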
6. Train the logistic model on the training set created above and build the summary table of the coefficients of the variables along with the intercept. (After some refining through the next steps, the remaining variables in this table will serve as the basis for the scorecard.)
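A minimal sketch of this step, continuing from the preprocessing sketch above; statsmodels is used here simply because it also returns p-values, which are useful in step 7.

```python
import pandas as pd
import statsmodels.api as sm

# Fit the logistic model on the training set and tabulate coefficients and p-values
X_train_const = sm.add_constant(X_train.astype(float))   # adds the intercept column
logit_model = sm.Logit(y_train, X_train_const).fit(disp=0)

summary_table = pd.DataFrame({
    "coefficient": logit_model.params,
    "p_value": logit_model.pvalues,
})
print(summary_table)
```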
7. Scrutinizing the list of variables further, if required. This can be done by many methods, including:
- p-values: variables that are not statistically significant (i.e. with high p-values) can be dropped
- dropping variables whose significance is inexplicable / not relevant
- step-wise regression, etc.
8. Train the model again using the reduced set of variables.
9. Predict the target for the test set using the trained model.
10. Decide on the cut-off probability which would differentiate the defaulter vs non-defaulter classification. This depends on the risk tolerance guidelines of the business and has a direct bearing on the provision numbers / riskiness of the portfolio.
11. Based on the cut-off, classify the predictions into 0 and 1 for the target (defaulters vs non-defaulters). Observe that some of the previous predictions get reclassified in accordance with the threshold (the cut-off probability mentioned in the point above).
12. Validation / testing of the model: how good the model is at predicting the classification correctly (see the sketch below).
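A minimal sketch covering points 9 to 12, continuing from the fitted model above; the 0.5 cut-off is purely illustrative and would in practice come from the business's risk tolerance as discussed in point 10.

```python
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, roc_auc_score

# Score the test set, apply an illustrative cut-off, and compute basic metrics
X_test_const = sm.add_constant(X_test.astype(float), has_constant="add")
pred_proba = logit_model.predict(X_test_const)        # predicted probability of default (PD)

cutoff = 0.5                                          # illustrative threshold (point 10)
pred_class = (pred_proba >= cutoff).astype(int)       # 1 = classified as defaulter

print(confusion_matrix(y_test, pred_class))
auc = roc_auc_score(y_test, pred_proba)
print(f"AUC: {auc:.3f}  Gini: {2 * auc - 1:.3f}")
```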
13. Update the summary table with the coefficients of the new model (Ref: points 6 and 8).
Snapshot of the resultant scorecard (truncated)
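The scorecard itself is built from the coefficient table above. As one hedged illustration (not the exact scaling behind the snapshot), the sketch below converts the model's log-odds into a score using the common "points to double the odds" (PDO) convention; the scaling parameters (600 points at 50:1 good/bad odds, PDO of 20) are assumptions.

```python
import numpy as np

# Assumed scaling: 600 points at 50:1 good/bad odds, 20 points to double the odds
pdo, target_score, target_odds = 20, 600, 50
factor = pdo / np.log(2)
offset = target_score - factor * np.log(target_odds)

# Log-odds of default from the fitted logit model (features dot coefficients)
log_odds_bad = X_test_const.dot(logit_model.params)

# Higher score = lower predicted risk
score = offset - factor * log_odds_bad
print(score.describe())
```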
14. Calculate the approval / rejection rate by creating a function which runs through the FPR, TPR and thresholds generated while checking the ROC curve (validation, point 12).
For each threshold probability, we progressively see how many of our predicted PDs would be approved for a loan. For e.g., if we keep the threshold probability at only a 1% chance of default, approximately only 37 loans would be approved from our dataset. This is similar to how we reclassified predictions in point 9 vs point 11. The figure of 37 loans can be seen in the 4th entry of the image above.
We have already considered maintaining some PD cut-off for classifying default vs non-default in point 11. We would want to see the approval numbers with respect to PD before finalizing the basis. There would also be a consideration of how these numbers affect profitability.
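A minimal sketch of such an approval-rate table, continuing from the test-set predictions above: for each threshold from the ROC curve, count how many applicants would be approved (predicted PD below that threshold).

```python
import pandas as pd
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, pred_proba)

approvals = pd.DataFrame({
    "threshold": thresholds,
    "n_approved": [(pred_proba < t).sum() for t in thresholds],
})
approvals["approval_rate"] = approvals["n_approved"] / len(pred_proba)
print(approvals.head(10))
```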
Note: Model performance can be improved by considering:
LGD Model
1. The LGD model uses a combination of logistic and linear models to predict LGD / the recovery rate.
2. The target variable, Recovery Rate (RR), can be calculated as the amount recovered divided by the exposure (funded amount) at the time of default.
3. The dataset is the same, except that only loans with defaulted/delinquent status should be considered for modelling LGD. Consider only the variables which were used in building the PD model.
4. Validate the LGD model by:
We have considered only defaulters in the data, thus RR = 0 has a higher concentration, indicating most defaulters have RR = 0, while progressively fewer defaulters have higher RR. Negative values can occur because of the time difference in recording various parameters, like utilization vs limit, or outstanding vs repayment. Limit negative values to zero and values > 1 to 1.
Snippet of predictions by first LGD model
Snippet of predictions by second LGD model
Note that RR = 0 where the first LGD model predicted no recovery. We have forced the second model's RR predictions for these observations to be zero by multiplying the predictions of the 1st model by those of the 2nd model. This is because the linear model would always predict some value based on the inputs, but these records have a higher chance of zero recovery, as differentiated in the first model. From RR we also arrive at LGD (= 1 - RR), our target parameter.
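A minimal sketch of the two-stage LGD approach described above, on defaulted loans only. The recovery-rate construction, column names and feature list are assumptions for illustration, not the article's exact setup.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# Defaulted loans only; recovery rate capped to the [0, 1] range
defaults = df[df["target"] == 1].copy()
defaults["recovery_rate"] = (defaults["recoveries"] / defaults["funded_amnt"]).clip(0, 1)

features = ["days_since_origination", "annual_inc"]   # illustrative feature set
X_lgd = defaults[features]

# Stage 1 (logistic): is there any recovery at all?
any_recovery = (defaults["recovery_rate"] > 0).astype(int)
stage1 = LogisticRegression(max_iter=1000).fit(X_lgd, any_recovery)

# Stage 2 (linear): level of recovery, fitted only where some recovery was observed
mask = defaults["recovery_rate"] > 0
stage2 = LinearRegression().fit(X_lgd[mask], defaults.loc[mask, "recovery_rate"])

# Combined recovery rate = stage 1 prediction (0/1) * stage 2 prediction; LGD = 1 - RR
rr_pred = stage1.predict(X_lgd) * stage2.predict(X_lgd).clip(0, 1)
defaults["LGD"] = 1 - rr_pred
print(defaults["LGD"].describe())
```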
LGD modelling follows the same steps as PD, except for the choice of models – logit and linear, consecutively. To increase the accuracy of the model, we may want to consider including more variables from the original dataset (and not only the ones used to model PD). This can be based on the p-values generated again for the LGD model.
EAD Model
1. EAD can be modelled using linear regression, as it is a direct estimation of exposures.
2. The target variable should be the CCF (Credit Conversion Factor), which is basically the expected amount to be utilized in future by a particular customer (unutilized amount as on date * probability of utilization).
3. Train the model and predict values (see the sketch after this list).
4. Validate the model by:
5. Check for anomalies in the descriptive stats of the predictions, like negative values.
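A minimal sketch of the EAD model: a linear regression on the CCF, with predictions clipped to [0, 1] and converted back to an exposure amount. The CCF construction and column names here are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Defaulted loans only; assumed CCF = share of the funded amount still outstanding
defaults = df[df["target"] == 1].copy()
defaults["CCF"] = (defaults["funded_amnt"] - defaults["total_rec_prncp"]) / defaults["funded_amnt"]

features = ["days_since_origination", "annual_inc"]   # illustrative feature set
ccf_model = LinearRegression().fit(defaults[features], defaults["CCF"])

# Clip anomalies (negative or > 1 predictions) and convert back to an amount
ccf_pred = ccf_model.predict(defaults[features]).clip(0, 1)
defaults["EAD"] = ccf_pred * defaults["funded_amnt"]
print(pd.Series(ccf_pred).describe())
```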
Tying it all together – Estimating Expected Loss
1. Use the PD, LGD and EAD models on the whole original dataset to create estimated columns for each parameter.
2. Create the EL (Expected Loss) column = PD column * LGD column * EAD column.
3. Calculate EL% = total EL amount divided by the total funded amount of the dataset (portfolio). A minimal sketch follows below.
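The sketch below assumes the full portfolio has already been scored by the three models and carries 'PD', 'LGD', 'EAD' and 'funded_amnt' columns; the file and column names are placeholders.

```python
import pandas as pd

portfolio = pd.read_csv("scored_portfolio.csv")   # hypothetical scored portfolio

portfolio["EL"] = portfolio["PD"] * portfolio["LGD"] * portfolio["EAD"]
el_pct = portfolio["EL"].sum() / portfolio["funded_amnt"].sum()
print(f"Expected loss ratio (EL%): {el_pct:.2%}")
```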
EL is used as a provision against the loan assets and represents the portion of the portfolio on the balance sheet which may become delinquent. As per my understanding, the observed ratio of expected loss (EL / funded amount of the portfolio) is typically 2% to 7%. This number also impacts profitability directly through the P&L, and thus becomes a key number to consider while forming risk policies. EL% is used, in addition to other factors, to adopt a stance on credit policy – aggressive (riskier portfolio and better margins) or risk averse (safer loans and lower margins).
Hope the above generic summary / guide helped you understand the process. Please let me know if any point is unclear or incorrect, or drop a comment on how you liked the post and any suggestions for improvement.
Check out my next article on Model Monitoring for Credit Card / Retail Exposures (using PSI).
More on other stages and methods in credit risk management, model monitoring etc. in later posts. Ciao.
PS: Be safe and practice precautions. COVID is not over yet.