
Benchmarking PD Models

When evaluating various scoring functions for Probability of Default (PD) modeling, the most commonly assessed performance metric is the Gini score. The Gini score measures the rank-ordering power of the model, i.e., how effectively the model separates defaulting from non-defaulting observations based on the order of its predictions. A Gini score of 1 indicates perfect discrimination ("crystal ball" performance), while a score of 0 implies no discriminatory power. However, in certain scenarios, the Gini coefficient can be misleading and lead to the selection of a suboptimal model.

Imagine a situation where two scoring functions, WOE Logistic Regression (WOE LR) and XGBoost, perform similarly in terms of the Gini score (Gini = 2 * AUC - 1):

ROC AUC curve for the two scoring functions
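
As a minimal illustration of the relationship between the two metrics (the arrays below are synthetic placeholders, not the article's dataset), the Gini score can be derived directly from the ROC AUC:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def gini(y_true, p_pred):
        """Gini = 2 * AUC - 1: rank-ordering power of a PD model."""
        return 2 * roc_auc_score(y_true, p_pred) - 1

    # Synthetic default flags and predicted PDs from two hypothetical models
    y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
    p_woe_lr = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90])
    p_xgb = np.array([0.02, 0.15, 0.25, 0.28, 0.40, 0.45, 0.50, 0.65, 0.75, 0.85])

    print(f"Gini WOE LR:  {gini(y, p_woe_lr):.3f}")
    print(f"Gini XGBoost: {gini(y, p_xgb):.3f}")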

In this scenario, a common method of model evaluation beyond the Gini coefficient is to establish a classification rule based on a probability model and assess the outcomes using a cost-benefit analysis. This analysis involves assigning fixed costs and benefits based on the costs of misclassification at a predetermined cut-off.

Cut-off setting: not recommended for Credit Risk problems

Given that this approach requires setting a specific cut-off point for classifying a prediction as correct (e.g., 4% or 50%), there is no guarantee that a model selected this way would perform similarly on new data where the underlying distribution of the target variable differs from that of the training dataset.

Log Loss

One metric for assessing PD model performance that is not discussed as often is a logarithmic scoring rule, better known as Log Loss:

Log Loss

Log Loss measures the proximity of predictions to the true labels and is a common loss function in model estimation methods such as Logistic Regression and Gradient Boosted Decision Trees. Additionally, this metric is employed to assess the calibration of forecasts.

Log Loss penalizes predictions more heavily the further they are from the true label. When predictions deviate significantly, for example, when the predicted probability is 0.99 for a "good" customer (default flag is 0) or 0.01 for a "bad" customer (default flag is 1), we observe large loss values, indicating lower accuracy of our predictions:

High Log Loss example

Conversely, if the predicted probabilities are close to the true labels of the default event, the loss will be close to 0, indicating high accuracy of our predictions:

Low Log Loss example
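
A short sketch of these two cases, applying the standard per-observation binary log loss (the helper function name is illustrative):

    import numpy as np

    # Per-observation binary log loss: -(y * log(p) + (1 - y) * log(1 - p))
    def observation_loss(y, p):
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Confidently wrong predictions produce large loss values
    print(observation_loss(0, 0.99))  # "good" customer (y = 0) predicted at 99% PD -> ~4.61
    print(observation_loss(1, 0.01))  # "bad" customer (y = 1) predicted at 1% PD  -> ~4.61

    # Predictions close to the true labels produce loss values close to 0
    print(observation_loss(0, 0.01))  # -> ~0.01
    print(observation_loss(1, 0.99))  # -> ~0.01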

Below, the Loss curves for the two scoring functions with similar Gini scores are shown:

Loss curves for two models

We can observe that the Loss curves for our models are quite similar, but there are differences in the lower probability range for the defaulted cases (left chart) and in the upper probability range for the non-defaulted cases (right chart), where loss values are high, indicating inaccuracies. The naive loss corresponds to forecasting the sample average default rate (50%) for each observation.
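
As a rough sketch, assuming a balanced sample with a 50% default rate, the naive benchmark loss can be reproduced as follows (data is illustrative):

    import numpy as np

    def observation_loss(y, p):
        """Per-observation binary log loss."""
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Naive benchmark: forecast the sample average default rate for every observation
    y = np.array([0, 1, 0, 1])            # illustrative flags, 50% default rate
    naive_pd = np.full(len(y), y.mean())  # 0.5 for each observation
    print(observation_loss(y, naive_pd))  # ~0.693 (= -ln(0.5)) everywhere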

Looking at these results, it is difficult to conclude whether one model is preferable to the other.

Loss Uplift

We can compare the Loss metrics of the two models directly and benchmark them against each other using a Loss uplift metric:

Loss Uplift

The Loss uplift metric is conceptually similar to the Normalized Cross Entropy from the He et al. (2014) paper, but here we benchmark our challenger model against a baseline model rather than a "random guess" or naive prediction model (e.g., one assigning a 50% probability to every observation).

Essentially, Loss uplift is the arithmetic average of the differences between the Log Loss values of the two models for each prediction. Negative values indicate better performance than the baseline, and vice versa.

In our example, we will benchmark WOE LR (challenger) against XGBoost (baseline).
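
A minimal sketch of how this uplift can be computed, following the definition above (the arrays are synthetic placeholders; the article's actual figures are reported below):

    import numpy as np

    def observation_loss(y, p):
        """Per-observation binary log loss."""
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    def loss_uplift(y, p_challenger, p_baseline):
        """Average Log Loss difference (challenger minus baseline); negative means the challenger has the lower loss."""
        return np.mean(observation_loss(y, p_challenger) - observation_loss(y, p_baseline))

    # Synthetic example: WOE LR as challenger, XGBoost as baseline
    y = np.array([0, 0, 1, 1, 0, 1])
    p_woe_lr = np.array([0.10, 0.20, 0.70, 0.80, 0.30, 0.60])  # challenger PDs
    p_xgb = np.array([0.15, 0.25, 0.65, 0.75, 0.35, 0.55])     # baseline PDs

    goods, bads = (y == 0), (y == 1)
    print(f"Uplift goods (class = 0): {loss_uplift(y[goods], p_woe_lr[goods], p_xgb[goods]):+.2%}")
    print(f"Uplift bads  (class = 1): {loss_uplift(y[bads], p_woe_lr[bads], p_xgb[bads]):+.2%}")
    print(f"Uplift overall:           {loss_uplift(y, p_woe_lr, p_xgb):+.2%}")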

Loss uplift can be calculated for each class separately and at the overall level; the results are reported below:

  • Uplift goods (class = 0): -0.20%
  • Uplift bads (class = 1): -0.02%
  • Uplift: -0.10%

The metrics above show that WOE LR has a lower error and improved performance for both classes.

Business metric

To visualize the Loss uplift, we will create a business metric based on a cost-benefit analysis with the following logic in mind (a code sketch implementing these rules follows the list):

  • We assign +€100 when our challenger model predicted a higher probability for Class 1 vs the baseline and -€100 otherwise
  • We assign +€50 when our challenger model predicted a lower probability for Class 0 vs the baseline and -€50 otherwise
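
A minimal sketch implementing these payout rules (the function and arrays are illustrative, not the notebook's actual code):

    import numpy as np

    def payouts(y, p_challenger, p_baseline, reward_bad=100.0, reward_good=50.0):
        """Per-observation payout in EUR: +/-100 for defaulters depending on whether the
        challenger predicted a higher PD than the baseline, +/-50 for non-defaulters
        depending on whether the challenger predicted a lower PD."""
        return np.where(
            y == 1,
            np.where(p_challenger > p_baseline, reward_bad, -reward_bad),
            np.where(p_challenger < p_baseline, reward_good, -reward_good),
        )

    # Synthetic example: WOE LR (challenger) vs XGBoost (baseline)
    y = np.array([0, 0, 1, 1, 0, 1])
    p_woe_lr = np.array([0.10, 0.20, 0.70, 0.80, 0.30, 0.60])
    p_xgb = np.array([0.15, 0.25, 0.65, 0.75, 0.35, 0.55])
    print(f"Total payout: {payouts(y, p_woe_lr, p_xgb).sum():+,.0f} EUR")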

This approach allows us to give a graphical representation in the form of Payout curves for the Loss uplift metric:

Payout Curves

What we observe in the chart above is that most of the payout comes from improvements in prediction accuracy for defaulters in the probability range above 70%. Due to the higher weight (€100) assigned to defaulted observations, our choice of WOE LR would be warranted from a business perspective.

The total payout of using the challenger vs the baseline amounts to +€20,300.

Concluding remarks

Benchmarking PD models with similar rank-ordering power using alternative metrics can help identify the most promising models with higher payouts from model use. Utilizing a combination of scoring rules and business metrics enables risk practitioners to look beyond Gini scores in the search for the best scoring model.


--

I hope you have enjoyed reading this post!

The technical appendix with the code can be found in this notebook.

All views expressed are my own.

References

He, Xinran, et al. "Practical Lessons from Predicting Clicks on Ads at Facebook." Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. 2014. https://research.facebook.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/






Aniruddha Neogi

Principal Consultant, AI Innovation @ FICO | MStat in Applied Statistics, ISI

1y

Good article! Quick question though. Shouldn’t logistic regression always come out as the winner when using the log loss metric? After all the log loss looks pretty much like the likelihood function which is minimized when estimating logistic regression coefficients. Payout curves are a good idea if pd models are looked at in isolation. When estimating expected losses dollar comparisons should ideally even out once EAD and LGD are considered. Finally, I assume choosing a holdout or out of time sample data for benchmarking always helps.

Philipp Gorokhov

Model Risk Manager / Data Scientist

1y

Denis, thank you for the article. I have a couple of questions, some insights are not clear enough: 1) What's the idea of calculation "Loss Uplift". It's just difference between LogLosses of 2 models. Suppose models have equal ROC_AUC, then compare LogLosses and find better model. I've run your python script LogLoss_woe = 54.88%, LogLoss_xgb = 54.99%, Uplift - 0.1%. Do I understand correctly your message is "necessary to rank models by 2 metrics, AUC then LogLoss"? 2) I do not understand the idea behind Business metric. Why 100EUR, 50EUR? Guess it's expert estimate, but looks inapplicable for real cases. I'd develop the approach with exploit of limits at the moment of credit approval, different penalty depending on Probability difference and so on. 3) Looks like X legend in LogLoss figures are slightly incorrect. Ordered by Probability, not Score. Isn't it?

Arjun Sunder S, CA,FRM

Credit Risk Models|IRB|IFRS-9|Basel III| Regulatory Reporting |Model Risk Management| ICAAP |Stress Loss Modelling| Econometric models

1y

Thank you so much for sharing this. Very insightful

Niclas Celvin

Consultant | Risk Management | Model development

1y

Thanks for sharing this, will definitely apply some of these techniques going forward.
