Benchmarking PD Models
When evaluating scoring functions for Probability of Default (PD) modeling, the most commonly assessed performance metric is the Gini score. The Gini score measures the rank-ordering power of a model, i.e. how effectively the model separates defaulted from non-defaulted observations based on the ordering of its predictions. A Gini score of 1 indicates perfect discrimination ("crystal ball" performance), while a score of 0 implies no discriminatory power. In certain scenarios, however, the Gini coefficient can be misleading and lead to the selection of a suboptimal model.
Imagine a situation where two scoring functions, WOE Logistic Regression (WOE LR) and XGBoost, perform similarly in terms of the Gini score (Gini = 2 * AUC - 1):
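For reference, the Gini scores can be computed along these lines (a minimal sketch; woe_lr, xgb_model, X_valid and y_valid are assumed names for the fitted models and a validation sample):

```python
# Minimal sketch: Gini = 2 * AUC - 1 for two fitted PD models.
# woe_lr, xgb_model, X_valid, y_valid are assumed to exist from an earlier step.
from sklearn.metrics import roc_auc_score

p_woe = woe_lr.predict_proba(X_valid)[:, 1]      # predicted PDs, WOE Logistic Regression
p_xgb = xgb_model.predict_proba(X_valid)[:, 1]   # predicted PDs, XGBoost

gini_woe = 2 * roc_auc_score(y_valid, p_woe) - 1
gini_xgb = 2 * roc_auc_score(y_valid, p_xgb) - 1
print(f"Gini WOE LR:  {gini_woe:.3f}")
print(f"Gini XGBoost: {gini_xgb:.3f}")
```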
In this scenario, a common method of model evaluation beyond the Gini coefficient is to establish a classification rule based on the predicted probabilities and assess the outcomes using a cost-benefit analysis. This analysis assigns fixed costs and benefits to correct and incorrect classifications at a predetermined cut-off.
Given that this approach requires setting a specific cut-off point for classifying a prediction as correct (e.g., 4% or 50%), a model selected this way is not guaranteed to perform similarly on new data if the underlying distribution of the target variable differs from that of the training dataset.
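For illustration, a cut-off based cost-benefit check could look roughly as follows (the cut-off and the unit costs/benefits are assumed values for this sketch, not the ones used later in this post):

```python
import numpy as np

def cutoff_payout(y_true, p_pred, cutoff=0.04,
                  cost_missed_default=-100.0,  # defaulter approved (false negative)
                  cost_rejected_good=-10.0,    # good customer rejected (false positive)
                  benefit_approved_good=5.0):  # good customer approved (true negative)
    """Total payout of an approve/reject rule at a fixed probability cut-off."""
    y_true = np.asarray(y_true)
    rejected = np.asarray(p_pred) >= cutoff
    payout = np.where(y_true == 1,
                      np.where(rejected, 0.0, cost_missed_default),
                      np.where(rejected, cost_rejected_good, benefit_approved_good))
    return float(payout.sum())

# Toy usage with four observations
print(cutoff_payout([0, 0, 1, 1], [0.02, 0.10, 0.03, 0.60]))  # -105.0
```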
Log Loss
One metric for assessing PD model performance that is discussed less often is the logarithmic scoring rule, better known as Log Loss:
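In its standard binary form, for N observations with default flags y_i ∈ {0, 1} and predicted default probabilities p_i:

```latex
\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \,\right]
```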
Log Loss measures the proximity of predictions to the true labels and is a common loss function in model estimation methods such as Logistic Regression and Gradient Boosted Decision Trees. Additionally, this metric is employed to assess the calibration of forecasts.
Log Loss penalizes predictions most heavily when they are confidently wrong. When predictions deviate significantly from the true labels, for example when the predicted probability is 0.99 for a "good" customer (default flag is 0) or 0.01 for a "bad" customer (default flag is 1), we observe large loss values, indicating lower accuracy of our predictions:
Alternatively, if the predicted probabilities are close to the true labels of the default event, the loss will be close to 0, indicating high accuracy of our predictions:
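As a quick numeric illustration (per observation, the loss reduces to -ln(p) when the default flag is 1 and -ln(1 - p) when it is 0):

```python
import numpy as np

def obs_loss(y, p):
    """Log loss for a single observation."""
    return -np.log(p) if y == 1 else -np.log(1 - p)

print(round(obs_loss(0, 0.99), 2))  # "good" customer scored 0.99 -> 4.61 (large loss)
print(round(obs_loss(1, 0.01), 2))  # "bad" customer scored 0.01  -> 4.61 (large loss)
print(round(obs_loss(0, 0.05), 2))  # "good" customer scored 0.05 -> 0.05 (small loss)
print(round(obs_loss(1, 0.95), 2))  # "bad" customer scored 0.95  -> 0.05 (small loss)
```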
Loss curves for the two scoring functions with similar Gini scores are shown below:
We can observe that the Loss curves for our models are quite similar, but there are differences in the lower probability range for the defaulted cases (left chart) and in the upper probability range for the non-defaulted cases (right chart), where loss values are high, indicating inaccuracies. The naive loss corresponds to forecasting the sample average default rate (50%) for each observation.
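Loss curves of this kind can be put together roughly as follows (an illustrative sketch reusing y_valid, p_woe and p_xgb from the earlier snippets; not necessarily how the charts above were produced):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_loss_curve(ax, y, preds, labels, target_class):
    """Per-observation log loss vs. predicted probability for one class."""
    mask = np.asarray(y) == target_class
    for p, label in zip(preds, labels):
        p_cls = np.clip(np.sort(np.asarray(p)[mask]), 1e-15, 1 - 1e-15)
        loss = -np.log(p_cls) if target_class == 1 else -np.log(1 - p_cls)
        ax.plot(p_cls, loss, label=label)
    ax.axhline(-np.log(0.5), linestyle="--", label="Naive loss (50% forecast)")
    ax.set_xlabel("Predicted probability of default")
    ax.set_ylabel("Log Loss")
    ax.set_title("Defaulted cases" if target_class == 1 else "Non-defaulted cases")
    ax.legend()

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
plot_loss_curve(ax_left,  y_valid, [p_woe, p_xgb], ["WOE LR", "XGBoost"], target_class=1)
plot_loss_curve(ax_right, y_valid, [p_woe, p_xgb], ["WOE LR", "XGBoost"], target_class=0)
plt.tight_layout()
plt.show()
```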
Looking at these results, it is difficult to conclude whether one model is preferable over the other.
Loss Uplift
There is a way to compare the two Loss metrics and benchmark the two models: the Loss uplift metric:
The Loss uplift metric is conceptually similar to the Normalized Cross Entropy from the He et al. (2014) paper, but here we benchmark our challenger model against a baseline model rather than against a "random guess" or naive prediction model (e.g., one assigning a 50% probability to every observation).
Essentially, Loss uplift is the arithmetic average of the per-observation differences between the Log Loss values of the challenger and the baseline model. Negative values indicate better performance than the baseline, and vice versa.
In our example, we will benchmark WOE LR (challenger) against XGBoost (baseline).
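A minimal sketch of this calculation, reusing y_valid, p_woe and p_xgb from the earlier snippets (per-observation uplift is challenger loss minus baseline loss, so negative values favour the challenger):

```python
import numpy as np
import pandas as pd

def per_obs_log_loss(y, p, eps=1e-15):
    """Per-observation binary log loss, with probabilities clipped away from 0 and 1."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Challenger (WOE LR) minus baseline (XGBoost)
uplift = per_obs_log_loss(y_valid, p_woe) - per_obs_log_loss(y_valid, p_xgb)

# Per-class and overall Loss uplift
per_class = pd.Series(uplift).groupby(np.asarray(y_valid)).mean()
print(per_class)
print("Overall Loss uplift:", uplift.mean())
```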
Loss uplift can be calculated for each class separately as well as at the overall level; the results are reported below:
The metrics above show that WOE LR has a lower error and improved performance for both classes.
Business metric
To visualize the Loss uplift, we will create a business metric based on a cost-benefit analysis with the following logic in mind:
This approach allows us to give a graphical representation of the Loss uplift metric in the form of Payout curves:
What we observe in the chart above is that most of the payout comes from improvements in prediction accuracy for defaulters in the probability range above 70%. Due to the higher weight (€100) assigned to defaulted observations, our choice of WOE LR would be warranted from a business perspective.
The total payout of using the challenger instead of the baseline amounts to +€20,300.
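A payout calculation of this kind can be sketched as follows (the €100 weight per defaulted observation follows the text above; the €50 weight per non-defaulted observation and the way the curve is ordered are assumed here for illustration only):

```python
import numpy as np
import matplotlib.pyplot as plt

# Monetary weights per observation: €100 for a defaulted case (as noted above);
# €50 for a non-defaulted case is an assumed, purely illustrative value.
weights = np.where(np.asarray(y_valid) == 1, 100.0, 50.0)

# Positive payout where the challenger's per-observation loss is lower than the baseline's
payout = -uplift * weights
print(f"Total payout: €{payout.sum():,.0f}")

# One possible payout curve: cumulative payout per class, ordered by predicted probability
for cls, label in [(1, "Defaulted"), (0, "Non-defaulted")]:
    mask = np.asarray(y_valid) == cls
    order = np.argsort(np.asarray(p_woe)[mask])
    plt.plot(np.asarray(p_woe)[mask][order], np.cumsum(payout[mask][order]), label=label)
plt.xlabel("Predicted probability of default (WOE LR)")
plt.ylabel("Cumulative payout, €")
plt.legend()
plt.show()
```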
Concluding remarks
Benchmarking PD models with similar rank-ordering power using alternative metrics can help identify the most promising models with higher payouts from model use. Utilizing a combination of scoring rules and business metrics enables risk practitioners to look beyond Gini scores in their search for the best scoring model.
--
I hope you have enjoyed reading this post!
The technical appendix with the code can be found in this notebook.
All views expressed are my own.
References
He, Xinran, et al. "Practical Lessons from Predicting Clicks on Ads at Facebook." Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. 2014. https://research.facebook.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/