Validating Tree-Based Risk Models
Boosting is a fundamental machine learning technique that has achieved remarkable success in binary classification, a setting behind many commercial applications. By combining ensembles of decision trees, boosted models deliver strong predictive performance while handling common challenges such as missing data, outliers, and feature interactions.
Among the various forms of boosting, gradient boosted decision trees (GBDT) are particularly prominent. These models are celebrated for their superior discriminatory power on tabular datasets compared to parametric models. However, a major challenge in fully benefiting from these powerful models in risk management tasks is, arguably, model validation (EBA/REP/2023/28, Chapter 4).
What is Model Validation?
Validation in the context of risk management is a critical step in ensuring models are safe and ready for production use. SR 11-7 defines model validation as the set of processes and activities intended to verify that models perform as expected, which generally includes:
- evaluation of conceptual soundness, including developmental evidence;
- ongoing monitoring, including process verification and benchmarking;
- outcomes analysis, including back-testing.
Throughout this process, validators form their opinion on the model and its assumptions. This process differs from standard model back-testing (sometimes also called validation in non-regulated ML environments), where the process is usually more straightforward and does not involve external verification of the model construct.
Model Validation for High-Risk Models
For high-risk applications of ML such as credit scoring, portfolio risk management, and other finance use cases, an independent validation is essential for robust model risk management. Per regulatory guidance, validation must cover the whole scope of model development.
In recent years, there has been growing interest in the model risks associated with ML models for high-stakes use cases. This heightened interest largely coincides with increased regulatory attention, as seen in frameworks like GDPR, the EU AI Act, and the NIST AI RMF.
When working with more complex ML models, validators may face additional questions, for example around explainability, feature and population stability, and ongoing performance monitoring.
Among these challenges, performance monitoring of ML models can be considered key. When working with models like gradient boosted decision trees (GBDT), validators may require technical tools and familiar metrics to perform this assessment. This article proposes a method for assessing the performance of GBDT models using the chi-square (χ2) test of independence. The approach is applicable to other contingency-based metrics (for example, Somers' D), which would lead the validator to similar conclusions about potential model degradation.
Validation Approach for Tree-Based Risk Models
GBDT model performance can be validated at the sub-model (tree) level. The tree structure can be represented in the form of contingency tables formed by the bins (leaves) of the tree. One common test for contingency table analysis is the chi-square test of independence between categories (tree leaves against labels).
The chi-square (χ2) statistic measures how likely it is that observed differences between sets of categorical data arose by chance. It was introduced by Karl Pearson (1857–1936) and later advanced by Ronald A. Fisher (1890–1962).
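For a contingency table with observed cell counts $O_i$ and expected counts $E_i$ (computed from the row and column totals under the hypothesis of independence), the statistic takes the familiar form:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = \frac{\text{row total} \times \text{column total}}{n},$$

where the sum runs over all cells of the table and $n$ is the total number of observations.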
A gradient boosted model consists of a sequence of boosting iterations, each represented by a decision tree fitted to the residuals from previous boosting rounds. For binary classification, the raw model predictions, known as margins or leaf weights, represent the log odds of class membership; they are accumulated across boosting iterations and passed through a sigmoid function (as in logistic regression) to produce the final probability.
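In symbols, for an ensemble of $T$ trees $f_1, \dots, f_T$, the predicted probability for an observation $x$ is

$$\hat{p}(x) = \sigma\!\left(b + \sum_{t=1}^{T} f_t(x)\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

where $f_t(x)$ is the margin (leaf weight) assigned to $x$ by tree $t$, and $b$ is a constant base margin (in XGBoost, derived from base_score).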
The approach proposed here, however, does not rely on residuals: it uses the observed labels directly in a chi-square test, viewing each tree as a stand-alone model (sub-model).
Consider an example based on the first tree in the ensemble model showing the distribution of training data points within buckets:
This table describes the risk classification of a decision tree whose first split (Delinquency < 98%) separates the majority of Class 1 (default) cases. This can be seen from the conditional probability of x < 0.98 given y = 1 (674/700). From this contingency table, we can calculate the chi-square (χ2) statistic, which equals 1,020 (p-value < 0.001). The high value of the statistic indicates that the observed data differ markedly from the distribution expected under independence.
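As a minimal sketch of this calculation, the snippet below runs the chi-square test of independence on a leaf-by-label contingency table with scipy. The counts are hypothetical placeholders: only the 674/26 default split is taken from the example above, while the non-default counts are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows are the tree's leaves
# (Delinquency < 0.98 vs. >= 0.98), columns are labels (y = 0, y = 1).
# The y = 0 counts are invented; only the 674/26 default split is
# taken from the example in the text.
table = np.array(
    [
        [1200, 674],  # leaf: Delinquency < 0.98
        [5100, 26],   # leaf: Delinquency >= 0.98
    ]
)

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.1f}, p-value = {p_value:.2e}, dof = {dof}")
```

The same call works unchanged on the validation dataset's table, which is how the comparison in the next paragraph would be produced.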
When validating the model on a new dataset (e.g., an out-of-time dataset), the validator's goal is to verify that the model trained on historical data still reflects current risk adequately. A chi-square statistic of 1.895 (p-value = 0.169) for the new data might indicate that the model no longer represents the population adequately and may cause an adverse selection of risk.
It is important to note that with the growing complexity of tree-based models (e.g., larger depths of individual trees), the test discussed above may become unreliable: with more categories, a significant outcome becomes more likely. Hence, a penalized version of the chi-square statistic can be used:
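One natural form of such a penalty (an assumption here, not necessarily the exact adjustment used in the original analysis) divides the statistic by the table's degrees of freedom:

$$\chi^2_{\text{adj}} = \frac{\chi^2}{k - 1},$$

where $k$ is the number of leaves in the tree, so that $k - 1$ is the degrees of freedom of a $k \times 2$ leaves-by-label table.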
This adjustment accounts for the increased likelihood of spurious significance in more complex trees by penalizing the chi-square statistic based on the complexity of the tree.
The general idea can be extended to testing the goodness of fit of all trees in the model to obtain an average chi-square. In practice, this approach can support performance monitoring across different products within the model's scope of application that were not part of the training dataset. Such testing can also be extended to sub-populations or slices (e.g., revolving/non-revolving credit).
Step-by-Step Workflow
Below we outline the step-by-step approach to testing the goodness of fit for tree-based risk models.
1. Model and Data Preparation:
A fitted GBDT model, together with training and validation data, is provided.
2. Validation Using a New Dataset:
To validate the model, a validator uses a new dataset that has not been seen during training; for example, out-of-time (OOT) or out-of-sample (OOS) data, or a combination thereof.
3. Running Through Each Tree:
Each observation is routed to a leaf of every tree in the ensemble, and for each tree a contingency table of leaf membership against observed labels is built.
4. Calculating Goodness of Fit (Chi-Square Test):
The overall fit of the model can be assessed by calculating an average chi-square (χ2) statistic across trees (all trees are weighted equally unless we fit a logistic regression on top) and its corresponding p-value, indicating whether the model maintains the desired stability of risk differentiation.
Below we show an example of model performance degradation of roughly 30 Gini points, as reflected in the validation summaries for an XGBoost logistic regression model:
Training Data - Average Chi-Square: 280.06
Training Data - p-value of the average Chi-Square: 0.00000
Training Data - Percent of non-significant trees: 12.94%
Validation Data - Average Chi-Square: 55.48
Validation Data - p-value of the average Chi-Square: 0.99452
Validation Data - Percent of non-significant trees: 37.65%
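A minimal sketch of how such summaries might be produced for an XGBoost binary classifier is shown below. The helper tree_level_chi_square is hypothetical, the fitted booster bst and the datasets X_valid, y_valid are assumed to exist, and evaluating the average statistic against the average degrees of freedom is a simplification (trees of different depths have different degrees of freedom); the exact implementation lives in the linked notebook.

```python
import numpy as np
import xgboost as xgb
from scipy.stats import chi2, chi2_contingency

def tree_level_chi_square(bst: xgb.Booster, X, y, alpha: float = 0.05):
    """Run a chi-square test of leaves vs. labels for every tree in `bst`."""
    # Leaf index of every observation in every tree: shape (n_rows, n_trees).
    leaves = bst.predict(xgb.DMatrix(X), pred_leaf=True).astype(int)
    y = np.asarray(y)
    stats, pvals, dofs = [], [], []
    for t in range(leaves.shape[1]):
        # Contingency table for tree t: one row per populated leaf,
        # one column per class label.
        table = np.array(
            [[np.sum((leaves[:, t] == leaf) & (y == c)) for c in (0, 1)]
             for leaf in np.unique(leaves[:, t])]
        )
        stat, p, dof, _ = chi2_contingency(table)
        stats.append(stat)
        pvals.append(p)
        dofs.append(dof)
    avg_stat = float(np.mean(stats))
    # Simplification: p-value of the average statistic, evaluated
    # against the average degrees of freedom across trees.
    avg_p = float(chi2.sf(avg_stat, df=np.mean(dofs)))
    pct_nonsig = 100.0 * float(np.mean(np.array(pvals) > alpha))
    return avg_stat, avg_p, pct_nonsig, pvals

avg_stat, avg_p, pct_nonsig, pvals = tree_level_chi_square(bst, X_valid, y_valid)
print(f"Validation Data - Average Chi-Square: {avg_stat:.2f}")
print(f"Validation Data - p-value of the average Chi-Square: {avg_p:.5f}")
print(f"Validation Data - Percent of non-significant trees: {pct_nonsig:.2f}%")
```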
It is also interesting that some trees in the original model (about 13%) do not describe the data well at the chosen significance level (e.g., 0.05). With boosted scorecards, such trees can be removed from the model by excluding them from the overall score.
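Continuing the hypothetical sketch above (reusing bst, X_valid, and the pvals list), one way to exclude such trees is to rebuild the margin score from only the significant trees' leaf values, which XGBoost exposes through trees_to_dataframe():

```python
# Leaf assignments on the validation data, as in the sketch above.
leaves_valid = bst.predict(xgb.DMatrix(X_valid), pred_leaf=True).astype(int)

# For leaf rows of trees_to_dataframe(), the 'Gain' column holds the leaf value.
tree_df = bst.trees_to_dataframe()
leaf_value = {
    (row.Tree, row.Node): row.Gain
    for row in tree_df[tree_df.Feature == "Leaf"].itertuples()
}

# Keep only trees whose leaf/label contingency table is significant.
keep = [t for t, p in enumerate(pvals) if p <= 0.05]

# Rebuild the margin from the kept trees; the base margin (from
# base_score) must be added back to obtain calibrated probabilities.
margin = np.zeros(leaves_valid.shape[0])
for t in keep:
    margin += np.array([leaf_value[(t, leaf)] for leaf in leaves_valid[:, t]])
prob = 1.0 / (1.0 + np.exp(-margin))
```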
Summary
As data for predictive modeling becomes more complex and diverse, integrating rigorous, data-centric validation methods becomes increasingly critical. By applying a familiar suite of statistical tests to tree-based risk models, developers and validators can leverage well-established instruments to validate and assess the performance of predictive models.
This data-centric approach ensures that the model accurately describes the data-generating process, especially when applied to new, unseen data. Embracing robust validation practices ensures that advanced risk models maintain their predictive power and stability, ultimately safeguarding decision-making in high-risk scenarios.
--
I hope you have enjoyed reading this post!
The technical appendix with the code can be found in this notebook.
All views expressed are my own.
Explore more on #CreditRiskModeling, #Lending, #ModelRiskManagement, and stay updated by subscribing here: https://linktr.ee/deburky
Comments

Changing perspectives, connecting, finding creative solutions · 5 months ago
Dear Denis, very interesting post! Especially with the growing importance of advanced machine learning techniques, the spectrum of validation tests must be enhanced. Thank you for sharing, it can be very useful!
Chief Executive Officer · 5 months ago
Interesting!
Data Scientist | FinCrime Prevention | Credit Risk Modelling | Ethical and responsible AI · 5 months ago
The fact that some trees are not significant during training of the boosted scorecard is evidence of additional variance in the model. Wouldn't it be best to drop these features from the training data? Unless, of course, there is a regulatory or strong business rationale to keep them. Something is not clear to me, though. Assuming the trees are fitted at the feature level, without interactions with other features, and depending on the depth of the trees, we might encounter contingency tables with 2 x n categories. Usually these categories would represent the WoE bins used in the traditional method. Therefore, averaging the chi-square statistics and calculating the p-value of the averaged statistic would not represent the different degrees of freedom of each statistic. It would make sense if all tree leaves could be represented in a 2 x 2 contingency table format, but what if one of the trees has, for example, a 4 x 2 contingency table?
Senior Associate - PwC US Risk Advisory || Ex-Accenture AI || JU ECO'20 · 5 months ago
Thank you for such an informative article! Can this be evaluated/implemented for LightGBM models? As I see, the package 'xgbooster' used for this analysis is based on xgboost.