13 Risk Clinic - Time for a Large Credit Model?

Have you ever wondered why every bank is expected to build its own PD model? The regulatory expectation to “use your own data” is deeply ingrained. It is rooted in the vision that models should support risk management and hence reflect the actual portfolio and practices of the bank as closely as possible. When a new bank is created, it has no internal loss data and hence struggles to build robust models. Just imagine a newly founded hospital having to wait until it had collected its first batch of cancer patient data before it could deploy an approved diagnostic tool. By benchmarking the banking sector’s approach against a recent case study in medical diagnostics, I try to gauge whether an alternative approach of “standard models” might be feasible and worth the effort.

Upfront apologies: it looked like a simple compare and contrast, but as I started unpacking the comparison, the text expanded. I would have made it briefer if I had had more time.

Predicting the likelihood that somebody will get cancer, or detecting the onset of the disease at an early stage, is important to minimise the impact on the patient. For example, a team of radiologists and deep learning experts at MIT, led by Regina Barzilay, has been aiming to significantly advance the early detection of breast cancer using mammogram images. In their 2021 paper, they describe some of the challenges that need to be addressed: validating forecast models across diverse populations (compare: different banks), demonstrating how the predictions can improve clinical workflows (the use test, recoveries), preferably predicting risk at multiple future timepoints (PiT and TtC), making up for missing risk factor data, and producing prediction accuracy that does not depend on the mammography machine used (data quality). The 2021 model was trained on MIT/Boston data and backtested on data sets from a Swedish and a Taiwanese hospital. In a follow-up paper they extended the backtest to several more institutions. They have open sourced their MIRAI model and are calling for ever more hospitals to collaborate in validating it over an ever bigger population, thereby underpinning the claim to being a universally usable model.

Even though the approach of building one model and using it everywhere may feel natural in this medical-biological setting, it is important to bear in mind that genetic differences, phenotypical differences (lifestyle) and measurement differences (the mammography machine used) can add a lot of variation to the population being studied.

For PD models (at least the regulatory ones), one can distinguish two goals: (i) risk differentiation, helping the user rank order credit risk from good to bad and the probability of default from low to high, and (ii) risk calibration, ensuring that the absolute estimate of the PD is accurate. The bank-specific data are used for both steps: the identification of the risk drivers to include in the rank ordering, and the calibration. While the statistically most significant risk drivers differ from bank to bank (portfolio specific, obviously …), one can readily imagine a world where the risk ranking model is standard (after all, one would expect some uniformity in what causes a client to default), and where the calibration of the absolute default level is bank specific. Such a bank-specific calibration should also reduce the risk of regulatory arbitrage inherent in one-size-fits-all approaches.
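
To make that split concrete, here is a minimal sketch of "standard ranking, bank-specific calibration". Everything in it is illustrative: the shared coefficients, features and default rate are invented, not taken from any real model. The shared scorecard fixes the rank ordering; each bank only fits an intercept so that the implied average PD matches its own observed default rate.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit  # the logistic function

# Hypothetical shared ("standard") risk differentiation model: fixed
# coefficients that rank order obligors identically at every bank.
SHARED_WEIGHTS = np.array([0.9, -1.4, 0.6])  # illustrative values only

def rank_score(features):
    """Linear score from the shared model; higher score = riskier."""
    return features @ SHARED_WEIGHTS

def calibrate_intercept(scores, observed_default_rate):
    """Bank-specific step: find the intercept such that the average PD
    implied by the shared scores matches this bank's long-run default rate."""
    gap = lambda b: expit(b + scores).mean() - observed_default_rate
    return brentq(gap, -20.0, 20.0)

# Usage: same rank ordering everywhere, bank-specific absolute PD level.
rng = np.random.default_rng(0)
portfolio = rng.normal(size=(5000, 3))   # stand-in for real obligor features
scores = rank_score(portfolio)
intercept = calibrate_intercept(scores, observed_default_rate=0.025)
pds = expit(intercept + scores)
print(f"calibrated portfolio mean PD: {pds.mean():.4f}")  # ~0.0250
```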

Quantity and type of data

At first sight, the medical models are much more “big data” models than the PD models, using as they do unstructured imaging data, and hence hundreds or thousands of features (that may be hard to interpret) in vector format. Images from different machines may have different granularity and can contain invisible imprints of those machines that are picked up by the deep learning models, but overall one would expect them to be rather homogeneous.

For historical reasons, bank models have been built on a limited set of features. However, unstructured and big data are increasingly used in banking, and show promise. The most obvious data are movements on deposit accounts, or cash flows, as they are the closest one can come to real-time measurement of a debtor’s financial health. For consumer credit, open banking has enabled the sharing of such data (with consent), and it has been used to identify pockets of good credit risk in pools of debtors that would be classified near-prime on credit bureau data alone. While this is usually done through a mixture of numerical feature engineering and credit lending knowledge, it is not too dissimilar from the deep-learning-based vector embeddings of the mammogram images (the image encoder) that codify the key features the model subsequently uses, or from their combination with structured clinical data in hybrid models. Another example involves using cash flow fluctuations of SME clients to forecast financial distress significantly ahead of time. Perhaps such behavioural patterns and the “signatures” they contain may in fact prove universal indicators of default risk. Heuristics of this kind are definitely used in the credit risk world, for instance the notion that a credit line will likely be drawn to the maximum as a company drifts into default.
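
As an illustration, a minimal sketch of the kind of numerical feature engineering on account movements alluded to above. The column names and the specific features are hypothetical choices; a production early-warning model would use far richer ones.

```python
import pandas as pd

def cashflow_features(tx: pd.DataFrame) -> pd.Series:
    """Condense one client's account movements into early warning features.
    Assumes columns 'date' (datetime) and 'amount' (signed: inflows
    positive, outflows negative). Names and choices are illustrative."""
    amounts = tx.set_index("date")["amount"]
    net = amounts.resample("ME").sum()                  # monthly net flow ("M" in older pandas)
    inflow = amounts[amounts > 0].resample("ME").sum()  # monthly inflows
    return pd.Series({
        "net_flow_mean": net.mean(),
        "net_flow_trend": net.diff().mean(),            # drifting towards distress?
        "inflow_volatility": inflow.std() / max(inflow.mean(), 1e-9),
        "months_negative": int((net < 0).sum()),        # persistence of cash burn
    })
```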

Origin of the data and its representativeness

The approach taken by the MIRAI team was to train the model on data from one hospital, and then to back test (validate) its universal applicability on different, diversified populations at other hospitals. There are two reasons why such an approach may be preferable to first creating one super data set covering multiple populations to train the model on. First, the amount of confidential patient data that has to be moved around is minimised. Second, the fact that the patient experience is defined hospital by hospital means that the data gathering processes, and hence data quality procedures, are far from homogeneous. In particular, the target variable, a positive diagnosis, being partly determined by human expertise, may not be unequivocal.

In banking too, aggregating data from different banks is made all but impossible by personal privacy considerations, but also by commercial ones (the profile of a bank’s customer population is competitively sensitive). A distributed, federated model training approach is definitely possible (see appendix), although the stochastic gradient updates exchanged in the process still carry a risk of leaking confidential information. Synthetic data may be another road forward. Regulatory definitions, such as the days-past-due threshold for default, do help instill some consistency in the target variable, the default flag.

Selecting and understanding the risk factors or explanatory features

The algorithms for mammogram analysis are mostly based on ad hoc correlations between synthetic image-derived features and the cancer diagnosis. Credit models, in contrast, will from the outset only use variables that are at least intuitively somewhat related to default risk (for example, the average temperature in the month a credit was originated is not included in the data set to begin with). Moreover, the recurring image features that are indicative of an incipient cancer (e.g. shapes, coloration, texture, …) have very little if anything to do with a causal understanding of the disease. For different asset classes (types of loans), on the other hand, standard candidate lists of explanatory variables of default risk exist that are based on an intuitive understanding of the causes and warning signals of default. For example, when assessing the credit risk of a corporate, the various financial ratios of the company’s balance sheet and income statement are invariably relevant.

Despite this at first sight higher potential for standardisation (at least for the risk differentiation phase), almost all credit scorecard models start with a univariate analysis, whereby from a long list of possible features a subset of, preferably independent, explanatory variables is selected to build the scorecard with. Invariably, banks end up with similar but different scorecards. Good practice on model governance then implies that models are subjected to validation, and sometimes to detailed technical discussions on the length of the time windows of the data sets used, or on which features to keep in the shortlist … A laborious process.
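
For concreteness, a sketch of one classic form of that univariate screen, using weight of evidence and information value. The thresholds in the comments are common rules of thumb, not regulation, and many variants of this screen exist.

```python
import numpy as np
import pandas as pd

def information_value(feature: pd.Series, default_flag: pd.Series, bins: int = 10) -> float:
    """Univariate screen for scorecard candidates: bin the feature and
    compare the distribution of goods vs. bads across bins.
    Common rule of thumb: IV < 0.02 useless, 0.1-0.3 medium, > 0.3 strong."""
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, default_flag)      # columns: 0 = good, 1 = bad
    goods = counts[0] / counts[0].sum()
    bads = counts[1] / counts[1].sum()
    woe = np.log((goods + 1e-6) / (bads + 1e-6))    # weight of evidence per bin
    return float(((goods - bads) * woe).sum())

# Hypothetical screening loop over a candidate long list:
# shortlist = [c for c in candidates.columns
#              if information_value(candidates[c], default_flag) > 0.02]
```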

How these different modelling choices impact the subsequent process flow is rarely considered when assessing their materiality. Usually, the risk differentiation model maps to a limited range of credit rating grades, and it is the latter that drive the subsequent actions (e.g. the need for a second opinion in credit underwriting, the need to ask for extra collateral, …). Multiple models may materially map to the same rating grades and hence the same downstream business decisions. Are they really worth differentiating and maintaining separately?

Finally, there is an elephant in the room when it comes to the choice of features for credit default risk ranking, namely the point in time (PiT) vs. through the cycle (TtC) debate. This pertains to the idea that credit risk can be assessed either taking into account the current point in the business cycle, or, on the contrary, irrespective of where one is in that cycle. Remember, the TtC PD of an entity can still evolve over time because of idiosyncratic risk. In practice, features capture both effects in some hybrid way, and linear models may not be well suited to modelling the complex dependencies. For example, at the top of the business cycle, the leverage of a company may have little impact on its default risk, while at the bottom it may be an important risk driver. Two strategies seem a priori available to build models that better capture the interplay between economic and company-specific risk factors. The first is a head-on big data strategy, covering longer time windows of observed defaults. The conceptual challenge here is that the idiosyncratic risk factors may not have a stable relationship with default over, say, a 50-year window, as the economic fabric evolves. The second approach leverages causal graphs to “impose” more structure and insight on the actual default process, perhaps allowing for shorter time windows in the training data.
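
In formulas, the one-factor (Merton/Vasicek-style) framework in the spirit of Carlehed and Petrov (reference 6) links the two notions: the PiT PD is the TtC PD conditioned on the state of the systematic, business-cycle factor,

\[
\mathrm{PD}_{\mathrm{PiT}}(z) \;=\; \Phi\!\left(\frac{\Phi^{-1}\!\left(\mathrm{PD}_{\mathrm{TtC}}\right) - \sqrt{\rho}\,z}{\sqrt{1-\rho}}\right),
\]

where \(z\) is the realised value of the systematic factor, \(\rho\) the asset correlation and \(\Phi\) the standard normal distribution function. Setting \(z = 0\) gives a cycle-neutral estimate, while a negative \(z\) (a downturn) pushes the PiT PD above its TtC level.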

Compared to interpreting mammogram images, it may just be that credit forecasting is an inherently point-in-time challenge, rendering all attempts to make universal models difficult. But overall, it feels like there is room to progress from trend fitting to embedding structural insights.

Economic incentives

Improvements in medical diagnostics have no direct impact on an individual’s access to medical care. Also, hospitals mostly do not earn more revenue for better diagnostic capability (private clinics aside …). On top of regulation as onerous as in banking, there are therefore few incentives for every hospital to build its own diagnostic model that is just a bit more accurate on its own patient population.

Banks, in contrast, have more economic incentives, including the cost of capital and identifying the best credit risks in homogeneous pools. In particular in the evolving world of consumer credit, where open banking makes more data available for credit assessment and price comparison websites make the cost of credit very transparent to consumers, a modelling arms race is ongoing.

However, when starting a new bank or lending organisation, being able to start from some form of standard model would add value to the industry. Of course, rating agencies fill part of that gap, but not entirely.

Ethical and Legal challenges

There are many ethical and legal challenges that are similar in nature between the credit and the medical imaging case, though the consequences are arguably more important in medicine. Let us name just two. Bias in data sets can get picked up by models and will subsequently be perpetuated. Interventions in the various model development steps to address this are well known, but require care and proper governance in their own right. The interplay between the algorithm’s risk assessment and the expert’s opinion (on credit or disease risk) is also a common topic. It goes without saying that the algorithm-human interaction, including its psychological aspects, is of an altogether different nature in the medical imaging case. Overriding the second opinion of a colleague-expert is one thing; ignoring an AI’s assessment may be something else when a patient’s wellbeing is at stake.

Conclusion

There are several good historical reasons why every bank builds its own credit scorecard. The example of the medical image interpretation AI shows that taking a different approach, aiming for a standard model with wide applicability, is not a priori a lost cause, even if the challenges are big. Maybe we can achieve both better risk management and lower operating costs through some level of model standardisation in the financial industry. Financial services regulators have an important duty here to consider the potential and enable change.


Afterthoughts. A.k.a. Appendix.

1. Disease prediction

To predict the risk of disease, both structured and unstructured data can be used. The former are all sorts of risk factors, reflecting medical and clinical knowledge or describing the individual, such as biometric data. The latter are of course the medical imaging data; they are converted to vectors that capture the essence of the image, using image encoders. Classification models can then be built using each of the two data types separately, or they can be hybrid, combining both. The model of the MIT team also aims to forecast some of the most relevant structured risk factors from the medical images themselves (imputing missing data), so as to build a solution that performs under different circumstances.
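
As a toy illustration of such a hybrid set-up: the data below are simulated stand-ins, and MIRAI’s actual architecture is a deep network rather than this simple logistic head, but the principle of combining an image embedding with structured risk factors is the same.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
image_embeddings = rng.normal(size=(1000, 128))  # stand-in for an image encoder's output
clinical = rng.normal(size=(1000, 5))            # stand-in for structured risk factors
diagnosis = rng.integers(0, 2, size=1000)        # toy labels, purely illustrative

# Hybrid model: concatenate both views and fit a single classifier head.
X = np.hstack([image_embeddings, clinical])
hybrid = LogisticRegression(max_iter=1000).fit(X, diagnosis)
```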

2. Federated learning

In banking, one usually calibrates models using a batch of labelled data. Machine learning, and in particular its web-scale applications, pushed to the fore the power of learning through stochastic gradient descent, a method that updates model parameters progressively, using smaller sets of training data as they become available. This technique could also allow a scorecard to be calibrated on the combined default history of different banks, without the underlying data ever leaving those banks. This is the core idea behind federated learning, a technique that aggregates, for example, the spam classifications of thousands of users, each in their own mailbox, into one high-performing spam filter. To make the procedure even more secure from a data privacy perspective, random noise that cancels out when averaged can be added to the data packages exchanged between the participating banks and a central model aggregator, a technique known as differential privacy.
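
A minimal sketch of one federated-averaging round between banks, on toy data. The Gaussian noise here is a crude stand-in for differential privacy, which in practice also requires gradient clipping and carefully calibrated noise levels.

```python
import numpy as np

def local_gradient(weights, X, y):
    """One bank's logistic regression gradient on its own default data;
    the raw data never leave the bank, only this gradient does."""
    preds = 1.0 / (1.0 + np.exp(-(X @ weights)))
    return X.T @ (preds - y) / len(y)

def federated_round(weights, banks, lr=0.1, noise_scale=0.01, rng=None):
    """One round of federated averaging: each bank computes its gradient
    locally and adds noise before sharing; the central aggregator only
    ever sees the noisy contributions, never the underlying data."""
    rng = rng or np.random.default_rng()
    noisy = [local_gradient(weights, X, y) + rng.normal(0.0, noise_scale, size=weights.shape)
             for X, y in banks]
    return weights - lr * np.mean(noisy, axis=0)

# Toy usage: three "banks", each with a private simulated portfolio.
rng = np.random.default_rng(42)
banks = [(rng.normal(size=(500, 4)), rng.integers(0, 2, size=500).astype(float))
         for _ in range(3)]
w = np.zeros(4)
for _ in range(200):
    w = federated_round(w, banks, rng=rng)
```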

1. https://jclinic.mit.edu/mirai/

2. https://www.csail.mit.edu/news/robust-ai-tools-predict-future-cancer

3. Yala et al., Sci. Transl. Med. 13, eaba4373 (2021).

4. Yala et al., Journal of Clinical Oncology, Vol. 40, No. 16, https://doi.org/10.1200/JCO.21.01337

5. European Banking Authority, EBA/GL/2017/16, 23/04/2018, Guidelines on PD estimation, LGD estimation and the treatment of defaulted exposures.

6. Carlehed and Petrov, A methodology for point-in-time–through-the-cycle probability of default decomposition in risk classification systems, Journal of Risk Model Validation, Vol. 6, No. 3, 3-25, 2012.

Comments

Bespoke models drive the business. General models will put credit providers at risk of disintermediation, much like utility providers.

Alan Forrest, Model Risk Professional (4 months ago):

Frank. I enjoyed your article - so full of ideas and challenging comparisons between credit risk management and medical treatment. I think there is always great benefit in contrasting different modelling cultures. Your case for basic universal models is also pragmatic especially for start ups. I wonder if the comparison with medical modelling needs some caution however. Success for a medical trial is at quite a different end of the probability scale to credit risk. The medics are happy if an 80% mortality rate is reduced significantly to 70%. The credit modellers are refining a 3.2% predicted default rate by 10bp. This reflects the different maturities of the two fields. Credit is scraping an almost empty barrel of new information - ML is making these last efforts possible. But Medicine is still sorting out the first order understandings and procedures - still scattering possible drugs on animal models of diseases and looking for a signal. So sharing trial information is critical to any kind of progress there. On the other hand the tiny pieces of insight that banks can bring to improve credit models are their competitive advantage and closely guarded. Some first thoughts and I hope interesting.
