14 Risk Clinic - Upgrading MRM from Logistic to ML

When a question comes up twice in a fortnight, it’s worth addressing. What extra tests should be done when machine learning models are used instead of classical logistic regressions?

A tradition originating in the nineties, when computing power was limited, and subsequently ingrained and canonised through regulatory rulebooks, has made the logistic regression the de facto standard that everyone working in credit risk modelling is deeply familiar with. Over time, a shared understanding has developed of what good validation and monitoring look like for such models. Increasingly, data science or machine learning (ML) models are being used for the steps in the credit life cycle that are less constrained by the regulatory canvas of Basel IRB or IFRS 9, such as underwriting decisions, marketing decisions, early warning signals and collateral valuation, leveraging both the wealth of available data and readily available machine learning code libraries. Other tasks that are in essence classification processes are increasingly supported by similar models, including transaction monitoring, fraud detection, …

Given this ubiquity of more complex models, it is worth setting out a few model features that everybody should be aware of when moving from “old school” classifiers to big-data-driven ML ones. Where do we need to pay extra attention to ensure the model continues to operate as intended? Models are ideally built to support business decisions that matter for the organisation, a principle also recognised in SS1/23 to signal the scope of its application. The use of (gen)AI to make all sorts of processes more efficient also quickly leads to the need to understand the false positive/false negative trade-off at the different decision and branching points of a process flow (see Risk Clinic 2). The good news is that the outcome risk profile, as defined by the use case or business decision, is not changed by the use of a different, more complex algorithm, even if the process itself becomes more risky. It does mean though that the control environment needs to be scaled up, and the below sets out some of the things to consider.

By using the notion of a classifier, I limit myself here to settings where there is a binary fork in the process flow: underwrite the customer or not, autoclose the transaction alert or not, involve a human operator in the customer request or stay in automated straight-through mode, …

What stays the same?

· The essential trade-off remains the balance between false positives and false negatives, traditionally captured via the confusion matrix for a classifier, or via the ROC curve and Gini coefficient for a continuous variable (score) that is turned into a classifier by the choice of a threshold (see the sketch after this list).

· Understanding the model’s behaviour on different segments of the population remains as important as ever. Machine learning models often accommodate non-linearities (different behaviour in different segments) more readily within one model, but that does not exonerate the user from understanding model performance at the segment level. This is particularly relevant if one needs to understand potential decision biases (gender, ethnicity, …) in the portfolios.

· Business decisions typically use a model’s output; the model’s output is not the decision itself. In credit modelling, this is called the use-test. Understanding the “decision accuracy” before and after interpretation and potential overriding of the model output by the human-in-the-loop, and understanding the interaction between model and human expert, is key. Notice that deciding to use a model’s output for straight-through automated decisions is also a business decision.

· When calibrating the threshold for a continuous score to be used as a classifier, the imbalance between the cost of false positives and the cost of false negatives determines the theoretically optimal threshold (the sketch below also illustrates this). Of course, those costs may not be accurately known. For example, in transaction monitoring within financial crime procedures, a false negative may carry a multiple of the cost of a false positive, which may only trigger some extra procedures and operational expense.
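
To make the two points above concrete, here is a minimal sketch in Python (scikit-learn on synthetic data). The dataset, the logistic model and the 10:1 cost ratio between false negatives and false positives are illustrative assumptions only; the point is simply that the Gini follows directly from the AUC, and that a cost ratio pins down a threshold via the confusion matrix.

```python
# Minimal sketch: confusion matrix, ROC/Gini, and a cost-based threshold choice.
# The data, the model and the unit costs are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # continuous score in [0, 1]

auc = roc_auc_score(y_te, scores)
gini = 2 * auc - 1                              # Gini coefficient from the AUC
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")

# Assumed unit costs: a missed 'bad' (false negative) is taken to be ten times
# as expensive as a false alert that merely triggers extra operational work.
COST_FN, COST_FP = 10.0, 1.0

def expected_cost(threshold):
    pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    return COST_FP * fp + COST_FN * fn

best = min(np.linspace(0.01, 0.99, 99), key=expected_cost)
print("cost-minimising threshold:", round(float(best), 2))
print("confusion matrix at that threshold:")
print(confusion_matrix(y_te, (scores >= best).astype(int)))
```

In a real setting the cost figures would come from the business (the expected loss of a missed bad versus the operational cost of an unnecessary review), and the threshold search would be run on a representative, ideally out-of-time, sample.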

What requires extra attention and a reinforced control environment?

On top of the tests and considerations that are familiar from the logistic model world, there are several characteristics of machine learning models that require the control framework to be reinforced, both in a formal validation phase and for ongoing monitoring during use. For each of the below, model development and model validation documentation should provide adequate understanding, justification and challenge:

· Learning approaches (the recipe used to determine the model’s parameters) are much more varied for ML models than for logistic regression. ML models will tend to exploit the available big data, using multitudes of features that are not a priori down-selected on the basis of their individual predictive power. A technique called “regularisation”, which introduces a penalty if the classifier uses too many feature variables, is often used to keep the models focused on the most information-rich variables (see the first sketch after this list). Moreover, training data may be used as they arise (get generated) through time (e.g. click-through rates in a recommender model), leading to potentially constantly updated and evolving models.

· The “penalty” parameter used in the regularisation is but one example of what are called “hyperparameters” in a machine learning model. These can also include such things as the depth of a modelling tree, or the number of simple models in an ensemble approach, … Each of these parameters provides a control knob that can impact model performance. In theory they too should be calibrated using a two-stage procedure, for example tuned on held-out data or via cross-validation, separate from the data used for the final performance assessment (the first sketch after this list illustrates this).

· Explainability. From the outset, there should be an approach identified to understand the model outcomes as well as possible. This can be at the aggregate level, but for certain applications, legislation (like GDPR) may require the explanation to be available at the level of each individual model application/decision. It is beyond the scope of this text to explore the different techniques for explainability, but they go under names such as Shapley values, partial dependence plots, feature importance analysis, and surrogate models which approximate an ML model locally with a linear (logistic) model, … (a sketch further below shows two of these tools in action).

· It is necessary to understand well how a classifier’s output should be interpreted. In particular, whether the score can be read as a calibrated probability of belonging to one class or the other requires careful consideration (see the calibration sketch below).

· ML models and ML modellers typically seem less focused on data quality and data curation than we are accustomed to in the regulated credit modelling world. While this is partly driven by the volume, velocity and variety of the data, it also relies on some theoretical insights. Regularisation, the recipe used to limit the number of features used in the modelling, can be shown (in least-squares settings) to be equivalent to having noise in the training data. Intuitively, if one starts heating up a magnet (introducing noise), it at some point becomes a demagnetised piece of iron (the predictive power of the aligned micro-magnets is lost). The idea is that some degree of noise in the training data is manageable without losing too much model performance, but at some point the signal is swamped by the noise (a crude illustration is sketched below).

· Code implementation. Many ML models are intricate algorithms, for which different packages may be available. Benchmarking the package used against other, similar ones on the same data is a must (see the benchmarking sketch below).
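
The following sketches illustrate several of the bullets above. All of them use scikit-learn on synthetic data; every dataset, parameter grid and cut-off is an illustrative assumption rather than a recommendation. First, regularisation and hyperparameters: an L1-penalised logistic regression in which the penalty strength C is itself treated as a hyperparameter, calibrated by cross-validated grid search before the final assessment on data the search never saw (one version of the two-stage procedure).

```python
# Sketch: regularisation as a feature-selection device, with the penalty strength
# treated as a hyperparameter and tuned by cross-validation (the "two-stage" step).
# Data and grid values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Many candidate features, only a handful genuinely informative.
X, y = make_classification(n_samples=4000, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=2000),
)

# Stage 1: cross-validated search over the penalty strength C
# (smaller C means a stronger penalty, hence fewer features retained).
grid = GridSearchCV(
    pipe,
    {"logisticregression__C": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    cv=5, scoring="roc_auc",
)
grid.fit(X_tr, y_tr)

# Stage 2: assess the selected model on data the search never saw.
best_lr = grid.best_estimator_.named_steps["logisticregression"]
n_kept = int(np.sum(best_lr.coef_ != 0))
print("best C:", grid.best_params_)
print(f"features kept by the L1 penalty: {n_kept} of {X.shape[1]}")
print("held-out AUC:", round(grid.score(X_te, y_te), 3))
```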
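
Next, explainability. This sketch shows two of the model-agnostic tools mentioned above, permutation-based feature importance and a partial dependence plot, as available in scikit-learn; Shapley values would require an additional library such as shap and are not shown.

```python
# Sketch: two model-agnostic explainability tools available in scikit-learn.
# The data and the model are illustrative; Shapley values are not shown here.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=12, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Global view 1: how much does shuffling each feature hurt held-out performance?
imp = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                             n_repeats=10, random_state=0)
ranked = sorted(enumerate(imp.importances_mean), key=lambda t: -t[1])[:5]
for idx, drop in ranked:
    print(f"feature {idx}: mean AUC drop when permuted = {drop:.3f}")

# Global view 2: partial dependence of the prediction on the most important feature.
PartialDependenceDisplay.from_estimator(model, X_te, features=[ranked[0][0]])
plt.show()
```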
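
Then the interpretation of the output as a probability. This sketch compares the Brier score and calibration curve of a model's raw scores with an isotonically recalibrated version of the same model; only when the calibration curve hugs the diagonal should the score be read as a probability.

```python
# Sketch: can the classifier's score be read as a probability?
# The data and the models are illustrative assumptions.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic", cv=5,
).fit(X_tr, y_tr)

for name, model in [("raw score", raw), ("isotonic recalibration", cal)]:
    p = model.predict_proba(X_te)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_te, p, n_bins=10)
    print(name, "- Brier score:", round(brier_score_loss(y_te, p), 4))
    # frac_pos vs mean_pred should lie close to the diagonal before the score
    # is interpreted as a probability of default/fraud/churn.
```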
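
The noise intuition can be made tangible with a crude experiment: flip an increasing share of the training labels (one possible form of noise) and watch the held-out performance degrade gently at first, then collapse towards coin-tossing. This is purely illustrative and not the least-squares equivalence result referred to above.

```python
# Sketch: a crude illustration of the "heating the magnet" intuition.
# Flipping a growing share of training labels erodes, then destroys, performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
for flip_rate in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < flip_rate
    y_noisy[flip] = 1 - y_noisy[flip]          # flip a fraction of training labels
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"label-flip rate {flip_rate:.0%}: held-out AUC = {auc:.3f}")
```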
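
Finally, benchmarking the implementation. This sketch compares two gradient-boosting implementations on the same data; in practice one would also bring in external packages (XGBoost, LightGBM, …), but two scikit-learn variants keep the sketch self-contained.

```python
# Sketch: benchmarking two gradient-boosting implementations on the same data.
# A large gap between supposedly equivalent implementations warrants investigation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "HistGradientBoosting": HistGradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: held-out AUC = {auc:.3f}")
```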

Where does that leave us from a business decision perspective?

It should be clear from the above that ML recipes and modelling approaches will typically not give “the one true model” that one is used to from credit logistic model applications (or so we are led to believe). There is a cloud of models, neighbours in the model space parametrised by the model’s parameters, from which one cannot readily pick a single preferred model. Yet, even if a model provides a continuous variable as output (e.g. a score), such a variable is usually mapped to a discrete variable (e.g. a credit rating grade) that drives the actual business decision (e.g. reject underwriting below rating X, light-touch follow-up investigation for a medium-risk transaction alert, …). It is this ultimate output/decision that matters, and a slightly different model (a neighbour in the model cloud) may lead to the same business output; a small sketch below makes this concrete. This is the level that determines whether a technical modelling issue or choice is actually a business issue. Understanding the impact of the modelling choices on the final business decision outcomes is key to focusing the technical validation activities on the most relevant aspects of the model development process.
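
A minimal sketch of that last point, under the same kind of illustrative assumptions as before: two models trained with the same recipe but different random seeds (neighbours in the model cloud) produce slightly different scores, yet put the vast majority of cases in the same decision grade once the score is bucketed using hypothetical cut-offs.

```python
# Sketch: two neighbouring models may disagree slightly on the score yet agree
# almost entirely on the decision grade. Models and cut-offs are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=15, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Two "neighbours" in the model cloud: same recipe, different random seeds.
m1 = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
m2 = RandomForestClassifier(n_estimators=300, random_state=2).fit(X_tr, y_tr)
s1 = m1.predict_proba(X_te)[:, 1]
s2 = m2.predict_proba(X_te)[:, 1]

# Map the continuous score to coarse decision grades (hypothetical cut-offs),
# e.g. auto-accept / light-touch review / escalate / reject.
cut_offs = [0.2, 0.5, 0.8]
g1 = np.digitize(s1, cut_offs)
g2 = np.digitize(s2, cut_offs)

print("mean absolute score difference:", round(float(np.mean(np.abs(s1 - s2))), 3))
print("share of cases in the same decision grade:", round(float(np.mean(g1 == g2)), 3))
```

The relevant question for validation is then not whether the two score vectors coincide, but whether the disagreement ever changes the grade, and hence the business decision.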


References and further reading

1. Bank of England PRA, Model Risk Management Principles for Banks, Supervisory Statement SS1/23, May 2023. It states that “Model use is defined as using a model’s output as a basis for informing business decisions”. Moreover, “Business decisions should be understood as all decisions made in relation to the general business and operational banking activities, (…)”.

2. For my pragmatic approach to Fair Algorithmic Decision making, see the first of a thread of 8 posts here: https://www.dhirubhai.net/pulse/fair-algorithmic-decisions-18-frank-de-jonghe/?trackingId=ELtt2utettJYSWJsCjkPGA%3D%3D

3. For a very good Review into Bias in Algorithmic Decision-making by the Centre for Data Ethics and Innovation, see https://assets.publishing.service.gov.uk/media/60142096d3bf7f70ba377b20/Review_into_bias_in_algorithmic_decision-making.pdf
