Black Box Models: A Governance Approach
The deployment of AI, and in particular Machine Learning (ML), has been a hot topic in almost all business sectors in recent years. In the banking industry, one of the most commonly cited use cases is credit risk/credit scoring. While ML models are being used in certain lending segments, the operational use of ML and AI in credit scoring is not yet prevalent, at least not among regulated banks. Some of the main challenges cited include:
- Inability to explain complex models, and how their outputs are generated
- Governance and transparency issues, which create resistance from both internal and external parties to the use of such models
- Increasing customer concerns around data privacy and the ability to understand and challenge decisions made by banks
To deal with the above issues, many organisations are researching ways to explain and interpret the more complex modeling algorithms. Historically, complete model explainability has been a key, and important, part of model governance. However, with the increasing use of more complex models in the risk domain, this policy, as well as other rules around the governance of models, may need to be revisited. Recognising that modeling itself is only part of a much larger process, a more holistic approach that takes into account all the steps in model development should be considered to get past the explainability issue. In fact, what is needed is a satisfactory approach to model governance that assumes the model itself is not explainable.
In this article, I present a high-level framework that can be used to allay some of the main concerns around model interpretability and governance.
Let's start by defining the process of model development across three broad areas (recognising that model development is a complex process with many more tasks):
1. Data Management: the process of acquiring data, performing exploratory work on it, data cleansing, variable transformation, etc. that takes place before any model fitting is done.
2. Model Training: the process of taking the above data and applying predictive modeling algorithms to it. This is the area where the use of black box models and their interpretability causes the most issues.
3. Model Output: the predictions/outputs of the model.
A comprehensive approach to model governance would cover all three areas to provide end users with increased confidence around black box models. Let's consider each of these model development areas in sequence.
Data Management
If the model itself is complex and cannot be explained, understanding its input variables offers several ways to avoid some of the problems this might cause.
These include:
- Making sure that the data is clean, free from significant biases, and does not contain any variables that could cause legal or ethical issues or have dubious causality. I have seen, for example, development datasets that contain variables such as ‘likes’ and ‘follows’ of various media outlets. These are obtained from social media accounts and can cause problems. In multi-cultural countries such as the UK, Canada, the US and Australia, immigrants may choose to like and follow ethnic or religious media. If these variables enter a model, you may be making decisions based on race, religion, ethnic origin or language, which would contravene laws in many countries. If all the input variables are deemed clean and free from the issues cited above, we can then accept that all of them are viable candidates for the model. This should allay some concerns even if the content of the model is unknown. We recognise, of course, that removing these obvious problematic variables does not eliminate the issue altogether, as there may still be other variables that can indirectly cause legal/ethical problems (e.g. merchant codes for credit card transactions).
- Performing various tests to ensure that variables have explainable relationships to the target. This can be done via variable ranking algorithms as well as Weight of Evidence (WoE) based groupings. The latter is useful in that it shows the nature of the relationships as well as the statistical correlations, and can help get buy-in from business users. While recognising that the more complex ML algorithms exploit interactions and relationships that cannot be captured by simple bivariate analysis, this at least demonstrates that the input dataset does not contain data with counter-intuitive or unexplainable relationships to the target. If all the data used in the model is of reasonable strength and can be explained, this would again allay concerns around the use of black box models. A minimal sketch of such a check is shown after this list.
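As an illustration of the kind of bivariate check described above, here is a minimal sketch of a Weight of Evidence / Information Value calculation in Python. It assumes a pandas DataFrame `df` with a binary target column (1 = bad, 0 = good); the column names and binning choices are purely illustrative, not prescriptive.

```python
import numpy as np
import pandas as pd

def woe_table(df, feature, target, bins=5):
    """Bin a numeric feature and compute Weight of Evidence (WoE) and
    Information Value (IV) against a binary target (1 = bad, 0 = good)."""
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")
    grouped = df.groupby(binned)[target].agg(["count", "sum"])
    grouped.columns = ["total", "bad"]
    grouped["good"] = grouped["total"] - grouped["bad"]

    # Distribution of goods and bads per bin; a small epsilon avoids log(0)
    eps = 1e-6
    dist_good = (grouped["good"] + eps) / (grouped["good"].sum() + eps)
    dist_bad = (grouped["bad"] + eps) / (grouped["bad"].sum() + eps)

    grouped["woe"] = np.log(dist_good / dist_bad)
    grouped["iv_contrib"] = (dist_good - dist_bad) * grouped["woe"]
    return grouped, grouped["iv_contrib"].sum()

# Hypothetical usage: rank candidate variables by Information Value and
# check that the WoE pattern across bins is monotonic / business-sensible.
# table, iv = woe_table(df, "utilisation_ratio", "bad")
# print(table[["total", "bad", "woe"]]); print("IV =", round(iv, 3))
```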
The above two steps would result in an input dataset that has been accepted by all stakeholders, including model development, model validation and business users. While we may not know which variables end up in the final model, we would at least have confidence that whatever has gone in is of good quality, unbiased, legal/ethical, and has strong, explainable relationships to the target being modeled.
Model Training
This is the step where the data from above is used to build models. Some modeling techniques, such as regression, are more explainable than others, such as Random Forests or Neural Networks. In cases where direct explanation of the model is not possible, various techniques provide proxies, including:
- Surrogate Models: more transparent models (e.g. points-based scorecards) are built to predict the outcome of the black box one. This can be done at a global or local level. A commonly used local surrogate method is Local Interpretable Model-agnostic Explanations (LIME).
- Sensitivity analyses: inputs to the black box model are perturbed and the impact measured in various ways, including changes in the predictions and errors. Methods include feature importance based on variable perturbation, Partial Dependence plots, Individual Conditional Expectation plots and Shapley values. A brief sketch of a perturbation-based check follows this list.
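To make the sensitivity analysis idea concrete, below is a minimal sketch using scikit-learn's permutation importance and partial dependence utilities. The synthetic data and the gradient boosting classifier are stand-ins for the cleaned development dataset and the black box model; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned development dataset produced in the
# Data Management step; in practice X and y would come from that process.
X, y = make_classification(n_samples=5000, n_features=8, n_informative=5,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(8)])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)  # the "black box"

# Permutation importance: shuffle each input column in turn and measure the
# drop in validation AUC; large drops flag the variables the model leans on.
result = permutation_importance(model, X_valid, y_valid, scoring="roc_auc",
                                n_repeats=10, random_state=0)
for name, drop in sorted(zip(X.columns, result.importances_mean),
                         key=lambda t: -t[1]):
    print(f"{name:8s} mean AUC drop: {drop:.4f}")

# Partial dependence: average predicted probability as one variable is swept
# across its range; the shape should match business intuition.
PartialDependenceDisplay.from_estimator(model, X_valid, features=["var_0"])
```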
All the techniques described above rely on approximations and have their own advantages and disadvantages. In particular, the use of a simple model to explain a much more complex one can be problematic. However, in the absence of 100% certainty, we can use a combination of several methods to get some sense of the variables that may be in the model and their relative contributions.
In addition to looking at the variables, one would of course follow the usual model governance practices, such as out-of-sample validation and review of model fit statistics.
Model Output
After checking the data and the model, the third part of the process is to look at the output and perform some simple tests on it. The results of these tests are meant to give us a sense of comfort around the robustness of the black box model. Some of these tests are standard practice for most models developed in banks.
- Benchmarking: compare the results of the black box models with those from simpler, explainable ones. These can include comparisons of model fit statistics such as Area Under the ROC Curve (AUC), KS, Gini, AR and so on, the distribution of cases by risk/score bands, and measures of false and true positives/negatives, depending on the underlying business case for the model.
- Stability: check the overall population stability of both types of models over time (the Population Stability Index is a common measure for this). We can only measure overall stability, since the stability of inputs cannot be checked individually when we do not know which ones are in the model. The black box model should not be significantly less stable.
- Backtesting: check the performance of the black box and benchmark models on historical data, as well as the stability of predicted values against actuals. This can be extended to include impacts on capital, approval rates or other relevant measures over time. In credit risk, for example, we want to make sure that the model will not result in large swings in such measures. This is in addition to the ‘out of sample’ validation done as part of the modeling exercise. A sketch of some of these checks follows this list.
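As a rough illustration of these output checks, the sketch below computes Gini, KS and a Population Stability Index from predicted probabilities. The names `y_valid`, `p_blackbox`, `p_benchmark` and `p_recent` are hypothetical placeholders for the validation target, the two models' predictions, and a more recent out-of-time score sample; the decile-based PSI is one common convention, not the only one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini(y_true, y_score):
    """Gini coefficient derived from the AUC (2*AUC - 1)."""
    return 2 * roc_auc_score(y_true, y_score) - 1

def ks_statistic(y_true, y_score):
    """Kolmogorov-Smirnov statistic: maximum separation between the
    cumulative score distributions of goods and bads."""
    order = np.argsort(y_score)
    y = np.asarray(y_true)[order]
    cum_bad = np.cumsum(y) / y.sum()
    cum_good = np.cumsum(1 - y) / (1 - y).sum()
    return float(np.max(np.abs(cum_bad - cum_good)))

def psi(expected, actual, bins=10):
    """Population Stability Index between the development-time score
    distribution and a more recent one, using decile cut-offs taken
    from the development sample."""
    expected, actual = np.asarray(expected), np.asarray(actual)
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    exp_pct = np.bincount(np.digitize(expected, cuts), minlength=bins) / len(expected)
    act_pct = np.bincount(np.digitize(actual, cuts), minlength=bins) / len(actual)
    eps = 1e-6
    return float(np.sum((act_pct - exp_pct) * np.log((act_pct + eps) / (exp_pct + eps))))

# Hypothetical comparison: p_blackbox and p_benchmark are predicted
# probabilities on the validation sample, p_recent on an out-of-time sample.
# print("Gini:", gini(y_valid, p_blackbox), "vs", gini(y_valid, p_benchmark))
# print("KS:  ", ks_statistic(y_valid, p_blackbox))
# print("PSI: ", psi(p_blackbox, p_recent))  # > 0.25 is often treated as a red flag
```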
The approach I have outlined can help improve trust in cases where the model algorithm is a black box and cannot be explained. By validating all the processes surrounding the model, it should be possible to use such models with some confidence.
Naeem Siddiqi is the author of Intelligent Credit Scoring (Wiley & Sons, 2017).