Risk Clinic 16 – Refineries of the new oil (Part 1: Data Harvesting Strategy)
Frank De Jonghe
EY Partner, Lead for the application of modelling, analytics & AI to Risk & Compliance across all industries
Social network information makes the circle of friends of somebody who has just committed insurance fraud discoverable. Cash flow patterns reflecting shop sales allow the expected usage of working capital credit facilities to be forecast in real time, but also enable early detection of a deterioration in the company’s prospects. The intensity and tone (sentiment) of e-mail traffic between teams in the first and second line can elucidate the culture of consultation and challenge on risk topics.
These are just a few examples of how abundant data can significantly further the risk management cycle of Measure – Understand – Predict/Nudge. However, one needs to have the data, understand their provenance, and be legally allowed to use them. Preparing the risk management of the future therefore requires defining and implementing a systematic data harvesting strategy to be able to exploit this potential. It should be high on the agenda of every CRO.
In a later Risk Clinic we will dive deeper into the question of whether “we can” implies “we should”, and we will put a spotlight on data storytelling, a skill set that banks could use more of to improve risk reporting and hence risk insights. Readers may wish to go back to Risk Clinics 8 and 10, which touch upon related topics and frame my perspective; links are in the references.
Exploration for the new oil: collecting and harvesting data
Accessible external data sources have multiplied over the past two decades, covering human individuals, companies and all sorts of processes, be they industrial or more administrative. Our online life leaves a digital slime trail, and the Internet of Things (IoT) allows for real-time information on the physical world. System log files provide limitless raw material for business process mining.
It is clear that leaving the discovery of new, promising data sources, and their use in convincing analyses, to luck rather than to systematic, targeted development is not a road to success. Risk departments should have a conscious data harvesting strategy covering the following activities:
- Horizon scanning for promising data. These can be external sources, but there is also the potential to get new insights from internal data. For example, payment flows between SME customers may give a refreshing perspective on credit risk and credit contagion. This requires constant nudging of employees to think out of the box, to make them Data Hunters. I was once asked to provide an external benchmark for collateral haircuts, while the company, as a global custodian, had access through a service it provided to the haircuts that different market participants apply per asset class. It was all there, waiting to be analysed. It is currently difficult to reliably quantify the impact of climate change on loan books, but its impact on real estate valuations is surely starting to come through in recovery data. Are these being collected and analysed? FinTechs, including payment services and online credit providers, are a particular source of inspiration in this regard.
- Perform targeted and well-defined pilot studies to confirm the (potential) future value of new data sources for clearly defined risk objectives. This can include reliance on published success stories of peers and academics. For example, it is by now well known that network centrality measures applied to payment flows between SMEs are good indicators of credit risk (a minimal illustration follows this list), but do they add incremental information to existing measures, e.g. for early warning or contagion risk? Pilots can take quite a long time, in particular if one wants to validate that certain newly available data are good risk predictors in different phases of the business cycle.
- Define a collection, vetting and storage strategy for data deemed of future value. It is important to collect quite a bit of metadata, including the way the data was collected, to ensure that they become reliable raw material for future model builds.
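As an illustration of the network centrality example above, the following minimal sketch (in Python, using the networkx library) computes a weighted PageRank score per company from a table of inter-company payments. The column names (payer_id, payee_id, amount) and the toy records are hypothetical; in practice such a score would be tested as a candidate early-warning feature alongside existing credit risk drivers.

# Minimal sketch: centrality of SMEs in a payment network as a candidate credit risk feature.
# Assumes a table of inter-company payments with hypothetical columns payer_id, payee_id, amount.
import pandas as pd
import networkx as nx

payments = pd.DataFrame({
    "payer_id": ["A", "A", "B", "C", "D", "D"],
    "payee_id": ["B", "C", "C", "D", "A", "B"],
    "amount":   [120.0, 40.0, 75.0, 30.0, 55.0, 10.0],
})

# Build a directed graph; edge weight = total amount paid from one company to another.
edges = payments.groupby(["payer_id", "payee_id"], as_index=False)["amount"].sum()
G = nx.DiGraph()
for _, row in edges.iterrows():
    G.add_edge(row["payer_id"], row["payee_id"], weight=row["amount"])

# Weighted PageRank: high values flag companies that many others depend on for cash flow.
centrality = nx.pagerank(G, weight="weight")

# These scores would then be tested for incremental power next to existing early-warning indicators.
print(sorted(centrality.items(), key=lambda kv: -kv[1]))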
This experimentation and the subsequent data storage come at a cost. Learning from what worked for others is therefore a good attitude, as it is for all innovation.
Learnings from the last decade of exploration
There is so much happening in the world of data that should be considered in such a data harvesting strategy, and each of the topics below would merit its own deep-dive exploration. We mention only a few big themes here:
- Among the areas to cover in the data horizon scanning are the increasingly abundant open-source data vaults (Google even has a bespoke search engine for them; see the references). However, given their not always well-documented provenance and the uncertainty about continued availability (they are often of a one-off nature), their use may be limited to proving the value of certain data types in a pilot study. Long-term reliance should not be placed on them without proper due diligence.
- Another ingredient of a long-term data strategy is to actively contribute to, or shape, collective data gathering exercises for areas that are not competitive in nature. In market risk and financial instrument pricing there are several successful examples, such as complex derivative consensus pricing. To some extent, credit bureau scores could be considered a collective data service. An example of a bottom-up industry collaboration in the credit space is Global Credit Data, a group of close to 50 banks pooling credit loss data. One could even go one step further and use federated learning to build a classifier (e.g. for a suspicious payment in fraud or AML transaction monitoring) on data from different banks without such data ever leaving the banks. These kinds of initiatives, though, seem slow to take shape. Having consensus data as a starting point works for compliance, but is harder when you want to differentiate in a competitive space like credit underwriting. And then there is …
- The ever-present concern about data privacy when it comes to personal data, or data confidentiality for business-related data. Apart from obtaining the necessary approvals (right of use, consent, data minimisation …), including through contractual means, to use certain data to feed models, it is of the utmost importance to be transparent about one’s intentions. This can actually be turned into a positive message to customers about how they are in control (see for instance the IKEA Data Promise for an example of compliance packaged into a great customer empowerment narrative).
- Finally, partly because of these data privacy concerns, no data strategy is complete without building up a Synthetic Data capability: the capacity to build a fictitious data set that is free of identifiable traces to the original data set entries (in other words, addressing the data privacy challenge), but that still allows one to build “a very similar model” to the one built on the original data (a minimal sketch follows this list). Although there are many simple and complex approaches to synthetic data, I have so far not come across a detailed approach to confidently manage this trade-off between the representativeness of the synthetic data and the degree of privacy that is achieved. Notice that data retention limitations stemming from data privacy legislation may also necessitate the use of synthetic data to extend the new oil’s lifetime.
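To make the idea concrete, here is a deliberately naive sketch of a synthetic data generator: it fits a multivariate normal distribution to a few hypothetical numeric columns and samples fresh records from it. This only illustrates the mechanics; real programmes use richer generators (copulas, generative networks) and, crucially, must still measure the representativeness-versus-privacy trade-off discussed above, which this simple recipe does not guarantee by itself.

# Minimal, deliberately naive sketch of a synthetic data generator.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical original data: loan exposure, turnover and days-past-due per customer.
original = pd.DataFrame({
    "exposure":      rng.lognormal(mean=10, sigma=0.5, size=1_000),
    "turnover":      rng.lognormal(mean=12, sigma=0.7, size=1_000),
    "days_past_due": rng.poisson(lam=3, size=1_000).astype(float),
})

# Fit a simple parametric model of the joint distribution ...
mu = original.mean().to_numpy()
cov = np.cov(original.to_numpy(), rowvar=False)

# ... and draw synthetic records that mimic means and correlations but map to no real customer.
# (A Gaussian fit ignores skew and non-negativity; that is part of why richer generators exist.)
synthetic = pd.DataFrame(rng.multivariate_normal(mu, cov, size=1_000), columns=original.columns)

# A first representativeness check: do the correlations survive the generation step?
print(original.corr().round(2))
print(synthetic.corr().round(2))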
A head of risk data analytics once told us: “Our board has the vision that our risk models get updated every few weeks while we’re sleeping, like the navigation system of our Tesla.” Those updates, though, are driven by continuous data harvesting from the fleet in use.
Appendix: The four Vs of data with a banking twist
The abundance of data presents four challenges, each with a twist for the risk professional in financial services:
- Variety: Banks are traditionally used to data that comes in tabular, structured form. Text documents of all sorts, audio recordings of customer conversations and images (e.g. in the context of KYC procedures) have added unstructured information to the mix. Moreover, techniques like network analytics have created new types of data to handle and, even more, a new paradigm for analysing relevant questions. Often, too, the challenge is having the data at adequate granularity for the desired risk analysis, or being able to aggregate different data sources in a meaningful way (e.g. the CDO look-throughs during the GFC, or the recent challenge by the PRA with respect to Private Equity ecosystems and their interlinkages).
- Velocity: Data is available in real time, and the structural patterns it reflects may vary (relatively) quickly. In banking, modelling happens with different data windows, with credit risk often looking at a full economic cycle and market risk more focused on capturing current conditions. Customer propensities, fraud and transaction monitoring are, in modelling approach, closer to credit risk (classifiers), but can, like market risk, have more short-term data and modelling needs to capture evolving patterns.
- Volume: There is just so much data being generated. Beyond the data a bank needs to hold for running its business (like account movements), what should be stored for future use? Hence the need to actively investigate what could be useful.
- Veracity: Is the data error-free and consistent, and can it hence be used as a trusted foundation on which to build a model? Data sources linked to financial transactions and account movements can be presumed to be pretty accurate, and data needed to feed regulatory reporting is subject to stringent data quality standards. Certain insights can be extracted from noisy data too (a data science technique called regularisation behaves like having noise in the data; a small illustration follows this list), but when is the signal extracted from the noise just good enough?
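To illustrate the remark that regularisation behaves like noise in the data, here is a minimal sketch under simplifying assumptions (linear least squares, Gaussian input noise): fitting ordinary least squares on many noise-perturbed copies of the inputs approximately reproduces ridge regression with a penalty proportional to the noise variance. The data and parameter choices are purely illustrative.

# Minimal sketch of the "regularisation is like noise in the data" remark:
# OLS on many input-noise-perturbed copies of a dataset approximately equals
# ridge regression with penalty lambda = n * sigma^2.
import numpy as np

rng = np.random.default_rng(seed=0)

n, d, sigma = 200, 5, 0.3
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Ridge regression with lambda = n * sigma^2 (closed form).
lam = n * sigma**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# OLS on K copies of the data, each with fresh Gaussian noise added to the inputs.
K = 2000
X_noisy = np.vstack([X + rng.normal(scale=sigma, size=X.shape) for _ in range(K)])
y_stack = np.tile(y, K)
w_noise = np.linalg.lstsq(X_noisy, y_stack, rcond=None)[0]

# The two coefficient vectors should be close, illustrating the equivalence in expectation.
print(np.round(w_ridge, 3))
print(np.round(w_noise, 3))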
Federated Learning
Federated learning is a model training technique that allows one to determine the impact on a model’s parameters of given examples (learning instances) that sit in distributed inventories, without bringing all such instances together in one big model training set. The standard example is the training of a spam filter, which turns millions of users pushing the SPAM button on their devices on an ongoing basis into one performant spam classifier. In essence, the algorithm uses the current model parameters to do the classification, observes the reaction of the user (confirmation of the forecast or contradiction) and then sends back to the central model an indication of how to gently tweak the model parameters to reflect this false positive or false negative (stochastic gradient descent). Even though the data never leaves the decentralised premises, there is still some information leakage risk in this approach. Notice that such a set-up is ideally suited for continuous updating of the model and hence lends itself nicely to domains where dynamics change on an ongoing basis. A minimal sketch of the idea follows.
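The sketch below (Python, numpy only) is a simplified federated gradient-averaging loop for a logistic-regression classifier, not a production-grade protocol: five simulated “banks” each compute a gradient on their own hypothetical labelled data, and only those gradients travel to the central server.

# Minimal sketch of federated learning for a binary classifier (logistic regression):
# each simulated client computes a gradient on its local data and only the parameter
# update travels to the central server; the raw records never leave the client.
import numpy as np

rng = np.random.default_rng(seed=1)

def make_client_data(n=500, d=4):
    # Simulate one institution's local, labelled dataset (e.g. suspicious vs normal payments).
    X = rng.normal(size=(n, d))
    true_w = np.array([1.5, -2.0, 0.5, 0.0])
    p = 1.0 / (1.0 + np.exp(-X @ true_w))
    y = rng.binomial(1, p)
    return X, y

clients = [make_client_data() for _ in range(5)]  # five participating institutions

def local_gradient(w, X, y):
    # Gradient of the average logistic loss, computed entirely on the client's premises.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

# Central server: start from a shared model and average the clients' updates each round.
w = np.zeros(4)
learning_rate = 1.0
for _ in range(200):
    grads = [local_gradient(w, X, y) for X, y in clients]  # only gradients are shared
    w -= learning_rate * np.mean(grads, axis=0)

# The aggregated model should approach the data-generating weights [1.5, -2.0, 0.5, 0.0].
print(np.round(w, 2))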
Differential Privacy
If the inclusion of a (few) specific data point(s) influences a model’s reply to a given input, then probing the model could allow an attacker to recover (some of) those data. Technical definitions of differential privacy (and they are opaque …) then aim to ensure quantitatively that adding one record to a training set has a negligible impact on the outputs one wants to generate from the trained model: roughly, for any two datasets differing in one record, the probability of any given output changes by at most a factor exp(ε).
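As a concrete toy example of this guarantee, here is a minimal sketch of the classic Laplace mechanism for a count query (a standard, well-known ε-differentially private mechanism, offered purely as an illustration): because adding one record changes a count by at most 1, adding Laplace noise with scale 1/ε makes the released count ε-differentially private. All names and data below are hypothetical.

# Minimal sketch of differential privacy via the Laplace mechanism for a count query.
import numpy as np

rng = np.random.default_rng(seed=7)

def private_count(flags, epsilon):
    # Release the number of True flags with epsilon-differentially-private noise.
    true_count = int(np.sum(flags))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Two neighbouring datasets: they differ in exactly one customer's record.
dataset_a = np.array([True] * 40 + [False] * 60)
dataset_b = np.append(dataset_a, True)

# With a small epsilon the two noisy answers are statistically hard to tell apart,
# which is exactly the "negligible impact of one record" property described above.
print(private_count(dataset_a, epsilon=0.5))
print(private_count(dataset_b, epsilon=0.5))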
References
1. Two earlier Risk Clinics touched upon related topics:
On the role of the CRO as fact checker and storyteller:
On cognitive psychology and the design of reporting flows:
2. Google even has a bespoke search engine for open source datasets: https://datasetsearch.research.google.com . Two examples of websites: https://ourworldindata.org (beautiful examples of visualisation); https://data.nasa.gov (quite a bit of earth-linked data relevant in a climate change context).
3. The IKEA data promise. https://www.youtube.com/watch?v=j1MsEl9cTRc
4. World Economic Forum, The next generation of data-sharing in Financial Services, White Paper, September 2019. Available at https://www3.weforum.org/docs/WEF_Next_Gen_Data_Sharing_Financial_Services.pdf
5. World Economic Forum, Unlocking Greater Insights with Data Collaboration, Briefing Paper, January 2022. Available at https://www3.weforum.org/docs/WEF_Unlocking_Greater_Insights_2022.pdf
6. For a stock take on Federated Learning, see Federated Learning: Challenges, Methods, and Future Directions, Tian Li, Anit Kumar Sahu, Ameet Talwalkar, Virginia Smith. Preprint available at https://arxiv.org/abs/1908.07873