Discover or Learn Patterns?
Aditya Khandekar, CFA
Chief Revenue Officer | Analytics & Strategy Leader | 3AI Thought Leader | Fintech Enthusiast
Just think about it for a sec.... the question is non-trivial!
For a business leader who wants to leverage analytics in a new initiative, this question always comes up, bluntly or subtly. The dilemma here is:
1. Do I have enough scenarios of patterns already identified from my business knowledge or processes which I can amplify using machine learning? OR
2. Do I have a broad definition of the problem (which is real) but very few scenarios concretely available to drive analytical solutions?
In case “1”, supervised methods are applicable; in case “2”, you will need to use unsupervised methods to understand patterns and develop decisioning based on them.
So what’s the big deal? Let’s try to explain this with an example. Let’s say I am looking for fraud patterns in the traditional channel (in-store purchases) and in digital channels (like mobile) for a retailer.
Discovering versus Learning Patterns for Digital Fraud
Let’s say the retailer launched a digital channel for selling pet food recently (6 months ago). The retailer doesn't have much experience with fraud in this channel.
Issues
a. Since digital fraud scenarios are new, I might use clustering or outlier detection techniques to find customer purchase patterns that could be considered outliers. I might also use time-series event modelling, such as Markov chains or recurrent neural networks, to understand customer behavior over time and spot anomalous behavior. The issue is that I don’t know whether the outliers identified really are outliers.
b. The analytics team then needs to go back to the business SMEs (domain experts) and ask them to manually verify and “tag” these outliers, which is an unexpected additional burden, especially if the volumes to analyze are large. Why is this important?
c. Tagging becomes important because the business is nervous about putting such systems into production when the risk of False Positives is high, given their adverse impact on customer experience.
d. Essentially, the business and analytics teams are flying blind and have to make a “leap of faith” that some mouse-trap is better than none! The unsupervised approach then needs constant refinement and re-learning, based on fraud data collected post-deployment, to make it more effective at capturing the fraud universe (sensitivity of the model) and at improving the quality of detection (precision of the model).
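To make the "discover, then tag" loop concrete, here is a minimal sketch. It assumes a simple median/MAD outlier rule as a stand-in for the clustering or outlier detection techniques mentioned above, and it uses made-up order amounts and made-up SME tags. Once tags exist, sensitivity and precision can finally be measured.

```python
from statistics import median

def flag_outliers(amounts, threshold=3.5):
    """Flag amounts whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return [False] * len(amounts)
    return [0.6745 * abs(a - med) / mad > threshold for a in amounts]

def sensitivity_and_precision(flagged, tagged_fraud):
    """Compare the unsupervised flags against the tags supplied by domain experts."""
    pairs = list(zip(flagged, tagged_fraud))
    tp = sum(1 for f, t in pairs if f and t)
    fp = sum(1 for f, t in pairs if f and not t)
    fn = sum(1 for f, t in pairs if not f and t)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision

# Hypothetical digital-channel order amounts in dollars (illustrative only).
order_amounts = [22, 25, 19, 30, 27, 24, 410, 21, 26, 380, 23, 150]

# Hypothetical SME tags: orders at positions 3, 6 and 9 were confirmed as fraud.
sme_tags = [i in (3, 6, 9) for i in range(len(order_amounts))]

flagged = flag_outliers(order_amounts)
sensitivity, precision = sensitivity_and_precision(flagged, sme_tags)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
# prints: sensitivity=0.67 precision=0.67
```

In this toy data the $150 order is flagged but is not fraud (a False Positive), and the $30 fraud case looks perfectly normal (a miss). Neither problem is visible until someone tags the cases.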
Resolution
a. Challenge the business team and the analytics team to see if you can break the problem down into a series of narrow-footprint analytical problems for which you have a reasonable understanding of fraud behavior (even if it is by proxy). You might not catch all the fraud, but it's better to machine learn from patterns in existing data than to try to discover them. In our example, there might be some cross-over fraud patterns from the in-store world (like payment fraud or item return fraud) which are applicable to digital channels. Build supervised models to capture this behavior and get immediate business impact. Manage False Positives carefully through descriptive analysis of non-fraud cases, and build business rules which overlay on top of model scores to reduce False Positives with minimal impact on fraud detection.
b. Go out and collect data from controlled experiments and observe/analyze fraud behavior. Yes, that means you might need to wait 3-4 months until some patterns start to emerge, but that might help create a better mouse-trap downstream.
c. See if you can purchase external data at point of digital purchase (for example ID Vision from TransUnion provides a device risk score) to augment your feature set for prediction.
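The rule overlay from point "a" can be sketched as follows. This is an illustration under stated assumptions, not a production design: the model scores are taken as given (from whatever supervised model you build), and the score threshold, the `Transaction` fields and the loyalty-based exemption rule are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    model_score: float              # supervised fraud score in [0, 1] (assumed given)
    loyalty_years: float            # hypothetical feature from non-fraud analysis
    ships_to_address_on_file: bool  # hypothetical feature from non-fraud analysis
    is_fraud: bool                  # known tag, used here only for evaluation

SCORE_THRESHOLD = 0.7  # hypothetical cut-off on the model score

def flagged_by_model(txn):
    return txn.model_score >= SCORE_THRESHOLD

def exempt_by_rule(txn):
    """Hypothetical business rule: long-standing customers shipping to a
    known address rarely turn out to be fraud, so do not flag them."""
    return txn.loyalty_years >= 2 and txn.ships_to_address_on_file

def flagged_with_overlay(txn):
    return flagged_by_model(txn) and not exempt_by_rule(txn)

def true_and_false_positives(txns, decide):
    tp = sum(1 for t in txns if decide(t) and t.is_fraud)
    fp = sum(1 for t in txns if decide(t) and not t.is_fraud)
    return tp, fp

# Made-up transactions, for illustration only.
transactions = [
    Transaction(0.92, 0.1, False, True),
    Transaction(0.81, 0.0, False, True),
    Transaction(0.78, 3.0, True, False),
    Transaction(0.74, 5.5, True, False),
    Transaction(0.71, 0.5, True, False),
    Transaction(0.30, 1.0, True, False),
    Transaction(0.15, 4.0, True, False),
]

model_only = true_and_false_positives(transactions, flagged_by_model)
with_overlay = true_and_false_positives(transactions, flagged_with_overlay)
print("model only (TP, FP):", model_only)
print("with rule overlay (TP, FP):", with_overlay)
```

On this toy data the overlay removes two of the three False Positives while both fraud cases are still caught. On real data you would check that trade-off before trusting any rule.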
Broadly speaking, I see the unsupervised approach as "transient" in nature: you will eventually migrate to a supervised approach once you have sufficient tagged data and you understand fraud patterns well. We have also built semi-supervised models which sequence clustering with supervised models to drive a higher detection rate and fewer False Positives.
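One possible way to sequence clustering with a supervised step is sketched below. It is not necessarily how our models are built. The assumptions: a tiny k-means runs on all (untagged) transactions, the distance to the nearest cluster centre becomes the feature, and a one-threshold classifier is then fitted on the small tagged subset. The features, data points and starting centres are made up.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]

def kmeans(points, centroids, iterations=10):
    """Unsupervised step: plain k-means from the given starting centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)[0]].append(p)
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def fit_cut_off(distances, labels):
    """Supervised step: choose the distance cut-off with the fewest errors on tagged cases."""
    best_cut, best_errors = None, len(labels) + 1
    for cut in sorted(distances):
        errors = sum(1 for d, y in zip(distances, labels) if (d >= cut) != y)
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

# Hypothetical features per transaction: (order amount in dollars, orders that day).
all_transactions = [
    (20, 1), (25, 1), (30, 2), (22, 1), (28, 1),
    (110, 1), (120, 2), (125, 1), (115, 1),
    (60, 9), (200, 8),
]
centroids = kmeans(all_transactions, centroids=[(20, 1), (110, 1)])

# Small subset tagged by domain experts (True = fraud); tags are made up.
tagged = [((25, 1), False), ((30, 2), False), ((110, 1), False),
          ((125, 1), False), ((60, 9), True), ((200, 8), True)]
tagged_distances = [nearest(p, centroids)[1] for p, _ in tagged]
cut_off = fit_cut_off(tagged_distances, [y for _, y in tagged])

# Score new, unseen transactions.
new_transactions = [(24, 1), (118, 2), (75, 10)]
predictions = [nearest(p, centroids)[1] >= cut_off for p in new_transactions]
print(predictions)  # prints: [False, False, True]
```

The clustering learns what "normal" looks like from all the data, while the scarce tags are spent only on deciding how far from normal is far enough to flag.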
At Scienaptic, we are working closely with clients and helping them navigate such issues to deliver real business impact.
I would appreciate your feedback and comments. How are you dealing with such issues in your analytical journeys?
PI Data & Analytics at Travelers
7y Really insightful and relevant post, Aditya. We were just discussing the challenges of this with a team the other day, and while third-party data sets are useful, they still add even more data to an already complex situation. I would add too that some business teams are somewhat frightened by knowing what they don't know, i.e., will the discovery you're proposing have an adverse impact on my results in the future? Am I signing up for something that may well paint me into a corner? Regardless, I like the path you're proposing and see many applications for that direction. As always, great insights, and I would love to hear more.
Strategy & Analytics at Faire
7y Great post, Aditya. I like your description of an unsupervised -> data annotation -> supervised journey. Lack of labeled data combined with vast quantities of data continues to be a challenge in many contexts. I see Active Learning as a promising area of innovation to reduce the human time required to comb through and annotate cases, many of which will be FPs. https://drive.google.com/file/d/1Mx45sFHG5cOPMHmEF6u_CmxVG-_7TPKe/view
Managing Director at JPMorgan Chase & Co., Wholesale Payments
7y Aditya - Great question. In my experience you have to use both supervised and unsupervised methods. I agree with you that supervised models and known patterns are easier to implement and more easily accepted within the organization. Business and operations teams are quite reluctant to add the overhead of unsupervised methods, but sometimes all it takes is one event to bring broad change in the organization, and that's when you present the data analyzed by your unsupervised models and get them accepted as the new normal. Pradeep.