Graph Customer Similarity



… or how we devised a new way to calculate similarity across networks of shops…

This project was more a piece of research than a product, but it showed real potential to generate significant revenue for the Bank I worked at. Because the work is novel and in-depth, this short case study will not give all the details of the implementation. However, the project was presented extensively at the time, and if you are interested you can find write-ups and presentations online here https://www.slideshare.net/HarryPowell4/graph-recommendations


[BTW, I am tempted to revisit the work in the light of watching Jure Leskovec’s fascinating Stanford University course on Machine Learning with Graphs https://www.youtube.com/playlist?list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn ]


At the core of machine learning techniques is a distance metric telling you how similar one data point is to another. We became interested in how similar two businesses (or two customers) were to each other. This could be used in a recommendation engine, for example. We thought about our transactional data, which is normally represented as a sequence of transactions. However, it can also be reimagined as a graph or network, much like the better-known social networks of Facebook or Twitter. Mathematically, any set of pairs (in this case transactions between customers and retailers) can be thought of as a graph, in this case a bipartite one, with each retailer connected to other retailers through a set of shared customers. This offers the opportunity to measure the distance between businesses (or alternatively between customers) in preference space, with a business being more similar the fewer steps it takes to reach it through the graph.
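To make the representation concrete, here is a minimal sketch of building such a bipartite graph from a list of transactions. The customer and retailer names are invented and it assumes the networkx library; it is an illustration, not the pipeline we actually used.

```python
# A minimal sketch of re-imagining card transactions as a bipartite graph.
# The customers, retailers and transactions below are invented.
import networkx as nx

transactions = [
    ("alice", "ikea_bristol"),
    ("alice", "corner_cafe"),
    ("bob", "ikea_bristol"),
    ("bob", "bike_shop"),
    ("carol", "corner_cafe"),
]

G = nx.Graph()
customers = {c for c, _ in transactions}
retailers = {r for _, r in transactions}
G.add_nodes_from(customers, bipartite=0)  # one side of the graph: customers
G.add_nodes_from(retailers, bipartite=1)  # the other side: retailers
G.add_edges_from(transactions)            # one edge per customer-retailer pair

# "Distance in preference space": fewest steps between two retailers.
print(nx.shortest_path_length(G, "corner_cafe", "bike_shop"))  # 4, via alice and bob
```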


A lot of academic work has been done on graph theory and there are many metrics and concepts for measuring distance, but none of those conventionally used quite worked in this case. The key metric in graph theory is minimum distance, in other words the length of the shortest path which connects two nodes. This does not work in our kind of network because, for example, virtually everyone in the south west of England has been to the Ikea at Cribbs Causeway in Bristol: there are hubs in the network. This means that the minimum distance between any two shops is always either one or two steps, which is not very useful.
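As a small illustration of the hub problem, the sketch below (again with invented data and networkx) builds a bipartite graph in which every customer visits one local shop plus a single hub store. Projecting onto the shops, every pair of local shops ends up exactly two steps apart, so the metric carries almost no information.

```python
# Sketch of the hub problem: if one store is visited by every customer, the
# shortest-path distance between any two shops collapses. All data invented.
import itertools
import networkx as nx
from networkx.algorithms import bipartite

customers = [f"cust_{i}" for i in range(200)]
local_shops = [f"shop_{i}" for i in range(20)]

B = nx.Graph()
B.add_nodes_from(customers, bipartite=0)
B.add_nodes_from(local_shops + ["hub_store"], bipartite=1)

for i, c in enumerate(customers):
    B.add_edge(c, local_shops[i % len(local_shops)])  # everyone has a local shop
    B.add_edge(c, "hub_store")                        # and everyone visits the hub

# Project onto shops: two shops are adjacent if they share at least one customer.
shops = bipartite.projected_graph(B, local_shops + ["hub_store"])
dists = {nx.shortest_path_length(shops, a, b)
         for a, b in itertools.combinations(local_shops, 2)}
print(dists)  # {2}: every pair of distinct local shops is exactly two steps apart
```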


Alternatively, one could think about the distance between two businesses using cosine similarity, a conventional metric in recommender systems, but this is again flawed in any network which is sparse. Cosine similarity measures the overlap between the customers of two shops, but given that people tend to shop locally, this overlap may be virtually nonexistent between shops in different localities. However, there may be considerable overlap at the second or third degree of separation, which conventional metrics do not capture.
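Here is a toy sketch of that failure mode, using made-up 0/1 customer-incidence vectors: cosine similarity scores two shops with no common customers as completely dissimilar, even when a third shop connects them at the second degree.

```python
# Toy illustration of cosine similarity over customer-incidence vectors.
# Shops and customers are invented; a 1 means the customer shopped there.
import numpy as np

#                           c1 c2 c3 c4 c5 c6
shop_bristol   = np.array([ 1, 1, 1, 0, 0, 0])
shop_newcastle = np.array([ 0, 0, 0, 1, 1, 1])
shop_bridge    = np.array([ 0, 0, 1, 1, 0, 0])  # shares one customer with each

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(shop_bristol, shop_newcastle))  # 0.0: no overlap, so "maximally far apart"
print(cosine(shop_bristol, shop_bridge))     # ~0.41: the indirect link above is invisible
```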


We wanted to come up with a metric that could measure the distance between two nodes using the information in the entire graph. We imagined standing in a shop (the source node) with a number of red tokens, giving one to each customer who entered the shop, and instructing them to mark that token and then hand it over to a random customer at the next shop they visited, instructing that customer to do likewise. We then imagined standing at another shop (the destination node) waiting for any customer with a token, which we would collect. The average number of marks on the collected tokens would be the expected degrees of separation (EDS) between the source and destination nodes. This gave us a similarity metric that met our requirements.
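A simple way to see what the metric computes is to simulate the thought experiment directly. The sketch below is a hypothetical Monte Carlo version: tokens random-walk from shop to customer to shop, and we average the hop counts of the tokens that reach the destination (a cap on the walk length keeps the simulation finite).

```python
# Hypothetical Monte Carlo version of the token experiment. bipartite_adj maps
# every node (shop or customer) to a list of its neighbours in the graph.
import random

def estimate_eds(bipartite_adj, source_shop, dest_shop,
                 n_tokens=10_000, max_hops=50):
    marks = []
    for _ in range(n_tokens):
        shop, hops = source_shop, 0
        while hops < max_hops:
            customer = random.choice(bipartite_adj[shop])  # hand a token to a customer
            shop = random.choice(bipartite_adj[customer])  # they carry it to their next shop
            hops += 1                                      # one more mark on the token
            if shop == dest_shop:
                marks.append(hops)                         # collect the token
                break
    return sum(marks) / len(marks) if marks else float("inf")

# Tiny invented graph: shop_A and shop_C are only connected through shop_B.
adj = {
    "shop_A": ["c1", "c2"], "shop_B": ["c2", "c3"], "shop_C": ["c3"],
    "c1": ["shop_A"], "c2": ["shop_A", "shop_B"], "c3": ["shop_B", "shop_C"],
}
print(estimate_eds(adj, "shop_A", "shop_C"))  # estimated expected degrees of separation
```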


So far so good. But the real challenge was how to compute this in a network of, say, 1 million businesses. The model can be represented as a Markov transition matrix with absorbing states, which has a closed-form solution. Unfortunately, the closed form involves inverting a 1-million-by-1-million matrix, once for every destination node, in order to calculate the EDS between every pair of nodes. Given that matrix inversion is an O(n^3) process, that's going to take a long time.
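For a toy shop-to-shop transition matrix, the closed-form calculation looks roughly like this: make the destination absorbing, take the transient block Q, and solve (I - Q) t = 1 for the expected number of hops to absorption. The matrix here is invented; the point is that the linear solve (or inversion) has to be repeated for every destination node.

```python
# Closed-form sketch for one destination: make it absorbing, take the transient
# block Q of the shop-to-shop transition matrix and solve (I - Q) t = 1.
# P is an invented 3-shop matrix; the real one had around a million rows.
import numpy as np

P = np.array([       # row-stochastic: P[i, j] = probability of hopping shop i -> shop j
    [0.0, 0.6, 0.4],
    [0.5, 0.0, 0.5],
    [0.3, 0.7, 0.0],
])
dest = 2             # the destination shop

transient = [i for i in range(len(P)) if i != dest]
Q = P[np.ix_(transient, transient)]                  # transitions that avoid the destination
t = np.linalg.solve(np.eye(len(Q)) - Q, np.ones(len(Q)))
print(dict(zip(transient, t)))                       # expected hops from each shop to dest

# Repeating this solve for every one of ~1 million destinations is what makes
# the exact O(n^3) approach intractable at full scale.
```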


We thought quite a lot about how to reorder the transition matrix into a nearly block-diagonal form so that it could be partitioned efficiently (it is, after all, quite unlikely that a customer of a supermarket in Bristol is also a customer of a supermarket in Newcastle), and we came up with a number of promising strategies. An alternative is to use a sampling methodology to estimate EDS, or simply to apply the transition matrix a limited number of times in the hope that the EDS estimate will have converged sufficiently by then. Both of these methods are pretty quick in comparison to the exact solution, particularly if you can compress your transition matrix into a small enough size to fit into the memory of a GPU card.
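Here is a sketch of the truncated approach, using the same kind of toy matrix as above: instead of solving the linear system exactly, we advance the walk a fixed number of hops with cheap matrix-vector products and compute the expected hop count conditional on arriving within the cut-off. The cut-off value is arbitrary, exactly as discussed below.

```python
# Truncated estimate: advance the walk at most max_hops steps with cheap
# matrix-vector products instead of a full linear solve. Same invented P as above.
import numpy as np

P = np.array([[0.0, 0.6, 0.4],
              [0.5, 0.0, 0.5],
              [0.3, 0.7, 0.0]])

def truncated_eds(P, source, dest, max_hops=10):
    transient = [i for i in range(len(P)) if i != dest]
    Q = P[np.ix_(transient, transient)]   # hops that stay away from the destination
    r = P[transient, dest]                # probability of arriving on the next hop
    walk = np.zeros(len(Q))
    walk[transient.index(source)] = 1.0   # the walk starts at the source shop

    mass, weighted_hops = 0.0, 0.0
    for k in range(1, max_hops + 1):
        p_arrive = float(walk @ r)        # P(first reach dest at exactly hop k)
        mass += p_arrive
        weighted_hops += k * p_arrive
        walk = walk @ Q                   # advance one more hop without arriving
    return weighted_hops / mass           # EDS, ignoring paths longer than max_hops

print(truncated_eds(P, source=0, dest=2, max_hops=8))  # close to the exact value above
```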


The results were interesting. EDS produced a much more intuitive notion of distance than the conventional metrics, but the estimate of EDS with a limited path length turned out to be even better than the exact calculation using the closed-form solution. This is unusual: normally you would expect the exact solution to be the best. However, it turns out that only the shorter paths between source and destination nodes carry much informational value. Once a path departs greatly from a relatively direct route, its arrival at the destination is as much coincidence as anything else. By limiting the paths with an arbitrary cut-off distance we eliminated that noise and ended up with much better results. This was a surprise to us and a big learning: occasionally estimation can improve the informational quality of a result.


The other big learning was that computation is often the real problem in data science. This is often overlooked: people assume that you can just buy enough processing power to solve any problem. But a lot of machine learning methodologies have non-linear complexity, while our ability to add processors is essentially linear. We needed to use all our understanding of what was going on “under the bonnet” of the computer in order to make our calculation tractable, and we needed to take computational considerations into account virtually from the start of the project, even when we were thinking about the analytical challenge. One of the reasons we were able to do this is that the team at the Bank was structured to integrate the mathematicians and the computer scientists very closely. We never built small-data prototypes and then threw them over the wall to some poor data engineer to scale. I don't believe that often results in a satisfactory product.


A final interesting thing about this project was that while we built it as a research project with a particular use case in mind (recommendation), it will probably be used more for another use case: partitioning data for competitor cohort analysis. Alternatively, the vector of distances between businesses or customers is effectively a complete description of a customer's latent preferences or a business's values. This is like a strand of DNA which can describe a subject in much more detail than any set of features drawn from a cross-sectional database, and it has more credibility than clickstream behaviour data because it is derived from actual purchases. While it is important that data scientists focus on problems that the business needs to solve, it is also essential that your data scientists are given some freedom to develop new ideas and new technologies which have the potential to transform your business. This project was one of those.
