登录查看更多内容

KISSing Algorithms: Graph Matching

Srinath Sridhar

CEO & Co-Founder at regie.ai

发布日期: 2015年2月14日

As a computer science freshman one of the first things you learn is to KISS. This is because software tends to get extremely complex incredibly quickly. Most engineers, especially those who build large systems know this and they KISS. Personally over the years the best products and designs that I have been involved in building have all obeyed the principle - KISS: Keep It Simple, Stupid.

Algorithms however tend to be immensely complicated even before you write a single line of code. For a decade and a half, what has interested me the most is how to keep algorithms simple, almost stupid, so that they can be implemented and maintained in production software while still having measurable theoretical guarantees.

This write-up talks about the story and some of the results of our paper that will be presented later this year at the WWW 2015 conference in Florence. The work was supported in part by BloomReach Inc, and the National Science Foundation.

The Story: I wanted to write a set of programming puzzles for BloomReach that were also related to the problems that we solved internally. The first puzzle (and the one that I expected will get the most submissions) is the following. There are n people who visit a winery and each of them write down d wines that they liked and would be happy to buy any c <= d of them. The winery has one bottle of each of their m wines and would of course like to sell as many bottles as possible. Given the set of wines written down by each person, the puzzle asks to figure out the assignment of wines to people so that the number of wines sold can be maximized.

The problem formally is that of a fractional matching in a bipartite graph. Matching problems and its variations have a lot of applications. In the paper we talk about how it can be used to study websites with recommendations (such as YouTube, Quora, Netflix, any website really) from the angle of discoverability. For e.g, I can ask the following basic and related question - what fraction of YouTube videos (or Quora questions etc) do not have a single other YouTube video (Quora questions etc) recommending it? This is of course if we remove personalization. Its a simple question, yet one that likely no one would have an answer to. It is also the most basic measure that can be used to ensure that great content does not remain forever hidden.

Getting back to the puzzle though - we can convert the wine tasting problem into a graph with left and right partitions. The left partition represents people, the right the wines and the edges are the wines that they liked (d out-edges from each vertex on the left) and the output is a subgraph with c out-edges from each vertex on the left. The goal is to maximize the number of vertices on the right in the output subgraph with an indegree of at least 1. For those familiar with algorithms, note that this problem can be converted into the classic bipartite matching by ‘splitting’ each vertex on the left into c identical vertices.

There are several well-known algorithms to solve these matching problems. Two of the classics that every computer science undergrad learns is Ford-Fulkerson and Hopcroft-Karp. While the classic algorithms are not too complicated, they are not trivial either. They need super-linear runtime which isn't scalable in practice, use significant memory overhead and are hard to implement on a distributed model. In the more general problem of graph matching, implementing Edmond’s Blossom is much more complex.

Problems with Solutions: For the wine-tasting coding puzzle, I wanted to create sample instances that I can allow people submitting answers to test their code against. I did this by randomly sampling edges. I ran BloomReach’s internal implementation of the matching algorithm and it found solutions that were essentially ‘perfect.’ That is - there was a solution where every single person got c wines and the winery sold as many wines as they possibly could. I thought there was something wrong with the random graph generation but quickly realized that its too simple to have a bug. This meant that the production BloomReach code likely had a bug which was disturbing of course. But thankfully after a while I ruled that out as well.

Eureka Moments: As a side-note you can watch one of my former collaborators talk about eureka moments. Often people don’t jump up with ‘eureka’ but much more often with ‘hmm, thats weird.’ This was one such instance. So I proceeded to work out the chance that this can happen and I did the following back-of-the-envelope calculation.

The Combinatorics: Instead of running the classic algorithms, let's say that we just randomly pick the c wines for each person. Again, if you are familiar with algorithms note that if c=d this algorithm’s performance and the question of an existence of a (near) perfect matching is the same. We can do this by computing the probability of selling a bottle for the sampling algorithm. To do that, it is easier to start with computing the probability that a specific wine was not sold. For this to happen, the wine must not be picked by anyone. The probability that it did not appear in a single pick by a single person is (1-1/m) since it is ‘competing’ against the other m wines. Pushing some details under the rug, the probability that it was not picked by anybody is (1 - 1/m)^cn. Therefore, the probability that it was in fact picked is 1 - (1 - 1/m)^cn = 1 - ((1 - 1/m)^m)^(cn/m) which a person with a graduate course in probability or algorithms would recognize as 1 - e^{-cn/m} for (even slightly) large values of m. If you want a proof, apply limit m -> infinity, apply natural log and use L’ Hospital’s rule. It is also likely possible to prove this using some algebra on Taylor series of e^x, using squeeze lemma or Stirling’s approximation but maybe you can work it out. Therefore the expected number of wines that will be sold is m (1 - e^{-cn/m}).

Also, I realized that there are two trivial ‘perfect’ solutions that would make the winery ‘perfectly’ happy. One where every person is satisfied and therefore the winery sells cn wines. The other where every wine, m, that the winery has is sold. Intuitively, the first case happens when there are few very people and a lot of wines. The second when there are a lot of people and very few wines. Therefore it is impossible to do better than either and therefore the best solution is upper-bounded by min {cn, m}.

For visualizing, if we set k = n/m, then m(1 - e^{-cn/m}) / min{cn, m} = (1 - e^{-ck}) / min {ck, 1}. This is the performance guarantee for the simplest bipartite matching algorithm that will ever be devised.

In the plot, ck < 0.2 and ck > 2.4, defines a quantitative measurement for the contention for the wines by the people and represent the ranges when the contention is too low or too high. Only when the contention is smack in between, there is any real distinction in the quality of the solutions between those sent in by people who wrote very simple code from those who implemented the classics. When the problem instances fall outside that range, it really does not matter much. Simultaneously, what this also showed is that only in between those ranges does the input graph not have a (near) perfect matching. Outside of those ranges, there is likely a solution that will make the winery ‘perfectly’ happy.

At this point, I called up a prior collaborator of mine, Ravi, and we started working with Arda Antikacioglou, a Ph.D. student and thanks to BloomReach’s support developed further results.

Real-World Implications: So what does all of this mean in the real-world?

(1) Lets say that you run a website that has recommendations. Nearly every website from Netflix, Quora, Amazon, YouTube have recommendations. You are looking at a video, a product or a question and they recommend a few related videos, products or questions on the right side or below the main content. Let us revisit the simple question. What is the fraction of videos (or products etc) that are there on your website that is not recommended from a single other video (or product)? You are thinking and hoping that it is a fraction of a percent since you want a random surfer to be able to navigate across your website. But for large websites that fraction could be incredibly high. This is just the most basic test on how well connected your website is and if you have a lot of content that no one has a chance of finding serendipitously. How do you mitigate this problem? Besides quality based metrics that you are using to find the say 3 best recommendations, you might want to retrieve and then randomly pick the 3 from them. What our results seem to indicate intuitively is that even throwing in a little bit of randomness might go a long way.

(2) Lets say that you are trying to solve the fractional or simple bipartite matching problem. And there are a ton of different places where they show up. If you have a reason to believe that the underlying graph has edges that are selected close to uniformly at random, then check the above plot and perhaps all you need to do is select a set of edges randomly for each vertex. Even, if you do not know the underlying graph distribution just try the above simple (stupid, really) algorithm. If your solution is close to the theoretical upper bound of min{cn, m}, you are done. Your code will be completely trivial and super easy to maintain. If you wanted a better solution, then we show that you can do the above random process in 'rounds.' Select an edge at random in each round. Remove the vertices on the right that have an indegree of 1. Repeat for c rounds. We show that the above heuristic does incredibly well both in theory and practice.

There are a lot of fun extensions and more realistic models of these problems that you can find in our paper and a lot more that are yet to be explored. Overall, as always, don’t think too much, its not worth it. Just KISS.

要查看或添加评论，请登录

Srinath Sridhar的更多文章

Writing Papers at Startup: My Work at Onera

2023年2月20日

Writing Papers at Startup: My Work at Onera

Last month, Carnegie Mellon's wrote an article about our work at Onera, my previous startup. This was based on our…
New Frontiers in AI: The Next Generation of Enterprise Text Editors

2021年7月19日

New Frontiers in AI: The Next Generation of Enterprise Text Editors

I am definitely the last person to drink the kool-aid or jump on a new fashionable technology fad. People who know me…
Corona Virus 7-Day Forecast & An Opportunity to do some Algebra

2020年3月27日

Corona Virus 7-Day Forecast & An Opportunity to do some Algebra

This is the second part of the corona virus simulation and you may be interested in our more extensive simulation of…

1 条评论
Computer Simulations to Show the Impact of Social Distancing in Spreading the Coronavirus

2020年3月17日

Computer Simulations to Show the Impact of Social Distancing in Spreading the Coronavirus

We designed a model simulation to show how diseases such as the corona virus can spread through communities. While the…

14 条评论
A Simple, Elegant Mathematical Structure at the Heart of Fulfillment Supply Chain

2019年12月8日

A Simple, Elegant Mathematical Structure at the Heart of Fulfillment Supply Chain

This has been an article based on an observation I made a long time back but never had a chance to write it up. If you…
Streaming Machine Learning

2019年3月19日

Streaming Machine Learning

This article builds on a few of the previous write-ups. While still independent, and definitely suitable for a general…
What do the Apollo missions, drones and Onera have in common?

2019年2月25日

What do the Apollo missions, drones and Onera have in common?

We don’t think much about computer controlled systems that routinely help everyone. But they are the heroes behind the…
Real World Modeling: Joint Estimation & Optimization

2018年7月24日

Real World Modeling: Joint Estimation & Optimization

As described in the previous post, one of Onera’s core product offerings involves a problem modeled at a 10,000 foot…

1 条评论
Fast Simple Solutions to the Knapsack Problem

2017年5月22日

Fast Simple Solutions to the Knapsack Problem

I find it rewarding to look at algorithms that have provable guarantees while also being very simple to implement…

5 条评论
The Four Software Engineering Personality Types

2016年2月8日

The Four Software Engineering Personality Types

People are fascinated by personality types, such as the Myers-Briggs (MBTI). Psychological classifications are…

34 条评论

See all articles

KISSing Algorithms: Graph Matching

Srinath Sridhar

CEO & Co-Founder at regie.ai

Srinath Sridhar的更多文章

社区洞察

其他会员也浏览了

Fastest way to finetune and deploy Large Language Model without writing any code

Explainable ML models with SHAP

Limitations of the Linear Kalman Filter

LLM fine-tuning and model selection + other resources

Is Machine Learning an Art, a Science or Something Else?

TensorFlow.js Monthly #7: RoboFlow.js, Coral Edge TPU acceleration for Node.js, and OCR recognition in the browser

SVM machine learning

Decision Tree: Building Machine Learning Model

Linear Discriminant Analysis (LDA)

XG Boost Algorithm

Srinath Sridhar的更多文章

Writing Papers at Startup: My Work at Onera

New Frontiers in AI: The Next Generation of Enterprise Text Editors

Corona Virus 7-Day Forecast & An Opportunity to do some Algebra

Computer Simulations to Show the Impact of Social Distancing in Spreading the Coronavirus

A Simple, Elegant Mathematical Structure at the Heart of Fulfillment Supply Chain

Streaming Machine Learning

What do the Apollo missions, drones and Onera have in common?

Real World Modeling: Joint Estimation & Optimization

Fast Simple Solutions to the Knapsack Problem

The Four Software Engineering Personality Types

社区洞察

其他会员也浏览了

Fastest way to finetune and deploy Large Language Model without writing any code

Explainable ML models with SHAP

Limitations of the Linear Kalman Filter

LLM fine-tuning and model selection + other resources

Is Machine Learning an Art, a Science or Something Else?

TensorFlow.js Monthly #7: RoboFlow.js, Coral Edge TPU acceleration for Node.js, and OCR recognition in the browser

SVM machine learning

Decision Tree: Building Machine Learning Model

Linear Discriminant Analysis (LDA)

XG Boost Algorithm