KISSing Algorithms: Graph Matching
As a computer science freshman one of the first things you learn is to KISS. This is because software tends to get extremely complex incredibly quickly. Most engineers, especially those who build large systems know this and they KISS. Personally over the years the best products and designs that I have been involved in building have all obeyed the principle - KISS: Keep It Simple, Stupid.
Algorithms however tend to be immensely complicated even before you write a single line of code. For a decade and a half, what has interested me the most is how to keep algorithms simple, almost stupid, so that they can be implemented and maintained in production software while still having measurable theoretical guarantees.
This write-up talks about the story and some of the results of our paper that will be presented later this year at the WWW 2015 conference in Florence. The work was supported in part by BloomReach Inc, and the National Science Foundation.
The Story: I wanted to write a set of programming puzzles for BloomReach that were also related to the problems that we solved internally. The first puzzle (and the one that I expected will get the most submissions) is the following. There are n people who visit a winery and each of them write down d wines that they liked and would be happy to buy any c <= d of them. The winery has one bottle of each of their m wines and would of course like to sell as many bottles as possible. Given the set of wines written down by each person, the puzzle asks to figure out the assignment of wines to people so that the number of wines sold can be maximized.
The problem formally is that of a fractional matching in a bipartite graph. Matching problems and its variations have a lot of applications. In the paper we talk about how it can be used to study websites with recommendations (such as YouTube, Quora, Netflix, any website really) from the angle of discoverability. For e.g, I can ask the following basic and related question - what fraction of YouTube videos (or Quora questions etc) do not have a single other YouTube video (Quora questions etc) recommending it? This is of course if we remove personalization. Its a simple question, yet one that likely no one would have an answer to. It is also the most basic measure that can be used to ensure that great content does not remain forever hidden.
Getting back to the puzzle though - we can convert the wine tasting problem into a graph with left and right partitions. The left partition represents people, the right the wines and the edges are the wines that they liked (d out-edges from each vertex on the left) and the output is a subgraph with c out-edges from each vertex on the left. The goal is to maximize the number of vertices on the right in the output subgraph with an indegree of at least 1. For those familiar with algorithms, note that this problem can be converted into the classic bipartite matching by ‘splitting’ each vertex on the left into c identical vertices.
There are several well-known algorithms to solve these matching problems. Two of the classics that every computer science undergrad learns is Ford-Fulkerson and Hopcroft-Karp. While the classic algorithms are not too complicated, they are not trivial either. They need super-linear runtime which isn't scalable in practice, use significant memory overhead and are hard to implement on a distributed model. In the more general problem of graph matching, implementing Edmond’s Blossom is much more complex.
Problems with Solutions: For the wine-tasting coding puzzle, I wanted to create sample instances that I can allow people submitting answers to test their code against. I did this by randomly sampling edges. I ran BloomReach’s internal implementation of the matching algorithm and it found solutions that were essentially ‘perfect.’ That is - there was a solution where every single person got c wines and the winery sold as many wines as they possibly could. I thought there was something wrong with the random graph generation but quickly realized that its too simple to have a bug. This meant that the production BloomReach code likely had a bug which was disturbing of course. But thankfully after a while I ruled that out as well.
Eureka Moments: As a side-note you can watch one of my former collaborators talk about eureka moments. Often people don’t jump up with ‘eureka’ but much more often with ‘hmm, thats weird.’ This was one such instance. So I proceeded to work out the chance that this can happen and I did the following back-of-the-envelope calculation.
The Combinatorics: Instead of running the classic algorithms, let's say that we just randomly pick the c wines for each person. Again, if you are familiar with algorithms note that if c=d this algorithm’s performance and the question of an existence of a (near) perfect matching is the same. We can do this by computing the probability of selling a bottle for the sampling algorithm. To do that, it is easier to start with computing the probability that a specific wine was not sold. For this to happen, the wine must not be picked by anyone. The probability that it did not appear in a single pick by a single person is (1-1/m) since it is ‘competing’ against the other m wines. Pushing some details under the rug, the probability that it was not picked by anybody is (1 - 1/m)^cn. Therefore, the probability that it was in fact picked is 1 - (1 - 1/m)^cn = 1 - ((1 - 1/m)^m)^(cn/m) which a person with a graduate course in probability or algorithms would recognize as 1 - e^{-cn/m} for (even slightly) large values of m. If you want a proof, apply limit m -> infinity, apply natural log and use L’ Hospital’s rule. It is also likely possible to prove this using some algebra on Taylor series of e^x, using squeeze lemma or Stirling’s approximation but maybe you can work it out. Therefore the expected number of wines that will be sold is m (1 - e^{-cn/m}).
Also, I realized that there are two trivial ‘perfect’ solutions that would make the winery ‘perfectly’ happy. One where every person is satisfied and therefore the winery sells cn wines. The other where every wine, m, that the winery has is sold. Intuitively, the first case happens when there are few very people and a lot of wines. The second when there are a lot of people and very few wines. Therefore it is impossible to do better than either and therefore the best solution is upper-bounded by min {cn, m}.
For visualizing, if we set k = n/m, then m(1 - e^{-cn/m}) / min{cn, m} = (1 - e^{-ck}) / min {ck, 1}. This is the performance guarantee for the simplest bipartite matching algorithm that will ever be devised.
In the plot, ck < 0.2 and ck > 2.4, defines a quantitative measurement for the contention for the wines by the people and represent the ranges when the contention is too low or too high. Only when the contention is smack in between, there is any real distinction in the quality of the solutions between those sent in by people who wrote very simple code from those who implemented the classics. When the problem instances fall outside that range, it really does not matter much. Simultaneously, what this also showed is that only in between those ranges does the input graph not have a (near) perfect matching. Outside of those ranges, there is likely a solution that will make the winery ‘perfectly’ happy.
At this point, I called up a prior collaborator of mine, Ravi, and we started working with Arda Antikacioglou, a Ph.D. student and thanks to BloomReach’s support developed further results.
Real-World Implications: So what does all of this mean in the real-world?
(1) Lets say that you run a website that has recommendations. Nearly every website from Netflix, Quora, Amazon, YouTube have recommendations. You are looking at a video, a product or a question and they recommend a few related videos, products or questions on the right side or below the main content. Let us revisit the simple question. What is the fraction of videos (or products etc) that are there on your website that is not recommended from a single other video (or product)? You are thinking and hoping that it is a fraction of a percent since you want a random surfer to be able to navigate across your website. But for large websites that fraction could be incredibly high. This is just the most basic test on how well connected your website is and if you have a lot of content that no one has a chance of finding serendipitously. How do you mitigate this problem? Besides quality based metrics that you are using to find the say 3 best recommendations, you might want to retrieve and then randomly pick the 3 from them. What our results seem to indicate intuitively is that even throwing in a little bit of randomness might go a long way.
(2) Lets say that you are trying to solve the fractional or simple bipartite matching problem. And there are a ton of different places where they show up. If you have a reason to believe that the underlying graph has edges that are selected close to uniformly at random, then check the above plot and perhaps all you need to do is select a set of edges randomly for each vertex. Even, if you do not know the underlying graph distribution just try the above simple (stupid, really) algorithm. If your solution is close to the theoretical upper bound of min{cn, m}, you are done. Your code will be completely trivial and super easy to maintain. If you wanted a better solution, then we show that you can do the above random process in 'rounds.' Select an edge at random in each round. Remove the vertices on the right that have an indegree of 1. Repeat for c rounds. We show that the above heuristic does incredibly well both in theory and practice.
There are a lot of fun extensions and more realistic models of these problems that you can find in our paper and a lot more that are yet to be explored. Overall, as always, don’t think too much, its not worth it. Just KISS.