Approximate nearest neighbors (ANN)
Finding the nearest neighbors in your data is a very useful exercise in hundreds of applications, and you will often find KNN models making their way into your solution in one way or another. They are used in recommendation engines, anomaly detection, classification of semantically similar documents, and more, but they have one serious problem: scalability.
KNN models are lazy, non-parametric models, which means that when you call the famous model.fit(), they don't actually learn anything; they simply store a reference to the training data that they will later use to perform the search.
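Here is a minimal sketch of that behavior using scikit-learn's brute-force NearestNeighbors (the dataset is random and just for illustration): fit() does no training, while kneighbors() performs the full linear scan.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10_000, 100)          # 10k items, 100-dimensional vectors
knn = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="cosine")
knn.fit(X)                               # just stores a reference to X, no learning

query = np.random.rand(1, 100)
distances, indices = knn.kneighbors(query)  # the full O(N) scan happens here
```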
So far so good, but why is scalability an issue here? Simply because the algorithmic complexity of running a single query is linear, O(N). Thus, if you have N users for whom you want to find the nearest neighbors, the overall complexity becomes quadratic, O(N²), which is not really an option in the big data era. Note that we assumed the distance measure runs in constant time, O(1), which is rarely the case with high-dimensional data.
To make sure you understand that last part: say you are using cosine similarity to measure the distance between two items in a dataset, and each item is represented as a high-dimensional vector (100+ entries). You will be computing a dot product between the vectors on every single comparison, which adds another layer of cost proportional to the dimensionality.
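To make that concrete, here is a rough NumPy sketch of a single brute-force cosine-similarity query (the dataset size and dimensionality are made up for illustration): every query has to take a dot product against all N item vectors, so one query costs O(N·d).

```python
import numpy as np

N, d = 1_000_000, 128                    # 1M items, 128-dimensional vectors
items = np.random.rand(N, d)
item_norms = np.linalg.norm(items, axis=1)

def brute_force_top_k(query, k=10):
    # One dot product per item: O(N * d) work for a single query.
    sims = items @ query / (item_norms * np.linalg.norm(query))
    return np.argsort(-sims)[:k]         # indices of the k most similar items

top_10 = brute_force_top_k(np.random.rand(d))
```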
Imagine YouTube running this algorithm for their 2.6 billion active users, and then waiting a few years until you get one recommendation that you may or may not like.
I think you now understand why it is usually a bad idea to use KNN in production systems, but surely there should be a solution to this, right? Yes, and this is where approximate nearest neighbors search (ANN) comes in, which can execute a single query in O(log N) complexity. That is a huge difference, even if it sacrifices a little accuracy; exact KNN was never really an option to consider anyway. There are several implementations of ANN, including Facebook's PySparNN (for sparse interaction matrices), Annoy (Approximate Nearest Neighbors Oh Yeah) by Spotify, and others.
Here is a visualization of how Annoy works:
How does it work?
The idea is very simple: you pick two points from your data at random and split the data with the hyperplane that is equidistant between them. Then, within each partition, you pick two more points and repeat, stopping once a partition contains at most K points.
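Here is a rough Python sketch of that splitting idea (this is not Annoy's actual implementation, just an illustration of the recursion under the assumptions above): pick two random points, split on the equidistant hyperplane, and recurse until a node holds at most K points.

```python
import numpy as np

K = 16  # stop splitting once a node holds at most K points

def build_tree(indices, data):
    if len(indices) <= K:
        return {"leaf": indices}                      # leaf node: a small bucket of points
    # Pick two random points; the splitting hyperplane is equidistant between them.
    a, b = data[np.random.choice(indices, 2, replace=False)]
    normal = a - b                                    # hyperplane normal vector
    midpoint = (a + b) / 2
    side = (data[indices] - midpoint) @ normal > 0    # which side of the hyperplane?
    left, right = indices[side], indices[~side]
    if len(left) == 0 or len(right) == 0:             # degenerate split, just stop
        return {"leaf": indices}
    return {"normal": normal, "midpoint": midpoint,
            "left": build_tree(left, data),
            "right": build_tree(right, data)}

data = np.random.rand(10_000, 64)
tree = build_tree(np.arange(len(data)), data)
```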
Now you have a tree that you can simply traverse to find a group of candidate neighbors for your query point, getting your results in logarithmic time. You can make it even better by relying not on a single tree but on a whole forest of trees, each built with different random splits. Just keep in mind that these trees are stored in memory, which can get heavy for big datasets, and the query cost increases because you now have to traverse multiple trees per query.
And here is what making your queries looks like after you have built your forest:
Now you can simply run your distance function on the few candidate points that the tree traversal led you to, which makes the system scalable while still providing good results.
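Putting it all together, here is a short usage sketch with the Annoy library itself (the index size, dimensionality, and number of trees are arbitrary choices for illustration):

```python
import random
from annoy import AnnoyIndex

d = 100                                   # vector dimensionality
index = AnnoyIndex(d, "angular")          # "angular" is Annoy's cosine-style distance

for i in range(100_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(d)])

index.build(50)                           # 50 trees: more trees, better recall, more memory
index.save("items.ann")                   # the index can be saved and memory-mapped later

query = [random.gauss(0, 1) for _ in range(d)]
neighbor_ids = index.get_nns_by_vector(query, 10)  # 10 approximate nearest neighbors
```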
Source of the amazing images used in this article: Nearest neighbors and vector models – part 2 – algorithms and data structures, by Erik Bernhardsson.