Digging deep into Similarity Search



Let's talk about the underlying algorithms that power the similarity searches of vector databases.

With such a vast amount of data, sometimes in the billions, how are we able to filter it down to just a few results within a matter of seconds?


Before proceeding, I assume you already have an idea of what vector databases are and how these vectors are created.


KNN: The Inception

KNN, or the K-Nearest Neighbor algorithm, was the starting point for similarity search: it finds the 'k' points on a plane that are closest to a given vector.


The algorithm can find the nearest vectors using any of these three similarity functions (a small code sketch follows the list):

  • Cosine similarity
  • Dot product
  • Euclidean distance
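
To make this concrete, here is a minimal sketch of the three functions using NumPy (the function names are just for illustration):

    import numpy as np

    def cosine_similarity(a, b):
        # 1.0 means the vectors point in exactly the same direction
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def dot_product(a, b):
        # larger value = more similar (also sensitive to vector magnitude)
        return np.dot(a, b)

    def euclidean_distance(a, b):
        # smaller value = more similar
        return np.linalg.norm(a - b)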


Starting from the query point in the vector plane, the algorithm computes the distance to every stored vector, and the k vector points closest to the query vector are returned.

K-Nearest Neighbor
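
A brute-force KNN search can be sketched in a few lines; this is only an illustrative example using Euclidean distance over random vectors:

    import numpy as np

    def knn_search(vectors, query, k=5):
        # vectors: (N, dim) array of stored embeddings, query: (dim,) array
        distances = np.linalg.norm(vectors - query, axis=1)  # distance to every vector
        nearest = np.argsort(distances)[:k]                  # indices of the k closest
        return nearest, distances[nearest]

    # Example: 1,000 random 128-dimensional vectors
    vectors = np.random.rand(1000, 128)
    query = np.random.rand(128)
    ids, dists = knn_search(vectors, query, k=5)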


This works fine, but only as long as the number of vectors stays limited.


As the number of vectors grows into the millions or billions, the amount of computation required to fetch the results becomes massive (and for now, we don't have easy access to supercomputers; maybe in a few years, who knows).

And this is why KNN is not able to scale after a certain point.


ANN: To The Rescue?

To overcome this, the concept of ANN, or Approximate Nearest Neighbor, was introduced. It gives a huge bump in performance but trades away some precision in the results.


ANN algorithms are classified into three categories: trees, hashes, and graphs. HNSW, or Hierarchical Navigable Small Worlds, fits in the graph category.


HNSW combines two methods to find good results faster; these are


NSW (Navigable Small World) Graphs

NSW is based on the idea of six degrees of separation in our actual world: it is theorized that each of us is at most six connections apart, which means that if you pick two strangers on a street, you will have to hop at most six times through their connection lists to discover a link between the two individuals.

The same concept is applied to vectors as well: even when the number of vectors goes above a million, we can still create an NSW graph interlinking the vector points.


The theory of six degrees of separation
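
Here is a minimal sketch of how a greedy search could hop through such a graph; the adjacency-list representation and entry point are hypothetical, not how a production index stores its data:

    import numpy as np

    def greedy_nsw_search(graph, vectors, entry_point, query):
        # graph: {node_id: [neighbor ids]}, vectors: {node_id: np.array embedding}
        current = entry_point
        current_dist = np.linalg.norm(vectors[current] - query)
        while True:
            # move to the neighbor closest to the query, if it improves the distance
            best, best_dist = current, current_dist
            for neighbor in graph[current]:
                d = np.linalg.norm(vectors[neighbor] - query)
                if d < best_dist:
                    best, best_dist = neighbor, d
            if best == current:          # no neighbor is closer: local minimum reached
                return current, current_dist
            current, current_dist = best, best_dist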



Skip Lists

Skip Lists are a randomized data structure that helps in searching for an element in a sorted list in O(log n) expected time, which is faster than the O(n) of scanning a plain linked list.

This data structure is built using a probabilistic approach, where the concept of flipping a coin comes into play.

It starts with a linked list at the bottom layer; then, for each node, a coin flip decides whether the node should also be promoted to the next layer up.


Skip List


For instance, at Node 2 (in the above diagram), we flip a coin and get Heads, so the node moves up to the next layer, which would be Layer 1.

For the next node, say Node 27, the coin flip serves us Tails, so the node won't move to Layer 1.


This process goes on until only a small number of nodes remain at the top layer, thus forming a data structure built for faster lookups.
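
The coin-flip idea can be sketched as a tiny helper that decides how many layers a node is promoted to (purely illustrative, with a hypothetical cap on the number of layers):

    import random

    def pick_level(max_level=4):
        # keep flipping a fair coin; each Heads promotes the node one layer up
        level = 0
        while level < max_level and random.random() < 0.5:
            level += 1
        return level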


When it comes to finding an element in a skip list, the algorithm below is used:

  • If k = key, done!
  • If k < next key, go down a level
  • If k ≥ next key, go right


The path for finding Node 71 is depicted below.

The search starts from the top layer, so we can skip over a good number of irrelevant nodes.

As shown below, we can skip almost half of the list in one go and find our node with the value 71.

Finding Node with value 71
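
A minimal sketch of this search, assuming each layer is stored as a sorted list (the example keys and layer contents are made up for illustration):

    from bisect import bisect_right

    def skiplist_search(layers, k):
        # layers: ordered top (sparsest) to bottom (full list), each a sorted list of keys
        # Rules: k = next key -> done; k < next key -> go down; k >= next key -> go right
        current = float('-inf')                # conceptually the head node, before every key
        for layer in layers:
            i = bisect_right(layer, current)   # re-enter this layer just to the right of where we stand
            while i < len(layer) and layer[i] <= k:   # go right while we don't overshoot k
                current = layer[i]
                if current == k:
                    return True                # found it
                i += 1
            # next key is greater than k (or layer exhausted): drop down a layer
        return False                           # k is not present

    # Hypothetical layers of a small skip list; searching for 71
    layers = [
        [2, 50],                              # top layer
        [2, 27, 50, 71],                      # middle layer
        [2, 11, 27, 34, 50, 61, 71, 89],      # bottom layer with every key
    ]
    print(skiplist_search(layers, 71))        # True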


HNSW (Hierarchical Navigable Small World)

HNSW is built by first placing the vectors created from embeddings into a small world and making them navigable, thus forming a well-connected, linked-list-like structure.

In the next step, the structure is built into multiple layers, forming a hierarchy of vector planes, as pictured below.


HNSW structure


A query on the vector embeddings can now get its k-nearest neighbors with much lower latency, as the algorithm has a defined path to fetch the results instead of going at it with brute force. This significantly reduces the resources needed to run a single query; the trade-off is that it returns an approximate result, which might not always match what the exact KNN approach would give.
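
In practice, you rarely implement HNSW yourself; libraries expose it directly. Here is a minimal sketch using the open-source hnswlib package, with parameter values chosen purely for illustration:

    import numpy as np
    import hnswlib

    dim, num_elements = 128, 100_000
    data = np.random.rand(num_elements, dim).astype('float32')

    # Build the HNSW index (M = links per node, ef_construction = build-time search width)
    index = hnswlib.Index(space='cosine', dim=dim)
    index.init_index(max_elements=num_elements, M=16, ef_construction=200)
    index.add_items(data, np.arange(num_elements))

    # Query: approximate k-nearest neighbors for one vector
    index.set_ef(50)                      # query-time search width (should be >= k)
    labels, distances = index.knn_query(data[0], k=5)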


If you have made it to this point, I hope you were able to get something insightful out of this article.


Cheers!
