Digging deep into Similarity Search



Let's talk about the underlying algorithms that power the similarity searches of vector databases.

With such a vast amount of data, sometimes in the billions, how are we able to filter it down to just a few results within a matter of seconds?


Before proceeding, I assume you already have an idea of what vector databases are and how these vectors are created.


KNN: The Inception

KNN, or the K-Nearest Neighbor algorithm, was the starting point for similarity search: it finds the 'k' points on a plane that are closest to a given vector.


The algorithm can find the nearest vectors using any of these three similarity functions (a small code sketch follows the list):

  • Cosine similarity
  • Dot product
  • Euclidean distance
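
To make this concrete, here is a minimal sketch of the three functions using NumPy (the function names are just for illustration):

    import numpy as np

    def cosine_similarity(a, b):
        # 1.0 means the vectors point in exactly the same direction
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def dot_product(a, b):
        # larger value = more similar (also sensitive to vector magnitude)
        return np.dot(a, b)

    def euclidean_distance(a, b):
        # smaller value = more similar
        return np.linalg.norm(a - b)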


Starting from the query point in the vector plane, the algorithm computes the distance to every stored vector, and the k vector points closest to the query vector are returned.

K-Nearest Neighbor
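
A brute-force KNN search can be sketched in a few lines; this is only an illustrative example using Euclidean distance over random vectors:

    import numpy as np

    def knn_search(vectors, query, k=5):
        # vectors: (N, dim) array of stored embeddings, query: (dim,) array
        distances = np.linalg.norm(vectors - query, axis=1)  # distance to every vector
        nearest = np.argsort(distances)[:k]                  # indices of the k closest
        return nearest, distances[nearest]

    # Example: 1,000 random 128-dimensional vectors
    vectors = np.random.rand(1000, 128)
    query = np.random.rand(128)
    ids, dists = knn_search(vectors, query, k=5)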


This works fine, but only as long as the number of vectors stays limited.


As the number of vectors grows into the millions or billions, the amount of computation required to fetch the results becomes massive (and for now, we don't have easy access to supercomputers; maybe in a few years, who knows).

And this is why KNN is not able to scale after a certain point.


ANN: To The Rescue?

To overcome this, the concept of ANN, or Approximate Nearest Neighbor, was introduced. It gives a huge bump in performance but trades away some precision in the results.


ANN algorithms are classified into three categories: trees, hashes, and graphs. HNSW, or Hierarchical Navigable Small Worlds, fits in the graph category.


HNSW combines two methods to find good results faster; these are


NSW (Navigable Small World) Graphs

NSW is based on the idea of six degrees of separation in our actual world: it is theorized that each of us is at most six connections apart, which means that if you pick two strangers on a street, you will have to hop at most six times through their connection lists to discover a link between the two individuals.

The same concept is applied to vectors as well: even when the number of vectors goes above a million, we can still create an NSW graph interlinking the vector points.


The theory of six degrees of separation
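
Here is a minimal sketch of how a greedy search could hop through such a graph; the adjacency-list representation and entry point are hypothetical, not how a production index stores its data:

    import numpy as np

    def greedy_nsw_search(graph, vectors, entry_point, query):
        # graph: {node_id: [neighbor ids]}, vectors: {node_id: np.array embedding}
        current = entry_point
        current_dist = np.linalg.norm(vectors[current] - query)
        while True:
            # move to the neighbor closest to the query, if it improves the distance
            best, best_dist = current, current_dist
            for neighbor in graph[current]:
                d = np.linalg.norm(vectors[neighbor] - query)
                if d < best_dist:
                    best, best_dist = neighbor, d
            if best == current:          # no neighbor is closer: local minimum reached
                return current, current_dist
            current, current_dist = best, best_dist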



Skip Lists

Skip Lists are a randomized data structure that helps in searching for an element in a sorted list in O(log n) expected time, which is faster than the O(n) of scanning a plain linked list.

This data structure is built using a probabilistic approach, where the concept of flipping a coin comes into play.

It starts with a linked list at the bottom layer; then, for each node, a coin flip decides whether the node should also be promoted to the next layer up.


Skip List


For instance, at Node 2 (in the above diagram), we flip a coin and get Heads, so the node moves up to the next layer, which would be Layer 1.

For the next node, say Node 27, the coin flip serves us Tails, so the node won't move to Layer 1.


This process goes on until only a small number of nodes remain at the top layer, thus forming a data structure built for faster lookups.
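
The coin-flip idea can be sketched as a tiny helper that decides how many layers a node is promoted to (purely illustrative, with a hypothetical cap on the number of layers):

    import random

    def pick_level(max_level=4):
        # keep flipping a fair coin; each Heads promotes the node one layer up
        level = 0
        while level < max_level and random.random() < 0.5:
            level += 1
        return level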


When it comes to finding an element in a skip list, the algorithm below is used:

  • If k = key, done!
  • If k < next key, go down a level
  • If k ≥ next key, go right


The path for finding Node 71 is depicted below.

The search starts from the top layer, so we can skip over a good number of irrelevant nodes.

As shown below, we can skip almost half of the list in one go and find our node with the value 71.

Finding Node with value 71
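
A minimal sketch of this search, assuming each layer is stored as a sorted list (the example keys and layer contents are made up for illustration):

    from bisect import bisect_right

    def skiplist_search(layers, k):
        # layers: ordered top (sparsest) to bottom (full list), each a sorted list of keys
        # Rules: k = next key -> done; k < next key -> go down; k >= next key -> go right
        current = float('-inf')                # conceptually the head node, before every key
        for layer in layers:
            i = bisect_right(layer, current)   # re-enter this layer just to the right of where we stand
            while i < len(layer) and layer[i] <= k:   # go right while we don't overshoot k
                current = layer[i]
                if current == k:
                    return True                # found it
                i += 1
            # next key is greater than k (or layer exhausted): drop down a layer
        return False                           # k is not present

    # Hypothetical layers of a small skip list; searching for 71
    layers = [
        [2, 50],                              # top layer
        [2, 27, 50, 71],                      # middle layer
        [2, 11, 27, 34, 50, 61, 71, 89],      # bottom layer with every key
    ]
    print(skiplist_search(layers, 71))        # True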


HNSW (Hierarchical Navigable Small World)

HNSW is built by first placing the vectors created from embeddings into a small world and making them navigable, thus forming a well-connected, linked-list-like structure.

In the next step, the structure is built into multiple layers, forming a hierarchy of vector planes, as pictured below.


HNSW structure


A query on the vector embeddings can now get its k-nearest neighbors with much lower latency, as the algorithm has a defined path to fetch the results instead of going at it with brute force. This significantly reduces the resources needed to run a single query; the trade-off is that it returns an approximate result, which might not always match what the exact KNN approach would give.
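
In practice, you rarely implement HNSW yourself; libraries expose it directly. Here is a minimal sketch using the open-source hnswlib package, with parameter values chosen purely for illustration:

    import numpy as np
    import hnswlib

    dim, num_elements = 128, 100_000
    data = np.random.rand(num_elements, dim).astype('float32')

    # Build the HNSW index (M = links per node, ef_construction = build-time search width)
    index = hnswlib.Index(space='cosine', dim=dim)
    index.init_index(max_elements=num_elements, M=16, ef_construction=200)
    index.add_items(data, np.arange(num_elements))

    # Query: approximate k-nearest neighbors for one vector
    index.set_ef(50)                      # query-time search width (should be >= k)
    labels, distances = index.knn_query(data[0], k=5)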


If you have made it to this point, I hope you were able to get something insightful out of this article.


Cheers!
