Digging deep into Similarity Search
Let's talk about how the underlying algorithms of similarity search actually work.
With data sometimes running into billions of vectors, how are we able to filter down to just a few results in a matter of seconds?
Before proceeding, I'll assume you already have an idea of what vector databases are and how these vectors are created.
KNN: The Inception
KNN, or the K-Nearest Neighbor algorithm, is where it all started.
The algorithm can find the nearest vectors using any of the three common similarity functions: Euclidean distance, cosine similarity, and dot product.
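As a quick illustration, here is what these functions typically look like in NumPy (a minimal sketch; the function names are my own):

```python
import numpy as np

def euclidean_distance(a, b):
    # Smaller value means the vectors are closer (more similar)
    return np.linalg.norm(a - b)

def dot_product(a, b):
    # Larger value means more similar; sensitive to vector magnitude
    return np.dot(a, b)

def cosine_similarity(a, b):
    # Larger value means more similar; compares direction only, ignoring magnitude
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])
print(euclidean_distance(a, b), dot_product(a, b), cosine_similarity(a, b))
```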
Given a query point in the vector space, the algorithm computes the distance from the query to every other vector and returns the k vector points closest to it.
This makes sense, but only while the number of vectors stays limited.
As the number of vectors grows into the millions or billions, the amount of calculation required to fetch results becomes massive (and for now, we don't have easy access to supercomputers; maybe in a few years, who knows).
And this is why KNN is not able to scale after a certain point.
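For a sense of why this breaks down, here is what brute-force KNN boils down to, a minimal NumPy sketch (names and sizes are illustrative): every query touches every stored vector.

```python
import numpy as np

def knn_search(vectors, query, k=5):
    # Compute the Euclidean distance from the query to every stored vector ...
    distances = np.linalg.norm(vectors - query, axis=1)
    # ... and return the indices of the k closest ones
    return np.argsort(distances)[:k]

# Toy usage: 10,000 random 128-dimensional vectors
vectors = np.random.rand(10_000, 128).astype(np.float32)
query = np.random.rand(128).astype(np.float32)
print(knn_search(vectors, query, k=3))
```

At 10,000 vectors this is instant; at a billion, a single query means a billion distance computations.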
ANN: To The Rescue?
To overcome this, the concept of ANN, or Approximate Nearest Neighbor, comes into play: instead of guaranteeing the exact nearest vectors, it trades a small amount of accuracy for a large gain in speed.
ANN algorithms are classified into three categories: trees, hashes, and graphs. HNSW, or Hierarchical Navigable Small World, fits in the graph category.
HNSW combines two methods to find good results faster; these are
NSW (Navigable Small World) Graphs
NSW is based on the small-world idea from our actual social world, where it is theorized that each of us is at most six degrees apart: pick two strangers on a street, and you will have to hop at most six times through their connection lists to discover a link between the two individuals.
The same concept applies to vectors as well: even when the number of vectors goes above a million, we can still create an NSW graph interlinking the vector points so that any point can be reached from any other in a few hops.
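To make the "navigable" part concrete, here is a minimal greedy-routing sketch over such a graph. It assumes the graph is a plain dict mapping each vector index to its neighbor indices; the function name is mine, and real NSW/HNSW implementations use a more elaborate beam-style search:

```python
import numpy as np

def greedy_search(graph, vectors, query, entry_point):
    # From an entry node, keep moving to whichever neighbor is closer to the
    # query, and stop when no neighbor improves the distance.
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current, current_dist

# Toy usage: 5 points on a line, each linked to its immediate neighbors
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, vectors, np.array([3.2]), entry_point=0))
```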
Skip Lists
A skip list is a randomized data structure that helps search for an element in a sorted list in O(log n) expected time, which is faster than the O(n) of a plain linked list.
This data structure is built using a probabilistic approach, where the concept of flip-a-coin comes into play.
It starts with a linked list at the bottom layer; then, for each node, a coin flip decides whether the node should also be promoted to the next layer up.
For instance, at Node 2 (in the above diagram), we flip a coin and get Heads, so the node moves up to the next layer, Layer 1. For the next node, say Node 27, the coin flip gives us Tails, so that node won't move to Layer 1.
This process goes on until only a small number of nodes remain at the top layer, forming a data structure that supports faster lookups.
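That coin-flip promotion fits in a few lines (a sketch; p = 0.5 models a fair coin, and the function name is mine):

```python
import random

def random_level(max_level=4, p=0.5):
    # Each Heads (probability p) promotes the node one layer higher;
    # the first Tails stops the promotion.
    level = 0
    while random.random() < p and level < max_level:
        level += 1
    return level

print([random_level() for _ in range(10)])
```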
When it comes to finding an element in a skip list, the algorithm below is used.
The path for finding Node 71 is depicted below.
The search starts from the top layer, so we can skip a good number of undesirable nodes.
As shown below, we can skip almost half of the list in one go and find our node with a value of 71.
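Putting the coin-flip construction and the top-down search together, a minimal skip list could look like the sketch below (class and method names are mine; this is illustrative, not a production implementation):

```python
import random

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * (level + 1)  # one forward pointer per layer

class SkipList:
    def __init__(self, max_level=4, p=0.5):
        self.max_level = max_level
        self.p = p
        self.head = Node(float("-inf"), max_level)  # sentinel spanning all layers

    def random_level(self):
        # Coin-flip promotion, same idea as the sketch above
        level = 0
        while random.random() < self.p and level < self.max_level:
            level += 1
        return level

    def insert(self, value):
        # For every layer, find the last node that sits before the new value
        update = [self.head] * (self.max_level + 1)
        node = self.head
        for layer in range(self.max_level, -1, -1):
            while node.forward[layer] and node.forward[layer].value < value:
                node = node.forward[layer]
            update[layer] = node
        # Splice the new node into every layer it was promoted to
        level = self.random_level()
        new_node = Node(value, level)
        for layer in range(level + 1):
            new_node.forward[layer] = update[layer].forward[layer]
            update[layer].forward[layer] = new_node

    def search(self, value):
        # Start at the top layer; move right while still below the target,
        # then drop down a layer and repeat
        node = self.head
        for layer in range(self.max_level, -1, -1):
            while node.forward[layer] and node.forward[layer].value < value:
                node = node.forward[layer]
        node = node.forward[0]
        return node is not None and node.value == value

# Toy usage mirroring the walk-through above
sl = SkipList()
for v in [2, 13, 27, 40, 55, 71, 88]:
    sl.insert(v)
print(sl.search(71))  # True
print(sl.search(60))  # False
```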
HNSW (Hierarchical Navigable Small World)
HNSW is built by first placing the vectors created from embeddings into a small world and making them navigable, forming a well-connected, linked-list-like structure.
In the next step, this structure is split into multiple layers, skip-list style, forming a hierarchy of vector planes, as pictured below.
A query on the vector embeddings can now fetch the k nearest neighbors with much lower latency, because the algorithm has a defined path to follow instead of brute-forcing through every vector. This significantly reduces the resources needed to run a single query; the trade-off is that the result is approximate and may not always match what the exact KNN approach would return.
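In practice you rarely hand-roll HNSW; libraries such as hnswlib implement it directly. Here is a rough sketch of how it might be used, with illustrative (untuned) parameter values:

```python
import hnswlib
import numpy as np

dim = 128
num_elements = 100_000

# Random embeddings standing in for real ones
data = np.float32(np.random.random((num_elements, dim)))

# Build the HNSW index: M controls graph connectivity,
# ef_construction controls the search width while building
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data)

# ef controls the query-time accuracy/speed trade-off
index.set_ef(50)

# Approximate k-nearest-neighbor query
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)
```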
If you have made it to this point, I hope you were able to get something insightful out of this article.
Cheers!