Vector DB Indexing Internals for RAG Applications

Since my last article, I was split on what to write. One option was Google's newly announced Willow chip with 105 qubits, which has sparked speculation about connections to the multiverse. However, I decided to focus on something more practical: AI applications using RAG.

A critical component of RAG is Vector Databases (Vector DBs), which are the centerpiece for enabling efficient similarity search and retrieval. Within Vector DBs, there is a diverse range of indexing algorithms, each targeted at specific use cases and performance requirements. In this article, I will dive deeper into how these indexing methods are designed and how they optimize search efficiency.

Imagine a company facing a lawsuit for failing to deliver on a contract. It needs to provide documents, communications, logs, and other evidence to counter the claim. Luckily, its backup software indexed data during backups. This includes emails, files, and logs, making them searchable. The software links related items, like emails to their attachments and logs. A query like “Contract Project Jarvis” could return communications, approvals and modifications showing delays caused by external factors.

There are multiple use cases the company may have:

1. It could want to retrieve all the data related to the project using a brute-force method. (Flat Index / Brute Force)

2. It could want to look at the data cluster by cluster, e.g. Legal, Pre-Sales, Sales, Technical. (Cluster / IVF)

3. It could want to search for documents that have not only a direct relation but also a context-based connection to the search query. (HNSW)

While I’m not specifically addressing how this data is stored in a Vector DB, if you’ve read my articles on the framework of Transformers, you’ll know that features or tokens are represented as embeddings. Similarly, in Vector DBs, all features are represented as embeddings. These embeddings, like in Transformer models, capture the semantic meaning of the data. The Vector DB itself handles the creation of these embeddings through its built-in functions.

E.g.

# A minimal sketch assuming the sentence-transformers library; any embedding
# model that exposes an encode() method works the same way. The model name
# below is only an example.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Contract negotiation with Client X.",
    "Delivery issues reported on Project Jarvis.",
    "Payment details for invoice #12345."
]

# Generate embeddings: one fixed-size vector per document
embeddings = model.encode(documents)

For now, let’s not worry about what these embeddings look like. They come directly from the learned weights of the LLM during its training, capturing the semantic meaning of the data. In simple words, a document (broken into smaller chunks) is tokenized and passed through the model’s layers, where the learnt weights are applied. The result is a fixed-size vector embedding, which is then inserted into the vector database for efficient storage and retrieval. After the embeddings are added, the Vector DB builds an index over them for fast retrieval.
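For instance, continuing the snippet above, you can inspect the resulting fixed-size vectors directly (the exact dimensionality depends on the embedding model, so the numbers below are only indicative):

print(embeddings.shape)      # e.g. (3, 384): three documents, 384 dimensions each
print(embeddings[0][:5])     # first five values of the first document's embedding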

Before diving into how these indexes are set up, let’s first understand two foundational metrics—Euclidean Distance and Cosine Similarity. Both play a critical role in determining how similar embeddings are and influence the retrieval process.

Euclidean Distance: Euclidean Distance measures the straight-line distance between two points in a multidimensional space. You can picture it in two dimensions as two points in the X-Y plane.

If we take A(X1, Y1) and B(X2, Y2), then the Euclidean distance can be written as:

d(A, B) = √((X2 − X1)² + (Y2 − Y1)²)

For an N-dimensional space, this generalizes to:

d(A, B) = √(Σᵢ (Aᵢ − Bᵢ)²), where the sum runs over the N dimensions i = 1 … N.

Why are the terms subtracted, squared, and then square-rooted? Squaring ensures that every term contributes a positive value regardless of direction, so differences along different axes cannot cancel each other out during summation. Since squaring distorts the scale, we take the square root at the end to return to the original units, giving us the true distance. There are variations of the Euclidean distance, but they can be discussed as and when they arise.

Cosine Similarity: Imagine a 3D space where every feature sits at a coordinate. If you draw vectors from the origin (0, 0, 0) to two such points, their similarity can be expressed as a function of the angle between those vectors.

Since these two are vectors, they have both direction and magnitude. In vector form, the dot product of A and B can be written as: A · B = ∥A∥ ∥B∥ cos θ

Cosine similarity = (A · B) / (∥A∥ ∥B∥), where ∥A∥ and ∥B∥ are the magnitudes of the vectors and A · B is their dot product.

E.g., the short sketch below computes both metrics for two small vectors A and B (NumPy is used here purely for illustration):
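import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 3.0, 4.0])

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((A - B) ** 2))                            # ≈ 1.732

# Cosine similarity: dot product divided by the product of the magnitudes
cosine = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))      # ≈ 0.993

A cosine similarity close to 1 means the vectors point in nearly the same direction, while a smaller Euclidean distance means the points lie closer together in space.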

Now that we understand the basic metrics, let’s explore the indexing methods used in Vector DBs.

Flat Index (Brute Force)

We’ll start with the simplest—Flat Index (Brute Force)—where data is stored linearly without any hierarchical relationships. You can imagine it as shown below, where the green dots represent the stored embeddings.

Searching through this data means going through all the points one by one to find the nearest matches for the query embedding. As shown in the graph, every point is checked sequentially, which ensures accuracy but can be slow for large datasets.

Flat Index compares every point to the query embedding using Cosine similarity or Euclidean distance. It ensures accuracy but takes O(N) time, making it slow for large datasets.
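To make this concrete, here is what a flat index could look like in code. FAISS is used as an example library (my assumption; the article does not tie itself to a particular Vector DB), reusing the model and embeddings from the earlier snippet:

import faiss
import numpy as np

dim = embeddings.shape[1]                  # dimensionality of the document embeddings
index = faiss.IndexFlatL2(dim)             # exact, brute-force index using Euclidean (L2) distance
index.add(np.asarray(embeddings, dtype="float32"))

query = model.encode(["Contract Project Jarvis"])
distances, ids = index.search(np.asarray(query, dtype="float32"), k=2)   # 2 nearest documents

Every stored vector is compared against the query, which is exactly the O(N) behavior described above.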

IVF (Inverted File Index)

Geometrically, you can imagine this as certain centroid points in space around which the embeddings (data points) are clustered. These centroids could represent categories such as sales, pre-sales, technical, and legal. Each embedding, which is a numerical representation of the data, is assigned to the nearest centroid by calculating either cosine similarity or Euclidean distance. Based on these distances, embeddings are stored under their respective centroids, allowing efficient searches by limiting comparisons to only the embeddings within the relevant cluster.

This process of organizing embeddings into clusters is called k-means clustering, a method where data points are grouped around centroids based on their proximity. IVF leverages this extensively to create a structure that facilitates faster searches by narrowing down to relevant clusters.

You can visualize how the centroids are created and how the embeddings cluster around them in the following figures.

The search works by first comparing the query embedding (vector) to the centroids in the index using a distance metric such as cosine similarity or Euclidean distance. Based on these comparisons, the closest clusters (as determined by their centroids) are selected for further search. This approach significantly reduces the number of comparisons compared to brute force. However, if a relevant data point sits in a cluster whose centroid is not among the closest to the query, it will be missed; this mis-mapping is the accuracy trade-off of searching only a subset of clusters.
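Continuing the FAISS illustration (again, the library choice is an assumption), an IVF index could be set up as follows. Synthetic vectors stand in for real document embeddings so the snippet runs on its own:

import faiss
import numpy as np

dim = 384                                           # e.g. the output size of the embedding model
xb = np.random.rand(10_000, dim).astype("float32")  # stand-in for real document embeddings

nlist = 100                                         # number of clusters (centroids)
quantizer = faiss.IndexFlatL2(dim)                  # used to compare the query against the centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(xb)                                     # k-means clustering: learns the centroid positions
index.add(xb)                                       # each embedding is stored under its nearest centroid

index.nprobe = 5                                    # how many of the closest clusters to scan at query time
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, k=3)

Raising nprobe scans more clusters and reduces the chance of missing results, at the cost of more comparisons.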

HNSW (Hierarchical Navigable Small World)

In HNSW, the embeddings are referred to as nodes. As in the previous cases, each node uses cosine similarity or Euclidean distance to figure out its nearest neighbors.

All the nodes / embeddings are inserted into the index starting from the lowest layer. Suppose there are N nodes and a maximum of M possible connections per node. At the bottom layer, all N nodes are present, and each node forms up to M connections with its nearest neighbors.

Nodes are assigned a random value r between 0 and 1. A threshold P(l) = p^l, where p = 1/e, determines if a node moves to the next layer. Nodes with r ≤ P(l) are promoted, ensuring fewer nodes in higher layers. Each promoted node forms up to M connections with its nearest neighbors in the new layer. If M connections cannot be made, the node forms as many as it can, fewer than M.

This process repeats for each subsequent layer, with the expected number of nodes shrinking geometrically (N(l) ≈ N · p^l), ensuring fewer nodes in higher layers and overall sparsity.
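Here is a tiny sketch of that layer-assignment rule (my own illustration of the probabilities described above, not code from any particular library):

import math
import random
from collections import Counter

def assign_max_layer(p: float = 1 / math.e, max_layers: int = 16) -> int:
    """Return the highest layer a node reaches, following the threshold P(l) = p^l."""
    r = random.random()                     # random value r between 0 and 1
    layer = 0
    while layer + 1 < max_layers and r <= p ** (layer + 1):
        layer += 1                          # promoted: the node also appears one layer higher
    return layer

# Count how many of 100,000 nodes reach each layer:
# higher layers end up with exponentially fewer nodes, as expected.
print(Counter(assign_max_layer() for _ in range(100_000)))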

When adding a new embedding to the graph, efConstruction dictates how many neighbors to evaluate at each layer. This ensures the new embedding forms better connections, improving graph quality.

How does the search work?

As in IVF and flat search, the query is first converted into an embedding using a relevant embedding model, and that embedding is then passed to the Vector DB APIs for retrieval. The search begins at the top layer, starting from a single entry point, which is chosen randomly or pre-selected. A fixed number of nearest neighbors are identified using cosine similarity / Euclidean distance. A priority queue is maintained to track the closest nodes, ensuring efficient traversal. These neighbors’ connections are then explored in the next lower layer. This process continues layer by layer until the bottom layer, where a detailed search retrieves the most relevant embeddings.
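Tying this back to code, building and querying an HNSW index could look like the sketch below. FAISS is again an assumed example library, and its efConstruction / efSearch parameters map onto the ideas described above:

import faiss
import numpy as np

dim = 384
xb = np.random.rand(10_000, dim).astype("float32")   # stand-in for real document embeddings

M = 16                                     # maximum connections per node on each layer
index = faiss.IndexHNSWFlat(dim, M)
index.hnsw.efConstruction = 200            # candidate neighbors evaluated per layer while inserting
index.add(xb)                              # the multi-layer graph is built as vectors are inserted

index.hnsw.efSearch = 50                   # size of the candidate (priority) queue during search
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, k=3)  # greedy descent from the top layer down to the bottom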

HNSW can miss embeddings if connections are poorly constructed, especially on the higher layers. It is memory-intensive due to its multi-layer structure, but it is best suited to large datasets where the efficiency justifies the memory use.

Each indexing method has its benefits and challenges. A Flat index guarantees exact results at O(N) cost (ideal for small datasets), while IVF (large datasets) and HNSW (massive datasets) offer faster searches but might miss some embeddings due to approximation. The choice depends on whether consistency, accuracy, or speed is your priority.

Which indexing method do you think fits your use case best? Let me know in the comments and stay tuned for future articles on advanced indexing techniques like PQ and LSH.

