Vector DB Indexing Internals for RAG Applications

Since my last article, I was split on what to write. One option was Google's newly announced Willow chip with 105 qubits, which has sparked speculation about connections to the multiverse. However, I decided to focus on something more practical: AI applications using RAG.

A critical component of RAG is Vector Databases (Vector DBs), which are the centerpiece for enabling efficient similarity search and retrieval. Within Vector DBs, there is a diverse range of indexing algorithms, each targeted at specific use cases and performance requirements. In this article, I will dive deeper into how these indexing methods are designed and how they optimize search efficiency.

Imagine a company facing a lawsuit for failing to deliver on a contract. It needs to provide documents, communications, logs, and other evidence to counter the claim. Luckily, its backup software indexed data during backups. This includes emails, files, and logs, making them searchable. The software links related items, like emails to their attachments and logs. A query like “Contract Project Jarvis” could return communications, approvals and modifications showing delays caused by external factors.

There are multiple use cases the company may have:

1. It could want to retrieve all the data related to the project using a brute-force method. (Flat Index / Brute Force)

2. It could want to look at the data cluster by cluster, e.g. Legal, Pre-Sales, Sales, Technical. (Cluster / IVF)

3. It could want to search for documents that have not only a direct relation but also a context-based connection to the search query. (HNSW)

While I’m not specifically addressing how this data is stored in a Vector DB, if you’ve read my articles on the framework of Transformers, you’ll know that features or tokens are represented as embeddings. Similarly, in Vector DBs, all features are represented as embeddings. These embeddings, like in Transformer models, capture the semantic meaning of the data. The Vector DB itself handles the creation of these embeddings through its built-in functions.

E.g.

# A minimal sketch assuming the sentence-transformers library; any embedding
# model that exposes an encode() method works the same way. The model name
# below is only an example.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Contract negotiation with Client X.",
    "Delivery issues reported on Project Jarvis.",
    "Payment details for invoice #12345."
]

# Generate embeddings: one fixed-size vector per document
embeddings = model.encode(documents)

For now, let’s not worry about what these embeddings look like. They come directly from the learned weights of the LLM during its training, capturing the semantic meaning of the data. In simple words, a document (broken into smaller chunks) is tokenized and passed through the model’s layers, where the learnt weights are applied. The result is a fixed-size vector embedding, which is then inserted into the vector database for efficient storage and retrieval. After the embeddings are added, the Vector DB builds an index over them for fast retrieval.
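For instance, continuing the snippet above, you can inspect the resulting fixed-size vectors directly (the exact dimensionality depends on the embedding model, so the numbers below are only indicative):

print(embeddings.shape)      # e.g. (3, 384): three documents, 384 dimensions each
print(embeddings[0][:5])     # first five values of the first document's embedding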

Before diving into how these indexes are set up, let’s first understand two foundational metrics—Euclidean Distance and Cosine Similarity. Both play a critical role in determining how similar embeddings are and influence the retrieval process.

Euclidean Distance: Euclidean Distance measures the straight-line distance between two points in a multidimensional space. You can picture it in two dimensions as two points in the X-Y plane.

If we take A(X1, Y1) and B(X2, Y2), then the Euclidean distance can be written as:

d(A, B) = √((X2 − X1)² + (Y2 − Y1)²)

For an N-dimensional space, this generalizes to:

d(A, B) = √(Σᵢ (Aᵢ − Bᵢ)²), where the sum runs over the N dimensions i = 1 … N.

Why are the terms subtracted, squared, and then square-rooted? Squaring ensures that every term contributes a positive value regardless of direction, so differences along different axes cannot cancel each other out during summation. Since squaring distorts the scale, we take the square root at the end to return to the original units, giving us the true distance. There are variations of the Euclidean distance, but they can be discussed as and when they arise.

Cosine Similarity: Imagine a 3D space where every feature sits at a coordinate. If you draw vectors from the origin (0, 0, 0) to two such points, their similarity can be expressed as a function of the angle between those vectors.

Since these two are vectors, they have both direction and magnitude. In vector form, the dot product of A and B can be written as: A · B = ∥A∥ ∥B∥ cos θ

Cosine similarity = (A · B) / (∥A∥ ∥B∥), where ∥A∥ and ∥B∥ are the magnitudes of the vectors and A · B is their dot product.

E.g., the short sketch below computes both metrics for two small vectors A and B (NumPy is used here purely for illustration):
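import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 3.0, 4.0])

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((A - B) ** 2))                            # ≈ 1.732

# Cosine similarity: dot product divided by the product of the magnitudes
cosine = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))      # ≈ 0.993

A cosine similarity close to 1 means the vectors point in nearly the same direction, while a smaller Euclidean distance means the points lie closer together in space.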

Now that we understand the basic metrics, let’s explore the indexing methods used in Vector DBs.

Flat Index (Brute Force)

We’ll start with the simplest—Flat Index (Brute Force)—where data is stored linearly without any hierarchical relationships. You can imagine it as shown below, where the green dots represent the stored embeddings.

Searching through this data means going through all the points one by one to find the nearest matches for the query embedding. As shown in the graph, every point is checked sequentially, which ensures accuracy but can be slow for large datasets.

Flat Index compares every point to the query embedding using Cosine similarity or Euclidean distance. It ensures accuracy but takes O(N) time, making it slow for large datasets.
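To make this concrete, here is what a flat index could look like in code. FAISS is used as an example library (my assumption; the article does not tie itself to a particular Vector DB), reusing the model and embeddings from the earlier snippet:

import faiss
import numpy as np

dim = embeddings.shape[1]                  # dimensionality of the document embeddings
index = faiss.IndexFlatL2(dim)             # exact, brute-force index using Euclidean (L2) distance
index.add(np.asarray(embeddings, dtype="float32"))

query = model.encode(["Contract Project Jarvis"])
distances, ids = index.search(np.asarray(query, dtype="float32"), k=2)   # 2 nearest documents

Every stored vector is compared against the query, which is exactly the O(N) behavior described above.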

IVF (Inverted File Index)

Geometrically, you can imagine this as certain centroid points in space around which the embeddings (data points) are clustered. These centroids could represent categories such as sales, pre-sales, technical, and legal. Each embedding, which is a numerical representation of the data, is assigned to the nearest centroid by calculating either cosine similarity or Euclidean distance. Based on these distances, embeddings are stored under their respective centroids, allowing efficient searches by limiting comparisons to only the embeddings within the relevant cluster.

This process of organizing embeddings into clusters is called k-means clustering, a method where data points are grouped around centroids based on their proximity. IVF leverages this extensively to create a structure that facilitates faster searches by narrowing down to relevant clusters.

You can visualize how the centroids are created and how the embeddings cluster around them in the following figures.

The search works by first comparing the query embedding (vector) to the centroids in the index using a distance metric such as cosine similarity or Euclidean distance. Based on these comparisons, the closest clusters (as determined by their centroids) are selected for further search. This approach significantly reduces the number of comparisons compared to brute force. However, if a relevant data point sits in a cluster whose centroid is not among the closest to the query, it will be missed; this mis-mapping is the accuracy trade-off of searching only a subset of clusters.
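Continuing the FAISS illustration (again, the library choice is an assumption), an IVF index could be set up as follows. Synthetic vectors stand in for real document embeddings so the snippet runs on its own:

import faiss
import numpy as np

dim = 384                                           # e.g. the output size of the embedding model
xb = np.random.rand(10_000, dim).astype("float32")  # stand-in for real document embeddings

nlist = 100                                         # number of clusters (centroids)
quantizer = faiss.IndexFlatL2(dim)                  # used to compare the query against the centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(xb)                                     # k-means clustering: learns the centroid positions
index.add(xb)                                       # each embedding is stored under its nearest centroid

index.nprobe = 5                                    # how many of the closest clusters to scan at query time
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, k=3)

Raising nprobe scans more clusters and reduces the chance of missing results, at the cost of more comparisons.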

HNSW (Hierarchical Navigable Small World)

In HNSW, the embeddings are referred to as nodes. As in the previous cases, each node uses cosine similarity or Euclidean distance to figure out its nearest neighbors.

All the nodes / embeddings are inserted into the index starting from the lowest layer. Suppose there are N nodes and a maximum of M possible connections per node. At the bottom layer, all N nodes are present, and each node forms up to M connections with its nearest neighbors.

Nodes are assigned a random value r between 0 and 1. A threshold P(l) = p^l, where p = 1/e, determines if a node moves to the next layer. Nodes with r ≤ P(l) are promoted, ensuring fewer nodes in higher layers. Each promoted node forms up to M connections with its nearest neighbors in the new layer. If M connections cannot be made, the node forms as many as it can, fewer than M.

This process repeats for each subsequent layer, with the expected number of nodes shrinking geometrically (N(l) ≈ N · p^l), ensuring fewer nodes in higher layers and overall sparsity.
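Here is a tiny sketch of that layer-assignment rule (my own illustration of the probabilities described above, not code from any particular library):

import math
import random
from collections import Counter

def assign_max_layer(p: float = 1 / math.e, max_layers: int = 16) -> int:
    """Return the highest layer a node reaches, following the threshold P(l) = p^l."""
    r = random.random()                     # random value r between 0 and 1
    layer = 0
    while layer + 1 < max_layers and r <= p ** (layer + 1):
        layer += 1                          # promoted: the node also appears one layer higher
    return layer

# Count how many of 100,000 nodes reach each layer:
# higher layers end up with exponentially fewer nodes, as expected.
print(Counter(assign_max_layer() for _ in range(100_000)))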

When adding a new embedding to the graph, efConstruction dictates how many neighbors to evaluate at each layer. This ensures the new embedding forms better connections, improving graph quality.

How does the search work?

As in IVF and flat search, the query is first converted into an embedding using a relevant embedding model, and that embedding is then passed to the Vector DB APIs for retrieval. The search begins at the top layer, starting from a single entry point, which is chosen randomly or pre-selected. A fixed number of nearest neighbors are identified using cosine similarity / Euclidean distance. A priority queue is maintained to track the closest nodes, ensuring efficient traversal. These neighbors’ connections are then explored in the next lower layer. This process continues layer by layer until the bottom layer, where a detailed search retrieves the most relevant embeddings.
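Tying this back to code, building and querying an HNSW index could look like the sketch below. FAISS is again an assumed example library, and its efConstruction / efSearch parameters map onto the ideas described above:

import faiss
import numpy as np

dim = 384
xb = np.random.rand(10_000, dim).astype("float32")   # stand-in for real document embeddings

M = 16                                     # maximum connections per node on each layer
index = faiss.IndexHNSWFlat(dim, M)
index.hnsw.efConstruction = 200            # candidate neighbors evaluated per layer while inserting
index.add(xb)                              # the multi-layer graph is built as vectors are inserted

index.hnsw.efSearch = 50                   # size of the candidate (priority) queue during search
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, k=3)  # greedy descent from the top layer down to the bottom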

HNSW can miss embeddings if connections are poorly constructed, especially on the higher layers. It is memory-intensive due to its multi-layer structure, but it is best suited to large datasets where the efficiency justifies the memory use.

Each indexing method has its benefits and challenges. A Flat index guarantees exact results at O(N) cost (ideal for small datasets), while IVF (large datasets) and HNSW (massive datasets) offer faster searches but might miss some embeddings due to approximation. The choice depends on whether consistency, accuracy, or speed is your priority.

Which indexing method do you think fits your use case best? Let me know in the comments and stay tuned for future articles on advanced indexing techniques like PQ and LSH.

