Similarity search is not a silver bullet!

It is very easy to think that all you need to build a good RAG pipeline is to chunk your document using one of the splitters offered by LangChain, pass the chunks to an embedding model, and hook them up to a vector store that you can query at inference time to create the context for an LLM. I have spoken a bit about the importance of chunking and how hard it is to get it right for complex document types like PDFs. Let’s talk about similarity search today.
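
To make this concrete, here is a minimal sketch of that naive retrieval path, assuming the sentence-transformers and numpy packages; the model name, chunk sizes, and helper names (chunk, dense_search) are illustrative stand-ins rather than recommendations:

```python
# A minimal sketch of the "naive" RAG retrieval path described above.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking (a stand-in for a proper splitter)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

docs = ["...your first document...", "...your second document..."]
chunks = [c for d in docs for c in chunk(d)]
chunk_emb = model.encode(chunks, normalize_embeddings=True)   # (n_chunks, dim)

def dense_search(query: str, k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    q = model.encode([query], normalize_embeddings=True)      # (1, dim)
    scores = (chunk_emb @ q.T).squeeze(-1)                    # cosine similarity per chunk
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```

This is the baseline the rest of this post tries to improve on: everything downstream of model.encode only ever sees one vector per chunk.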

Relying only on similarity search could be a silent failure mode of your RAG pipeline. Let’s understand why:

  1. Embedding models are encoder-style models whose last layer outputs a vector for every input token. These token-level embeddings need to be pooled to obtain a single sentence-level embedding, which causes substantial information loss. Embedding is essentially a form of lossy compression (see the pooling sketch after this list).
  2. These models learn to prioritise the parts of a document that were needed to answer the queries in their training data. However, the queries your system needs to handle might require the model to focus on very different parts of the document. For similarity search, we typically embed the documents and queries separately, so the compression of each document happens without any knowledge of the query it will eventually be matched against. This removes any opportunity to let the query inform the compression.
  3. These models are trained with a fixed, often outdated, vocabulary. So, they cannot accurately represent a word that has become common only recently (e.g. the obscure name of a newly released LLM).
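
Point 1 is easiest to see in code. Below is a small sketch of mean pooling, assuming the transformers and torch packages; the model name is illustrative:

```python
# Token-level embeddings get mean-pooled into a single vector: lossy compression.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

text = "A long passage containing many distinct facts..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state    # (1, seq_len, dim)

# Mean pooling: seq_len vectors collapse into one, no matter how long or
# information-dense the passage is. This is the lossy compression step.
mask = inputs["attention_mask"].unsqueeze(-1).float()         # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)   # (1, dim)
```

However long or fact-dense the passage is, everything downstream only ever sees that single pooled vector.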

What can we do?

Combine similarity search with a keyword-search algorithm like BM25 or TF-IDF (a minimal sketch follows the list below). Why?

  • We, as humans, love using keywords. We are strongly inclined to notice and use certain acronyms and domain-specific jargon that may not appear in the training data of these models.
  • BM25 is still a strong baseline that many SOTA models struggle to beat.
  • These algorithms give an essentially free performance boost, as they add negligible compute overhead or cost during inference.
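
Here is the keyword-search sketch referenced above, assuming the rank-bm25 package; the tiny corpus and naive whitespace tokenisation are purely illustrative:

```python
# BM25 keyword search over the same kind of chunks used for dense retrieval.
from rank_bm25 import BM25Okapi

chunks = [
    "ColBERTv2 uses late interaction over token embeddings.",
    "BM25 is a classic keyword-based ranking function.",
    "Embedding models compress a passage into a single vector.",
]
tokenized_chunks = [c.lower().split() for c in chunks]  # naive tokenisation
bm25 = BM25Okapi(tokenized_chunks)

def keyword_search(query: str, k: int = 2) -> list[str]:
    """Return the k chunks with the highest BM25 score for the query."""
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(chunks)), key=lambda i: -scores[i])[:k]
    return [chunks[i] for i in top]

# Rare acronyms and jargon match exactly here, even if an embedding model's
# training data never contained them.
print(keyword_search("colbertv2 late interaction"))
```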

Use a reranker (e.g. a cross-encoder or ColBERT) on the combined outputs of similarity search and keyword search (a scoring sketch follows the list below):

  • Cross-encoders take both the raw query and document as input to predict a similarity score between them. This ensures that the full information of the query and the document is used, instead of just their compressed representations.

  • ColBERT is a family of models that uses “late interaction” to compute the similarity between a given query and document. An encoder is still applied independently to both the query and the document to generate token-level embeddings for each. However, instead of pooling the token-level embeddings into a single sentence-level embedding, each token embedding in the query is compared with every token embedding in the document to find the maximum similarity for that query token. This is repeated for every query token, and finally the per-token maxima are summed to obtain a score for the query-document pair.

  • Because the query and the document interact directly, the ranking produced by a reranker is more reliable.
  • However, it is not practical to run a reranker over every query-document pair in your corpus because of the compute overhead.
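
Here is the scoring sketch referenced above: the ColBERT-style MaxSim rule written in plain torch over token-level embeddings. A real ColBERT checkpoint is trained specifically for this interaction, so treat this as an illustration of the scoring rule rather than a drop-in reranker:

```python
# ColBERT-style "late interaction" (MaxSim) scoring.
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> float:
    """query_tokens: (n_q, dim), doc_tokens: (n_d, dim) token-level embeddings."""
    q = F.normalize(query_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                                  # (n_q, n_d): every query token vs every doc token
    per_query_token_max = sim.max(dim=1).values    # best-matching doc token for each query token
    return per_query_token_max.sum().item()        # sum over query tokens -> query-document score
```

A cross-encoder is even simpler to use in practice; the end-to-end sketch further below calls one.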

Bringing it all together:

  1. Given a query, use keyword search + similarity search over your documents to identify a smaller set of potentially relevant documents. In this step, optimise for recall: ensure that all the relevant documents are present in the output, even if it also contains several irrelevant ones.
  2. Run a reranker over every query-document pair in this smaller set to rank the most relevant documents at the top, and return the top few as the context for the LLM (see the end-to-end sketch after this list).
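
Here is the end-to-end sketch referenced above. It reuses the hypothetical keyword_search and dense_search helpers from the earlier sketches (assuming both index the same set of chunks) and assumes the sentence-transformers package; the cross-encoder model name is just one commonly used example:

```python
# Two-stage retrieval: high-recall candidate generation, then precise reranking.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative choice

def retrieve(query: str, k_candidates: int = 20, k_final: int = 4) -> list[str]:
    # Stage 1: recall-oriented retrieval (union of keyword and dense results).
    candidates = list(dict.fromkeys(
        keyword_search(query, k_candidates) + dense_search(query, k_candidates)
    ))
    # Stage 2: precision-oriented reranking, run only on the small candidate set.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:k_final]]
```

Note that the expensive reranker only ever sees the few dozen candidates from stage 1, never the whole corpus.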

A note of caution: for every component you add or remove, and for any other change you make to your retrieval pipeline, evaluate its impact on your own evaluation dataset, not on some generic benchmark irrelevant to the task you care about. Use retrieval-specific metrics like Precision@k, NDCG@k, Reciprocal Rank, or any other metric suited to your specific task (small sketches of these metrics follow below).
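
For reference, here are minimal sketches of those metrics for a single query with binary relevance judgements; in practice you would average them over your whole evaluation set:

```python
# Retrieval metrics for one query: retrieved is an ordered list of doc ids,
# relevant is the set of doc ids judged relevant for that query.
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / i          # rank of the first relevant document
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```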

I have combined my experience with my notes from the “Mastering LLMs” course (https://parlance-labs.com/education/) in this post, specifically the lectures “Back to Basics for RAG” and “Beyond the Basics of RAG”. Two of the diagrams were taken from their slides too. I would highly recommend checking it out!

Nishan Jain

ML Lead | Machine Learning | Generative AI | Data Science | Deep Learning

3 months ago

While all of these are important considerations in RAG systems, we often focus on model choices but overlook security. It's crucial to integrate access control (like RBAC or ABAC), pre-retrieval filtering, and query-time security checks to ensure only authorized users can access sensitive documents. This ensures compliance and protects against unauthorized access.
