Retrieval Techniques

Information retrieval (IR) is the process of finding relevant information in large collections of unstructured data, such as text documents, websites, or databases. The goal is to match a user’s query with the documents that best satisfy the search intent. Think of sparse and dense retrieval as the backbone of search engines, helping us find relevant articles, books, or any other form of content.


Sparse Retrievers: Classic Keyword-Based Search

Sparse retrievers are based on keyword matching. They represent text by counting how often each word appears in a document. Since only a small portion of the total vocabulary is used in any given document, most of the counts are zero, which is why it’s called a sparse representation.

How They Work:

Sparse retrievers rely on inverted indexes, which map each term to the documents that contain it. This makes retrieval very fast: given a query, the system looks up the query terms in the index to find all matching documents, then ranks them by how often those terms appear.
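
To make this concrete, here is a minimal inverted-index sketch in Python; the toy corpus, whitespace tokenization, and rank-by-match-count scoring are illustrative assumptions rather than any particular engine's implementation:

```python
from collections import Counter, defaultdict

# Toy corpus: document id -> text (illustrative data)
docs = {
    0: "data science job openings in new york",
    1: "best pizza in new york",
    2: "machine learning careers and data science",
}

# Build the inverted index: each word maps to the set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    """Return documents matching any query term, ranked by how many terms match."""
    hits = Counter()
    for term in query.split():
        for doc_id in index.get(term, set()):
            hits[doc_id] += 1
    return [(doc_id, docs[doc_id]) for doc_id, _ in hits.most_common()]

print(search("data science job"))
# Document 0 matches all three terms, so it outranks document 2 (two terms).
```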

However, not all words are equally important. Some words, like "the" or "is," appear in almost every document and don't help in distinguishing between relevant and irrelevant results. To solve this, sparse retrieval methods use Term Frequency-Inverse Document Frequency (TF-IDF).

Term Frequency-Inverse Document Frequency (TF-IDF):

TF-IDF weights words based on how often they appear in a document compared to how often they appear across all documents. Words that are common across many documents (like “and” or “the”) are given less weight, while words that are more unique to certain documents are given more importance.

  • Query: "data science job"
  • Document: one that contains "data" and "science" frequently but mentions "job" only a few times; TF-IDF adjusts each term's contribution to the relevance score based on how common that term is across all documents.
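
To see TF-IDF weighting in code, here is a minimal sketch using scikit-learn's TfidfVectorizer on an illustrative toy corpus; a real system would also tune tokenization, stop words, and normalization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative toy corpus
docs = [
    "data science job openings in new york",
    "best pizza in new york",
    "machine learning careers and data science",
]

# Learn TF-IDF weights from the corpus; each document becomes a sparse vector
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Encode the query with the same vocabulary and rank documents by similarity
query_vector = vectorizer.transform(["data science job"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```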

Key Features of Sparse Retrievers:

  • Exact Keyword Matching: The system retrieves documents based on exact query terms.
  • Efficient Retrieval: Thanks to inverted indexes, sparse retrievers can quickly find documents matching the query terms.
  • Human-Readable: It’s easy to understand why a document was retrieved, as it directly depends on the words used.

Example - BM25:

Imagine you search for “best pizza in New York.” A sparse retriever like BM25 will look for documents containing these exact words and score them based on how frequently the words appear in each document, normalized by document length and weighted by term importance (a word like "best" that is common across all documents counts for less).
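
Here is a minimal BM25 sketch, assuming the rank_bm25 package (BM25Okapi) and a toy, whitespace-tokenized corpus:

```python
from rank_bm25 import BM25Okapi

# Illustrative toy corpus, tokenized by simple whitespace splitting
corpus = [
    "best pizza in new york",
    "new york travel guide",
    "the best pasta recipes",
]
tokenized_corpus = [doc.split() for doc in corpus]

# BM25 scores each document by query-term frequency, damped by document
# length, with rare terms weighted more heavily than common ones
bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores("best pizza in new york".split())

for doc, score in sorted(zip(corpus, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```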

Dense Retrievers: Understanding the Meaning Behind Words

Dense retrievers take a more modern approach by focusing on the meaning behind words rather than just matching exact terms. They use neural network-based embeddings to represent both queries and documents as dense vectors—continuous representations that capture the underlying semantics of the text.

How They Work:

Dense retrievers encode both the query and the documents into dense vectors (low-dimensional vectors of real numbers). These vectors are placed in a shared semantic space, where the distance between vectors represents how similar they are in meaning. As a result, dense retrievers can match documents based on semantic similarity, even when the exact query words aren’t present.

  • Query: "data science job"
  • Document: Even if the document doesn't contain the exact phrase "data science job," a dense retriever can identify that a document about "machine learning careers" is relevant.
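
As a minimal dense-retrieval sketch, the sentence-transformers library makes this pattern easy to try; the all-MiniLM-L6-v2 checkpoint and the toy documents below are illustrative choices:

```python
from sentence_transformers import SentenceTransformer, util

# A small, widely used embedding model (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "machine learning careers and hiring trends",
    "best pizza in new york",
]

# Encode the documents and the query into the same dense vector space
doc_embeddings = model.encode(docs, convert_to_tensor=True)
query_embedding = model.encode("data science job", convert_to_tensor=True)

# Cosine similarity measures semantic closeness, not keyword overlap:
# the careers document scores higher despite sharing no query words
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

In practice, document embeddings are typically precomputed and stored in a vector index (FAISS, for example) so that only the query needs to be encoded at search time.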

Key Features of Dense Retrievers:

  • Semantic Matching: Rather than matching exact keywords, dense retrievers match based on the meaning of the query.
  • Pre-trained Neural Networks: Dense retrievers often rely on models like BERT or DPR that have been pre-trained on large datasets.
  • More Computationally Intensive: Dense retrieval requires more computational resources, including GPUs, to handle the heavy lifting of neural network computations.

Example - Dense Passage Retrieval (DPR):

If you search for “top places to eat in New York,” a dense retriever can find documents that mention “best dining experiences in NYC,” even if the words “places” or “eat” aren’t in the document. This is because dense retrievers understand that “eat” and “dining” are semantically related.
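
For a flavor of DPR in code, here is a sketch using the Hugging Face transformers implementation, assuming the public facebook/dpr-* checkpoints; batching, indexing, and error handling are omitted:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# DPR uses two separate BERT-based encoders: one for questions, one for passages
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = [
    "Best dining experiences in NYC, from steakhouses to ramen bars.",
    "A guide to commuter rail schedules in New Jersey.",
]

with torch.no_grad():
    q_emb = q_enc(**q_tok("top places to eat in New York", return_tensors="pt")).pooler_output
    p_emb = c_enc(**c_tok(passages, return_tensors="pt", padding=True, truncation=True)).pooler_output

# DPR ranks passages by the dot product between question and passage embeddings
scores = (q_emb @ p_emb.T).squeeze(0)
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.1f}  {passage}")
```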

When to Use Each?

Sparse Retrievers (BM25, TF-IDF):

  1. Best for exact keyword matching.
  2. Useful when computational resources are limited.
  3. Works well for simple, keyword-based queries.

Dense Retrievers (DPR, BERT):

  1. Ideal for semantic search, where meaning matters more than exact keywords.
  2. Suitable for applications where queries are longer or more complex.
  3. Great for handling synonyms and paraphrases.

Conclusion:

In NLP, sparse retrievers are fast and efficient for keyword-based searches, making them ideal for applications where exact matches are crucial. On the other hand, dense retrievers are more powerful for semantic understanding, making them better suited for applications like question answering, conversational AI, or semantic search.

For the best of both worlds, many modern systems combine sparse and dense retrieval techniques in a hybrid approach, where sparse retrievers quickly narrow down relevant documents and dense retrievers provide a deeper, meaning-based ranking.
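
As one illustrative fusion strategy, here is a sketch of reciprocal rank fusion (RRF), which merges the ranked lists produced by a sparse and a dense retriever; the two hypothetical rankings and the constant k=60 are assumptions:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into a single ranking.

    A document's fused score is the sum of 1 / (k + rank) over every list
    it appears in; k=60 is a common default that damps the influence of
    any single list's top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from a sparse (BM25) and a dense retriever
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))
# doc_b and doc_a appear in both lists, so they rise to the top
```

A nice property of RRF is that it uses only ranks, not raw scores, which sidesteps the fact that BM25 scores and cosine similarities live on different numeric scales.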


