Qwen: The Truth About Embeddings for RAG Hype

Let’s cut through the noise and get real. The widespread use of dense embeddings for text similarity search is driven by a combination of hype, convenience, and economic incentives, not because they are inherently "good" or "the best solution." Here's the unfiltered truth:


1. Dense Embeddings Are Overhyped

The tech industry thrives on buzzwords, and dense embeddings generated by transformer language models (BERT, Sentence-BERT, etc.) have become the poster child for "AI-powered search." Companies and researchers claim they provide "semantic similarity," but this is often oversold. In reality:

They don’t truly capture semantics: Dense embeddings approximate semantic relationships based on statistical patterns in training data. This means they often fail to distinguish between nuanced meanings, especially in specialized domains or with polysemous words [3].

They’re just learned heuristics: These embeddings are trained on specific objectives (e.g., masked language modeling) and optimized for general-purpose corpora. They don’t have any theoretical foundation in linguistics or cognitive science. They work well enough in practice, but calling them "semantic" is misleading.

2. Dense Embeddings Fail in Edge Cases

While dense embeddings perform decently on average, they break down in scenarios that require precision or domain-specific knowledge:

Out-of-domain performance is poor: Pre-trained models like BERT are trained on general-purpose data. When applied to specialized domains (e.g., legal, medical), they often fail because the embeddings don’t align with the vocabulary or context of those fields [10].

Long documents are problematic: Dense embeddings compress information into fixed-length vectors, which works fine for short texts but loses critical details when representing long documents [10].

Ambiguity is unresolved: Words with multiple meanings (e.g., "bank" as a financial institution vs. "riverbank") can confuse the model, leading to inaccurate similarity scores [9]. The sketch below shows how easy it is to probe this yourself.
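
A minimal sketch of such a probe, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (both illustrative choices, not recommendations); the exact scores will vary by model:

```python
# Probe whether a sentence embedding model separates the two senses of "bank".
# Assumes: pip install sentence-transformers (model choice is illustrative).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "I need to open an account at the bank."
candidates = [
    "The credit union offers savings accounts.",   # financial sense
    "We sat on the bank and watched the river.",   # river sense
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

scores = util.cos_sim(query_emb, cand_embs)[0]
for text, score in zip(candidates, scores):
    print(f"{float(score):.3f}  {text}")
# If the river sentence scores close to (or above) the financial one,
# the model is leaning on the surface word "bank" rather than its sense.
```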

3. They’re Hard to Debug

Dense embeddings are essentially black boxes. Unlike sparse representations (e.g., TF-IDF, BM25), which explicitly map terms to weights, dense embeddings operate in high-dimensional spaces where it’s nearly impossible to interpret why two texts are deemed similar. This lack of transparency makes debugging and fine-tuning extremely difficult [10]. If something goes wrong, you’re stuck guessing.
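
To see the interpretability gap concretely, here is a minimal sketch (assuming scikit-learn; the corpus and query are toy examples) showing why a sparse score can be explained term by term while a dense cosine score cannot:

```python
# Sparse scores decompose into per-term contributions; dense ones do not.
# Assumes: pip install scikit-learn (documents and query are toy examples).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the patient was prescribed insulin for diabetes",
    "the bank approved the mortgage application",
]
query = "insulin dosage for diabetes"

vec = TfidfVectorizer()
doc_vecs = vec.fit_transform(docs)
query_vec = vec.transform([query])
terms = vec.get_feature_names_out()

# Per-term contribution to the query-document dot product: fully inspectable.
contributions = query_vec.multiply(doc_vecs[0]).toarray()[0]
for term, weight in zip(terms, contributions):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
# With a 768-dimensional dense embedding there is no such breakdown:
# "dimension 417" has no human-readable meaning.
```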

4. Overfitting and Generalization Issues

Dense embeddings, particularly in overparameterized models, are prone to overfitting. This means they may perform well on the training data but fail to generalize to new, unseen data [5]. For example:

If you train a model on a specific dataset, it might learn spurious correlations that don’t hold up in real-world applications.

Fine-tuning embeddings for a specific task can help, but it requires significant effort and expertise, which many organizations lack.
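
If you do go the fine-tuning route, the mechanics are not exotic; the hard part is curating good in-domain pairs. A minimal sketch, assuming the classic sentence-transformers fit() API and a couple of placeholder (query, relevant passage) pairs standing in for real labeled data:

```python
# Fine-tune a pre-trained sentence encoder on in-domain (query, passage) pairs.
# Assumes: sentence-transformers with the classic model.fit() API; the pairs
# below are placeholders for your own labeled data.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["what is the statute of limitations for fraud",
                        "Fraud claims must generally be filed within a set number of years."]),
    InputExample(texts=["side effects of metformin",
                        "Common adverse reactions to metformin include nausea and diarrhea."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("my-domain-encoder")
```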

5. Scalability Trade-offs

While dense embeddings are computationally efficient compared to some older methods, they still come with scalability challenges:

Indexing and querying large datasets is expensive: Even with approximate nearest neighbor (ANN) algorithms, searching through millions or billions of dense vectors is computationally intensive [6] (see the FAISS-style sketch after this list).

Dynamic updates are costly: Adding new documents or updating existing ones requires re-computing embeddings, which can be slow and resource-intensive.
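
Here is roughly what the indexing side looks like at scale: a minimal sketch assuming FAISS (faiss-cpu), with random vectors standing in for real embeddings. The point is the knobs (nlist, nprobe) that trade recall for speed, and the fact that the clustering pass has to be redone as the corpus drifts:

```python
# Approximate nearest neighbor search over dense vectors with FAISS.
# Assumes: pip install faiss-cpu numpy; random vectors stand in for embeddings.
import numpy as np
import faiss

d = 384                                  # embedding dimension
n = 100_000                              # corpus size
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
faiss.normalize_L2(corpus)               # cosine similarity via inner product

nlist = 256                              # number of coarse clusters
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(corpus)                      # clustering pass: redo it if the data drifts
index.add(corpus)

index.nprobe = 16                        # more probes = better recall, slower queries
query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)
print(ids[0], scores[0])
```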

6. Why Do People Still Use Them?

Here’s the harsh truth: dense embeddings are used because they’re easy, not because they’re the best. Here’s why:

Ease of implementation: Libraries like Hugging Face Transformers and Sentence-BERT make it trivial to generate embeddings (a minimal example follows this list). Many developers and companies adopt them without critically evaluating their limitations.

Good enough for most use cases: For general-purpose tasks (e.g., social media posts, news articles), dense embeddings provide acceptable results. Most users won’t notice or care about their shortcomings.

Lack of alternatives: Until recently, dense embeddings were one of the few scalable solutions for semantic search. Sparse methods like BM25 or TF-IDF are better for exact matches but lack semantic richness [3].

Economic incentives: Companies selling AI-powered search tools often hype dense embeddings because they sound cutting-edge and justify higher prices. The truth is, many of these systems could achieve similar results with simpler methods.
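
The "easy" part is real; this is essentially the whole pitch. A minimal sketch, assuming sentence-transformers: a few lines of glue and you have "semantic search", which is exactly why it gets adopted without much scrutiny.

```python
# The canonical "embed -> cosine search" loop that makes dense retrieval so easy
# to adopt. Assumes: pip install sentence-transformers; the corpus is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "How to reset your password",
    "Troubleshooting login failures",
    "Pricing plans and billing cycles",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("I can't sign in to my account", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```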

7. The Real Reason People Claim Good Results

People claim good results because:

They don’t rigorously evaluate performance: Many organizations don’t have the expertise or resources to thoroughly test their systems. As long as the system works "well enough," they move on. A basic recall@k check, like the sketch after this list, is often all it would take.

Benchmark datasets are flawed: Many benchmarks used to evaluate embeddings (e.g., STS, GLUE) are biased toward general-purpose tasks. They don’t reflect the challenges of real-world applications, especially in specialized domains [7].

Confirmation bias: If a system appears to work, people assume it’s good. They don’t dig deeper to uncover its limitations.
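
Rigorous evaluation does not have to be heavyweight. A minimal sketch of a recall@k check over a labeled query set; the retrieval function and relevance labels below are placeholders for your own system and data:

```python
# Recall@k over a labeled query set: the minimum bar for "we evaluated it".
# Assumes: `retrieve(query, k)` is your own retrieval function returning doc ids;
# the relevance labels below are placeholders.
from typing import Callable, Dict, List, Set

def recall_at_k(
    retrieve: Callable[[str, int], List[str]],
    relevant: Dict[str, Set[str]],   # query -> set of relevant doc ids
    k: int = 10,
) -> float:
    hits = 0
    for query, gold in relevant.items():
        retrieved = set(retrieve(query, k))
        if retrieved & gold:          # at least one relevant doc in the top k
            hits += 1
    return hits / len(relevant)

# Example with a stub retriever (swap in your embedding / BM25 / hybrid system).
relevant = {"how do I reset my password": {"doc_42"}}
stub = lambda q, k: ["doc_7", "doc_42", "doc_13"][:k]
print(f"recall@10 = {recall_at_k(stub, relevant, k=10):.2f}")
```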

8. Alternatives Exist, But They’re Ignored

There are better alternatives for many tasks, but they’re often overlooked because they’re less flashy or harder to implement:

Sparse representations: Methods like BM25 or TF-IDF are more interpretable and perform better for keyword-based retrieval [3].

Hybrid models: Combining sparse and dense signals, or using learned-sparse and late-interaction models like SPLADE and ColBERT, leverages the strengths of both approaches [6] (a simple score-fusion sketch follows this list).

Custom embeddings: Training embeddings tailored to specific tasks or domains can yield better results, but it requires more effort than using pre-trained models.
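
Hybrid retrieval can be embarrassingly simple. A minimal sketch of reciprocal rank fusion over BM25 and dense rankings, assuming the rank_bm25 and sentence-transformers packages; ColBERT and SPLADE are more sophisticated, but even this naive fusion often beats either signal alone:

```python
# Naive hybrid retrieval: fuse BM25 and dense rankings with reciprocal rank fusion.
# Assumes: pip install rank_bm25 sentence-transformers; the corpus is illustrative.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "How to reset your password",
    "Troubleshooting login failures",
    "Pricing plans and billing cycles",
]
query = "I can't sign in to my account"

# Sparse ranking (exact term matching).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_scores = bm25.get_scores(query.lower().split())
sparse_rank = sorted(range(len(corpus)), key=lambda i: -sparse_scores[i])

# Dense ranking (semantic matching).
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = util.cos_sim(model.encode(query), model.encode(corpus))[0]
dense_rank = sorted(range(len(corpus)), key=lambda i: -float(dense_scores[i]))

# Reciprocal rank fusion: documents ranked highly by either signal float to the top.
K = 60  # standard RRF damping constant
fused = {
    i: 1 / (K + sparse_rank.index(i) + 1) + 1 / (K + dense_rank.index(i) + 1)
    for i in range(len(corpus))
}
for i in sorted(fused, key=fused.get, reverse=True):
    print(f"{fused[i]:.4f}  {corpus[i]}")
```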

Conclusion

The widespread use of dense embeddings for text similarity search is driven by hype, convenience, and economic incentives, not because they’re inherently superior. While they work well enough for general-purpose tasks, they fail in edge cases, are hard to debug, and lack a solid theoretical foundation. People claim good results because they don’t rigorously evaluate performance or understand the limitations of these methods.

If you want robust, accurate, and interpretable results, dense embeddings alone are often a bad idea. Hybrid approaches or domain-specific solutions are usually better, but they require more effort and expertise—something many organizations aren’t willing to invest in.


https://substack.com/@pascalbiese/note/c-100674343?r=5udo7

Google claims to have the best embeddings - and it might be true.

"I was right — embeddings till now were just crap!"

"How did they report good performance for LLM applications that used embeddings?"

Here are a few likely reasons from ChatGPT:


1. Cherry-picked Benchmarks or Narrow Use Cases

Many reports show strong results only on specific tasks — like semantic search or retrieval on curated datasets (e.g., FAQs, academic papers, legal docs). But real-world messier data? Much harder.

2. Heavy Preprocessing & Fine-tuning

Those “embedding-based” systems often come with tons of extra engineering — filtering, chunking, hybrid retrieval, re-ranking with LLMs, etc. It’s rarely just "embed → search → done".
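
Chunking alone shows how much engineering hides behind "just use embeddings". A minimal sketch of a fixed-size, overlapping word chunker; real pipelines add sentence-boundary handling, token-based limits, metadata, and more:

```python
# Fixed-size overlapping chunker: one of the many "extra engineering" steps
# that embedding-based pipelines quietly depend on.
from typing import List

def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word-window chunks before embedding."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

long_doc = "word " * 1000                    # placeholder for a real document
chunks = chunk_words(long_doc)
print(f"{len(chunks)} chunks, ~200 words each, 40-word overlap")
```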

3. Hybrid Systems Are Doing the Heavy Lifting

In real production pipelines, embeddings are just the first-pass filter. The actual smarts come from the LLM that follows — re-ranking, summarizing, or rewriting based on the retrieved chunks.
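
A minimal sketch of that pattern, assuming sentence-transformers and a small MS MARCO cross-encoder checkpoint (both illustrative choices): the bi-encoder embeddings only produce candidates, and a heavier model re-scores them.

```python
# First-pass dense retrieval, then cross-encoder re-ranking does the real work.
# Assumes: pip install sentence-transformers; model names are illustrative.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "How to reset your password",
    "Troubleshooting login failures",
    "Pricing plans and billing cycles",
]
query = "I can't sign in to my account"

# Stage 1: cheap bi-encoder retrieval to get candidates.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
hits = util.semantic_search(
    bi_encoder.encode(query, convert_to_tensor=True),
    bi_encoder.encode(corpus, convert_to_tensor=True),
    top_k=3,
)[0]
candidates = [corpus[h["corpus_id"]] for h in hits]

# Stage 2: expensive cross-encoder re-scores each (query, candidate) pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])
for score, text in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {text}")
```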

4. Subjective Evaluation

“Good performance” might just mean better than naive keyword search — not that it’s good in an absolute sense. Or sometimes it’s just a qualitative claim: “users said results felt more relevant.”

5. Marketing Overhype

Let’s be real — vendors and papers love to show pretty charts and buzzwords. “Vector search” sounds futuristic, so it’s often oversold.
