LLMs and RAG are Great, But Don’t Throw Away Your Inverted Index Yet

Vectors, embeddings, large language models (LLMs), and retrieval-augmented generation (RAG) represent the cutting edge of search architecture, and it is very tempting to believe we can dispense with the traditional inverted index architecture entirely. You should be excited about this brave new world, but you should also proceed with caution.

It is true that embedding-based retrieval addresses many pain points that challenge a traditional inverted index. Embeddings are less susceptible to polysemy (words having multiple meanings) and synonymy (multiple words having the same meaning). And embedding-based retrieval can be especially useful for handling long queries, particularly compared to traditional methods like query expansion and query relaxation.
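To make the synonymy point concrete, here is a toy sketch. The three-dimensional "embeddings" below are made up for illustration (real models produce hundreds of dimensions), but they show how exact token matching scores a synonym pair as zero overlap while cosine similarity in embedding space recognizes that near-synonyms land close together.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for this example.
embedding = {
    "couch": [0.90, 0.10, 0.20],
    "sofa":  [0.88, 0.12, 0.19],  # near-synonym lands close to "couch"
    "bank":  [0.10, 0.90, 0.30],
}

def token_overlap(q, d):
    # Exact-token matching: synonyms share no tokens, so they score zero.
    return len(set(q.split()) & set(d.split()))

print(token_overlap("couch", "sofa"))                 # 0: no shared tokens
print(cosine(embedding["couch"], embedding["sofa"]))  # close to 1.0
```

Token matching would need an explicit synonym dictionary to bridge that gap; the embedding gets it "for free" from training.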

These sound like great arguments in favor of embedding-based retrieval. So what is the catch? Why are most companies still using a traditional — or at least a hybrid — architecture? Here are some of the main reasons.

Embedding-based retrieval is powerful, but it gains that power at the price of explainability. Vectors from embeddings tend to be less explainable than token-based representations. While a bag of words may not be a perfect representation of content, it is at least simple and understandable. In contrast, embeddings are a black box, making it hard to understand how they affect retrieval and ranking, and even harder to debug.
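The explainability contrast fits in a few lines. With a bag of words, every dimension is a human-readable term, so we can say exactly which tokens caused a match; a dense vector offers no such account.

```python
from collections import Counter

doc = "red running shoes for trail running"
query = "running shoes"

# A bag of words is just token counts: every dimension is a readable term.
bag = Counter(doc.split())

# The match is self-explaining: these query tokens appear in the document.
matched = set(query.split()) & set(bag)
print(matched)  # {'running', 'shoes'}
```

Try explaining a cosine score of 0.83 between two 768-dimensional vectors to a merchandiser, and the difference becomes vivid.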

Embeddings also tend to be task-dependent. A single embedding model may not capture everything about a document or query. For example, in an e-commerce setting, an embedding might be more or less sensitive to variations in product type, brand, or size. Since embedding-based retrieval reduces relevance to a single similarity metric, there is a risk that a single vector representation will not address all search use cases. In contrast, token-based representations, despite being simplistic, are more flexible.
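One way to see that flexibility: token fields can be reweighted per use case, while a single similarity score collapses everything into one number. The product record, fields, and weights below are hypothetical, just to illustrate the mechanism.

```python
# Hypothetical product record with per-field token sets.
product = {"brand": {"acme"}, "type": {"running", "shoes"}, "size": {"10"}}

def field_score(query_tokens, weights):
    # Weighted sum of per-field token overlaps; weights are tuned per application.
    return sum(w * len(query_tokens & product[f]) for f, w in weights.items())

q = {"acme", "shoes"}

# Same query, different use cases: a brand-focused search vs. a type-focused one.
print(field_score(q, {"brand": 3.0, "type": 1.0, "size": 0.5}))  # 4.0
print(field_score(q, {"brand": 0.5, "type": 2.0, "size": 0.5}))  # 2.5
```

A single embedding bakes one implicit weighting into the vector space; changing it means retraining or fine-tuning the model rather than adjusting a few parameters.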

There are also computational challenges. Embeddings tend to be vectors with hundreds of densely populated dimensions. That is not necessarily a showstopper, especially if the documents they represent are large. Also, there are techniques to make the representations more compact. Still, index size matters, especially when vectors need to be kept in memory to minimize the latency of accessing them. Aside from scale concerns, exact nearest-neighbor search is not practical for most latency-sensitive applications, and even approximate nearest-neighbor (ANN) search is slower than performing simple set operations on an inverted index.
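For comparison, here is how cheap conjunctive retrieval is on an inverted index: a term-to-postings map and a set intersection. This is a deliberately minimal sketch, not a production index, but it shows why these "simple set operations" are so fast relative to nearest-neighbor search.

```python
# Minimal inverted index: term -> set of document ids.
index = {
    "running": {1, 2, 5},
    "shoes":   {2, 3, 5},
    "red":     {1, 2},
}

def retrieve(query):
    # AND semantics: intersect the postings for each query token.
    postings = [index.get(t, set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()

print(retrieve("red running shoes"))  # {2}
```

Production engines refine this with sorted postings, skip lists, and compression, but the core operation stays this simple; there is no index structure for dense vectors with a comparably cheap exact lookup.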

And then there is ranking. It is not clear how to combine the query-dependent similarity score with other ranking factors, particularly query-independent desirability factors. Ranking is never easy, but embedding-based retrieval introduces additional complexity.
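A common first attempt is a weighted blend of the two kinds of signal. The sketch below assumes both scores have been normalized to a comparable range; in practice that calibration, and the choice of weight, is exactly where the added complexity lives.

```python
def blended_score(similarity, desirability, alpha=0.7):
    # Query-dependent similarity blended with query-independent desirability
    # (e.g., popularity or quality). alpha is illustrative and needs tuning;
    # the blend is only meaningful if both inputs are on comparable scales.
    return alpha * similarity + (1 - alpha) * desirability

print(blended_score(0.9, 0.2))   # similarity-dominated result
print(blended_score(0.5, 0.95))  # desirability partly compensates
```

With token-based scoring functions like BM25, practitioners have decades of experience with such blends; embedding similarities have different distributions per model and per query, which makes the calibration problem harder.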

Finally, there is the challenge of supporting operations that depend on retrieval, including result counts, filters or facets, and explicit sorts. As we discussed in the previous section, these are hard to implement well when we lack a principled way to manage the precision-recall tradeoff.

These are serious challenges! So it is important to go into embedding-based retrieval cautiously, recognizing that, for many applications, the costs of moving to embedding-based retrieval do not justify giving up the benefits of a traditional inverted index architecture. Or at least not yet.



Rupesh Gupta

AI at LinkedIn

11 months

> Embeddings also tend to be task-dependent. A single embedding model may not capture everything about a document or query.

This is very true. We have observed this for several queries. That's why we decided to keep both keyword-based and embedding-based retrievers.

Abhimanyu Lad

Director of Engineering, LinkedIn Search

11 months

Aw shucks, we just threw away our inverted index! But seriously – I'm pleasantly surprised by how well even off-the-shelf embeddings (like e5-small) are able to match keyword-based relevance for some of our use cases. But yes, we're not doing away with our inv idx just yet, for many of the reasons you mentioned. cc anand, Rupesh, Birjodh.
