Lamborghini RAG
If you had $1 million to build a RAG system for someone how would you do it?
As companies adopt AI into more of their systems, the focus needs to be on reducing hallucinations and increasing accuracy. Since reasoning models are gaining popularity, the next wave of AI applications will have very high latency. This is a great opportunity to rethink the v1.0 AI stack. Embedding models are built for low-latency applications that can afford a loss in quality.
Agentic workflows have a higher bar for accuracy and are generally more expensive, which gives developers more flexibility when designing their AI stack. To design high-accuracy systems, I propose Lamborghini RAG: a RAG system that ensures the highest-quality output from the LLM given a large corpus of data.
The Problem with Current RAG Systems
Current RAG systems are incredibly lossy and do a poor job of finding the correct answer. Today's RAG systems are built on embedding models, which are attractive because they are optimized for latency and cost. The tradeoff is that they are incredibly lossy.
The basic RAG system today is simple: chunk and embed the corpus ahead of time, embed the user's question at query time, retrieve the most similar chunks, and pass them to an LLM to answer.
These RAG systems do a pretty decent job of answering questions over data sources, but they frequently miss important information because the question's embedding doesn't retrieve the correct source embeddings.
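For concreteness, here is a minimal sketch of that traditional pipeline, assuming an OpenAI-style embeddings and chat API with an in-memory index (the model names and prompt are illustrative assumptions, not prescriptions):

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    # One embedding vector per input string
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def traditional_rag(question, chunks, top_k=5):
    # Embed the corpus and the question, then retrieve by cosine similarity
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    sims = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-top_k:])
    # The LLM only ever sees these top_k chunks -- anything missed here is lost
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

If the nearest-neighbor step retrieves the wrong chunks, no amount of prompting downstream can recover the missing context.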
Infinite Context RAG
The goal with Lamborghini RAG is to mimic what I think the future of infinite-context-window LLMs will look like.
So how do we mimic infinite context?
Only use LLMs.
For every question the user asks, we pass every single source through an LLM. This would be exceedingly expensive, but it almost guarantees that the LLM will see the “correct” context required to fully answer any question.
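Here is a minimal sketch of that brute-force pass, assuming the same OpenAI-style chat API as above (the model choice, prompts, and NOT_RELEVANT sentinel are my own illustrative assumptions):

from openai import OpenAI

client = OpenAI()

def ask_document(question, document):
    # One full LLM read per source document -- no retrieval step at all
    resp = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": (
            f"Document:\n{document}\n\n"
            f"If this document helps answer the question, answer it: {question}\n"
            "Otherwise reply with exactly NOT_RELEVANT."
        )}],
    )
    return resp.choices[0].message.content

def lamborghini_rag(question, documents):
    # Every document is read in full, so no relevant context can be missed
    findings = [ask_document(question, d) for d in documents]
    findings = [f for f in findings if "NOT_RELEVANT" not in f]
    # Final synthesis over all per-document findings
    resp = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": (
            f"Question: {question}\n\nFindings from individual documents:\n"
            + "\n---\n".join(findings)
            + "\n\nSynthesize one final answer and cite the findings you used."
        )}],
    )
    return resp.choices[0].message.content

In practice you would batch the per-document calls and add intermediate synthesis passes (as in the cost breakdown below) so the final call fits in one context window.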
Example: Private Equity Due Diligence
Scenario: Suppose a private equity firm is considering the acquisition of a large B2B enterprise software provider. This target company has thousands of customer contracts, each with unique service-level agreements (SLAs), pricing structures, and termination clauses. Plus, they’ve got a mountain of partnership agreements and vendor contracts. Any single “poison pill” clause (like early termination provisions triggered by a change of control) could blow up the deal’s economics.
So how do you go through every single document and ensure that there are no “poison pills” hiding in them? Clearly, for this use case Lamborghini RAG will yield higher-accuracy results than Traditional RAG. Let's compare the cost and latency of both, assuming 10k documents totaling about 10M tokens.
Traditional RAG:
Document Processing with Embedding Models (10M tokens): ~$1.30
LLM Answering (200k context): ~$5
Lamborghini RAG:
Per Document LLM Answer: $1 * 10k = $10,000
Intermediate LLM Synthesis (every 10 documents): $5 * 1k = $5,000
Final LLM Answer Synthesis: $5
So Traditional RAG will cost about $6.30 and have very low latency (~30s), whereas Lamborghini RAG will cost about $15,005 and have extremely high latency (> 60min). But for a diligence workflow like this, a firm could stand to lose millions of dollars (or spend over $1 million contracting a law firm to run the due diligence).
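For concreteness, here is the arithmetic behind those totals, using the per-unit rates assumed above (~$0.13 per 1M embedding tokens, $1 per per-document answer, $5 per synthesis call):

docs = 10_000
total_tokens = 10_000_000

# Traditional RAG: embed everything once, then one LLM answer over retrieved context
traditional = (total_tokens / 1_000_000) * 0.13 + 5.00          # -> $6.30

# Lamborghini RAG: one LLM answer per document, plus layered synthesis
per_doc_answers = 1.00 * docs                                   # -> $10,000
intermediate = 5.00 * (docs // 10)    # one synthesis per 10 docs -> $5,000
final_synthesis = 5.00
lamborghini = per_doc_answers + intermediate + final_synthesis  # -> $15,005

print(f"Traditional: ${traditional:,.2f}, Lamborghini: ${lamborghini:,.2f}")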
So it’s worth asking: is your use case worth it?
Appendix:
Lamborghini RAG Pseudocode & Flow Chart
I have a hunch this is similar to what Google Deep Research does.
# Each helper below (break_down_user_question, is_relevant, extract_answer,
# identify_new_questions, synthesize_answer) is its own LLM call.

# 1. Initialize data structures
all_unanswered_questions = []
questions_answer_pairs = {}  # key: question, value: (answer, sources_used)

# 2. Break down the user's original question into multiple subquestions
subquestions = break_down_user_question(user_input)
all_unanswered_questions.extend(subquestions)

# 3. Keep working while there are still unanswered questions
while all_unanswered_questions:
    current_question = all_unanswered_questions.pop()
    for document in all_documents:
        # Skip documents that are not relevant to the current question
        if not is_relevant(document, current_question):
            continue
        # Attempt to find an answer to current_question in this document
        answer = extract_answer(document, current_question)
        if answer:
            # Store or update the answer and the sources of information
            prior_sources = questions_answer_pairs.get(current_question, (None, []))[1]
            questions_answer_pairs[current_question] = (answer, prior_sources + [document.source_name])
        else:
            # Identify new questions that arise from analyzing this document
            for question in identify_new_questions(document, current_question):
                # Queue only questions we haven't answered or already queued
                if question not in questions_answer_pairs and question not in all_unanswered_questions:
                    all_unanswered_questions.append(question)

# 4. Once no new unanswered questions appear, synthesize a final response
final_answer = synthesize_answer(questions_answer_pairs)

# 5. Return the final answer, ensuring the response cites all sources used
print(final_answer)
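As a hedged illustration of what one of those helpers might look like (the model, prompt, and NO_ANSWER sentinel are assumptions of mine, not from the flow chart):

from openai import OpenAI

client = OpenAI()

def extract_answer(document, question):
    # Returns an answer grounded in this one document, or "" if it has none
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Document ({document.source_name}):\n{document.text}\n\n"
            f"Question: {question}\n"
            "Answer using only this document, or reply with exactly NO_ANSWER."
        )}],
    )
    text = resp.choices[0].message.content
    return "" if "NO_ANSWER" in text else text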