Lamborghini RAG
If you had $1 million to build a RAG system for someone how would you do it?
As companies adopt AI into more of their systems, the focus needs to be on reducing hallucinations and increasing accuracy. Since reasoning models are gaining popularity, the next wave of AI applications will have very high latency. This is a great opportunity to rethink the v1.0 AI stack. Embedding models are built for low-latency applications that can afford a loss in quality.
Agentic workflows have a higher bar for accuracy and are generally more expensive, which gives developers more flexibility when designing their AI stack. To design high-accuracy systems, I propose Lamborghini RAG: a RAG system that ensures the highest-quality output from the LLM given a large corpus of data.
The Problem with Current RAG Systems
Current RAG systems are incredibly lossy and do a poor job of finding the correct answer. Today's RAG systems are built on embedding models, which are attractive because they are optimized for latency and cost. The tradeoff is that they are incredibly lossy.
The basic RAG system today is simple: chunk and embed the corpus ahead of time, embed the user's question at query time, retrieve the most similar chunks, and pass them to an LLM to answer.
These RAG systems do a pretty decent job of answering questions over data sources, but they frequently miss important information because the question's embedding doesn't retrieve the correct source embeddings.
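For concreteness, here is a minimal sketch of that traditional pipeline, assuming an OpenAI-style embeddings and chat API with an in-memory index (the model names and prompt are illustrative assumptions, not prescriptions):

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    # One embedding vector per input string
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def traditional_rag(question, chunks, top_k=5):
    # Embed the corpus and the question, then retrieve by cosine similarity
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    sims = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-top_k:])
    # The LLM only ever sees these top_k chunks -- anything missed here is lost
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

If the nearest-neighbor step retrieves the wrong chunks, no amount of prompting downstream can recover the missing context.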
Infinite Context RAG
The goal with Lamborghini RAG is to mimic what I think the future of infinite-context-window LLMs will look like.
So how do we mimic infinite context?
Only use LLMs.
For every question the user asks, we pass every single source through an LLM. This would be exceedingly expensive, but it almost guarantees that the LLM will see the “correct” context required to fully answer any question.
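Here is a minimal sketch of that brute-force pass, assuming the same OpenAI-style chat API as above (the model choice, prompts, and NOT_RELEVANT sentinel are my own illustrative assumptions):

from openai import OpenAI

client = OpenAI()

def ask_document(question, document):
    # One full LLM read per source document -- no retrieval step at all
    resp = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": (
            f"Document:\n{document}\n\n"
            f"If this document helps answer the question, answer it: {question}\n"
            "Otherwise reply with exactly NOT_RELEVANT."
        )}],
    )
    return resp.choices[0].message.content

def lamborghini_rag(question, documents):
    # Every document is read in full, so no relevant context can be missed
    findings = [ask_document(question, d) for d in documents]
    findings = [f for f in findings if "NOT_RELEVANT" not in f]
    # Final synthesis over all per-document findings
    resp = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": (
            f"Question: {question}\n\nFindings from individual documents:\n"
            + "\n---\n".join(findings)
            + "\n\nSynthesize one final answer and cite the findings you used."
        )}],
    )
    return resp.choices[0].message.content

In practice you would batch the per-document calls and add intermediate synthesis passes (as in the cost breakdown below) so the final call fits in one context window.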
Example: Private Equity Due Diligence
Scenario: Suppose a private equity firm is considering the acquisition of a large B2B enterprise software provider. This target company has thousands of customer contracts, each with unique service-level agreements (SLAs), pricing structures, and termination clauses. Plus, they’ve got a mountain of partnership agreements and vendor contracts. Any single “poison pill” clause (like early termination provisions triggered by a change of control) could blow up the deal’s economics.
So how do you go through every single document and ensure that there are no “poison pills” hiding in them? Clearly, for this use case Lamborghini RAG will yield higher-accuracy results than Traditional RAG. Let's compare the cost and latency of both, assuming 10k documents totaling about 10M tokens.
Traditional RAG:
Document Processing with Embedding Models (10M tokens): ~$1.30
LLM Answering (200k context): ~$5
Lamborghini RAG:
Per Document LLM Answer: $1 * 10k = $10,000
Intermediate LLM Synthesis (every 10 documents): $5 * 1k = $5,000
Final LLM Answer Synthesis: $5
So Traditional RAG will cost about $6.30 and have very low latency (~30s), whereas Lamborghini RAG will cost about $15,005 and have extremely high latency (> 60min). But for a diligence workflow like this, a firm could stand to lose millions of dollars (or spend over $1 million contracting a law firm to run the due diligence).
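For concreteness, here is the arithmetic behind those totals, using the per-unit rates assumed above (~$0.13 per 1M embedding tokens, $1 per per-document answer, $5 per synthesis call):

docs = 10_000
total_tokens = 10_000_000

# Traditional RAG: embed everything once, then one LLM answer over retrieved context
traditional = (total_tokens / 1_000_000) * 0.13 + 5.00          # -> $6.30

# Lamborghini RAG: one LLM answer per document, plus layered synthesis
per_doc_answers = 1.00 * docs                                   # -> $10,000
intermediate = 5.00 * (docs // 10)    # one synthesis per 10 docs -> $5,000
final_synthesis = 5.00
lamborghini = per_doc_answers + intermediate + final_synthesis  # -> $15,005

print(f"Traditional: ${traditional:,.2f}, Lamborghini: ${lamborghini:,.2f}")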
So it’s worth asking: is your use case worth it?
Appendix:
Lamborghini RAG Pseudocode & Flow Chart
I have a hunch this is similar to what Google Deep Research does.
# Each helper below (break_down_user_question, is_relevant, extract_answer,
# identify_new_questions, synthesize_answer) is its own LLM call.

# 1. Initialize data structures
all_unanswered_questions = []
questions_answer_pairs = {}  # key: question, value: (answer, sources_used)

# 2. Break down the user's original question into multiple subquestions
subquestions = break_down_user_question(user_input)
all_unanswered_questions.extend(subquestions)

# 3. Keep working while there are still unanswered questions
while all_unanswered_questions:
    current_question = all_unanswered_questions.pop()
    for document in all_documents:
        # Skip documents that are not relevant to the current question
        if not is_relevant(document, current_question):
            continue
        # Attempt to find an answer to current_question in this document
        answer = extract_answer(document, current_question)
        if answer:
            # Store or update the answer and the sources of information
            prior_sources = questions_answer_pairs.get(current_question, (None, []))[1]
            questions_answer_pairs[current_question] = (answer, prior_sources + [document.source_name])
        else:
            # Identify new questions that arise from analyzing this document
            for question in identify_new_questions(document, current_question):
                # Queue only questions we haven't answered or already queued
                if question not in questions_answer_pairs and question not in all_unanswered_questions:
                    all_unanswered_questions.append(question)

# 4. Once no new unanswered questions appear, synthesize a final response
final_answer = synthesize_answer(questions_answer_pairs)

# 5. Return the final answer, ensuring the response cites all sources used
print(final_answer)
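As a hedged illustration of what one of those helpers might look like (the model, prompt, and NO_ANSWER sentinel are assumptions of mine, not from the flow chart):

from openai import OpenAI

client = OpenAI()

def extract_answer(document, question):
    # Returns an answer grounded in this one document, or "" if it has none
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Document ({document.source_name}):\n{document.text}\n\n"
            f"Question: {question}\n"
            "Answer using only this document, or reply with exactly NO_ANSWER."
        )}],
    )
    text = resp.choices[0].message.content
    return "" if "NO_ANSWER" in text else text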