Lamborghini RAG

If you had $1 million to build a RAG system for someone, how would you do it?

As companies adopt AI into more of their systems, the focus needs to be on reducing hallucinations and increasing accuracy. Since reasoning models are gaining popularity, the next wave of AI applications will have very high latency, which makes this a great opportunity to rethink the v1.0 AI stack. Embedding models are built for low-latency applications that can afford a loss of quality.

Agentic workflows have a higher bar for accuracy and are generally more expensive, which gives developers more flexibility when designing their AI stack. To design high-accuracy systems, I propose Lamborghini RAG: a RAG system that ensures the highest-quality output from the LLM given a large corpus of data.

The Problem with Current RAG Systems

Current RAG systems are incredibly lossy and do a poor job of finding the correct answer. Today’s RAG systems use embedding models, which are great because they are optimized for latency and cost. The tradeoff is that compressing text into fixed-size vectors is incredibly lossy: any detail the vector fails to capture can never be retrieved.

The basic RAG system today is simple: embed the corpus ahead of time, embed the user’s question, retrieve the most similar chunks, and pass them to an LLM.

These RAG systems do a pretty decent job of answering questions over data sources; however, they frequently miss important information because the question’s embedding doesn’t match the correct source embeddings.
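
To make that tradeoff concrete, here is a minimal sketch of the embedding-based pipeline in Python. It is an illustration, not anyone’s production system: embed() is a placeholder for whatever embedding model you call, and retrieval is plain cosine similarity over precomputed chunk vectors.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder: call your embedding model of choice here.
        raise NotImplementedError

    def build_index(chunks: list[str]) -> np.ndarray:
        # Offline step: embed every chunk of the corpus once.
        return np.stack([embed(chunk) for chunk in chunks])

    def retrieve(question: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
        # Embed the question and return the k chunks with the highest
        # cosine similarity. Only these k chunks ever reach the LLM.
        q = embed(question)
        sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(sims)[-k:][::-1]]

Everything retrieve() fails to surface is invisible to the LLM, which is exactly where the misses come from.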

Infinite Context RAG

The goal of Lamborghini RAG is to mimic what I think the future of infinite-context-window LLMs will look like.

So how do we mimic infinite context?

Only use LLMs.

For every question the user asks, we pass every single source through an LLM. This is exceedingly expensive, but it almost guarantees that the LLM will see the “correct” context required to fully answer any question.
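
In its simplest form this is a map-reduce over the corpus. Here is a minimal sketch under stated assumptions: llm() is a hypothetical helper wrapping a single reasoning-model call, and the batched reduce step mirrors the intermediate synthesis costed later in this post.

    def llm(prompt: str) -> str:
        # Hypothetical helper: one call to a reasoning model.
        raise NotImplementedError

    def answer_over_everything(question: str, documents: list[str]) -> str:
        # Map: ask the LLM about every single document. There is no
        # retrieval step, so no document can be missed.
        partials = [
            llm(f"Document:\n{doc}\n\nQuestion: {question}\n"
                "Answer only from this document; say 'no evidence' if absent.")
            for doc in documents
        ]

        # Reduce: synthesize per-document answers in batches of 10 so no
        # single synthesis prompt grows unboundedly (10k -> 1k -> ... -> 1).
        while len(partials) > 1:
            partials = [
                llm("Synthesize these partial answers:\n" + "\n---\n".join(batch))
                for batch in (partials[i:i + 10] for i in range(0, len(partials), 10))
            ]
        return partials[0]

In the due diligence example below, the poison-pill screen is just one question fed through this loop.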

Example: Private Equity Due Diligence

Scenario: Suppose a private equity firm is considering the acquisition of a large B2B enterprise software provider. This target company has thousands of customer contracts, each with unique service-level agreements (SLAs), pricing structures, and termination clauses. Plus, they’ve got a mountain of partnership agreements and vendor contracts. Any single “poison pill” clause (like early termination provisions triggered by a change of control) could blow up the deal’s economics.

So how do you go through every single document and ensure there are no “poison pills” hiding in them? Clearly, for this use case Lamborghini RAG will yield higher-accuracy results than Traditional RAG. Let’s compare the cost and latency of both, assuming 10k documents totaling about 10M tokens.

Traditional RAG:

Document Processing with Embedding Models (10M tokens): ~$1.30

LLM Answering (200k context): ~$5

Lamborghini RAG:

Per-Document LLM Answer: $1 * 10k = $10,000

Intermediate LLM Synthesis (every 10 documents): $5 * 1k = $5,000

Final LLM Answer Synthesis: $5

So Traditional RAG will cost about $6.30 and have very low latency (~30s), whereas Lamborghini RAG will cost about $15,005 and have extremely high latency (> 60min). But for a diligence workflow like this, a firm could stand to lose millions of dollars (or spend over $1 million contracting a law firm to run the due diligence).
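
For transparency, here is the back-of-the-envelope math above as a runnable sketch; every per-unit price is a rough estimate from this post, not exact API pricing.

    NUM_DOCS = 10_000

    # Traditional RAG: embed the corpus once, then answer with one LLM call.
    traditional = 1.30 + 5.00  # 10M tokens embedded + one 200k-context answer

    # Lamborghini RAG: one LLM call per document plus synthesis passes.
    per_doc = 1.00 * NUM_DOCS               # $1 per per-document answer
    intermediate = 5.00 * (NUM_DOCS // 10)  # one synthesis per 10 documents
    final = 5.00                            # final answer synthesis
    lamborghini = per_doc + intermediate + final

    print(f"Traditional RAG: ${traditional:.2f}")    # $6.30
    print(f"Lamborghini RAG: ${lamborghini:,.2f}")   # $15,005.00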

So it’s worth asking: is your use case worth it?

Appendix:

Lamborghini RAG Pseudocode & Flow Chart

I have a hunch this is similar to what Google Deep Research does.

# Runnable Python version of the flow. Each helper below is assumed to be
# a single LLM call; the prompt implementations are omitted.

def break_down_user_question(user_input):
    """LLM call: split the user's question into subquestions."""
    raise NotImplementedError

def is_relevant(document, question):
    """LLM call: decide whether this document is relevant to the question."""
    raise NotImplementedError

def extract_answer(document, question):
    """LLM call: answer the question from this document, or return None."""
    raise NotImplementedError

def identify_new_questions(document, question):
    """LLM call: list follow-up questions raised by this document."""
    raise NotImplementedError

def synthesize_answer(questions_answer_pairs):
    """LLM call: synthesize all answers into one final, cited response."""
    raise NotImplementedError

def lamborghini_rag(user_input, all_documents):
    # 1. Initialize data structures.
    all_unanswered_questions = []
    questions_answer_pairs = {}  # key: question, value: (answer, sources_used)

    # 2. Break down the user's original question into multiple subquestions.
    all_unanswered_questions.extend(break_down_user_question(user_input))

    # 3. Keep working while there are still unanswered questions.
    while all_unanswered_questions:
        current_question = all_unanswered_questions.pop()

        for document in all_documents:
            # Skip documents that are not relevant to the current question.
            if not is_relevant(document, current_question):
                continue

            # Attempt to answer the current question from this document.
            answer = extract_answer(document, current_question)

            if answer:
                # Store or update the answer, accumulating every source used.
                _, sources = questions_answer_pairs.get(current_question, (None, []))
                questions_answer_pairs[current_question] = (
                    answer,
                    sources + [document.source_name],
                )
            else:
                # The document is relevant but yields no answer; queue any
                # new questions it raises that haven't been seen before.
                for question in identify_new_questions(document, current_question):
                    if (question not in questions_answer_pairs
                            and question not in all_unanswered_questions):
                        all_unanswered_questions.append(question)

    # 4. After processing all documents (until no new unanswered questions
    #    appear), synthesize a final response.
    # 5. Return the final answer; the synthesis cites all sources used.
    return synthesize_answer(questions_answer_pairs)