Google DeepMind investigated inference scaling for long-context RAG
TuringPost
Google DeepMind explored how to scale inference in RAG effectively:
- They introduced new DRAG and IterDRAG strategies
- Discovered “inference scaling laws” for RAG
- Developed a model that predicts the optimal RAG settings for a given compute budget
Here are the details:
In DRAG (demonstration-based RAG), the input is expanded by adding in-context examples and relevant documents to the prompt. It retrieves top-ranked documents (e.g., from Wikipedia) and orders them by relevance, giving the model rich context for generating the answer in a single step.
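To make the idea concrete, here is a minimal sketch of how a DRAG-style prompt could be assembled. The `retrieve` callable, the example format, and the prompt layout are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of DRAG-style prompt assembly (illustrative layout,
# not the paper's exact format).

def build_drag_prompt(query, retrieve, examples, k=10):
    """Assemble one long prompt: in-context demonstrations first,
    then the top-k retrieved documents, then the test question."""
    parts = []
    for ex in examples:
        # Each demonstration carries its own documents, question,
        # and gold answer, mirroring the test-time layout.
        parts.append("\n".join(ex["documents"]))
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
    docs = retrieve(query, k)  # e.g. top-k Wikipedia passages
    parts.append("\n".join(docs))
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)  # the model answers in one step
```

Scaling the input up then means retrieving more documents (larger `k`) and adding more demonstrations.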
IterDRAG (iterative demonstration-based RAG) is used for questions that need multiple steps to answer. It breaks a complex query into manageable sub-queries: the model is prompted to generate the decomposition itself, retrieving additional documents and producing interim answers as it works through each sub-query.
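Conceptually, the loop interleaves retrieval and generation. Below is a rough sketch assuming hypothetical `llm(prompt) -> str` and `retrieve(query, k) -> list[str]` callables and a self-ask-style prompt layout; the paper's actual prompting format differs in detail:

```python
def format_prompt(query, docs, trace):
    # Hypothetical self-ask-style layout: documents, then the
    # sub-query/answer trace so far, then the current question.
    steps = "\n".join(f"Follow up: {q}\nIntermediate answer: {a}"
                      for q, a in trace)
    return "\n\n".join(["\n".join(docs), steps, f"Q: {query}\nA:"])

def iterdrag_answer(query, llm, retrieve, k=5, max_iters=5):
    context = list(retrieve(query, k))   # documents gathered so far
    trace = []                           # (sub-query, answer) pairs
    for _ in range(max_iters):
        step = llm(format_prompt(query, context, trace)).strip()
        if step.lower().startswith("so the final answer is:"):
            return step.split(":", 1)[1].strip()
        # Otherwise treat the output as the next sub-query:
        # retrieve fresh documents for it, answer it, and keep both.
        context += retrieve(step, k)
        answer = llm(format_prompt(step, context, trace)).strip()
        trace.append((step, answer))
    return llm(format_prompt(query, context, trace))  # budget exhausted
```

Each iteration adds documents and an intermediate answer to the context, which is why IterDRAG can keep absorbing compute long after a single-step prompt has saturated.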
Scaling advantage:
While original RAG improves only up to about 128k tokens and then levels off, DRAG keeps improving up to 1M tokens, and IterDRAG up to 5M tokens.
DRAG performs better with shorter budgets (16k and 32k), while IterDRAG is more effective at larger scales (128k and beyond).
- Near-linear growth: under optimal settings, RAG performance improves almost linearly as test-time compute increases.
- For budgets over 100k tokens, IterDRAG keeps improving steadily, putting context beyond 128k tokens to effective use.
- Diminishing returns beyond 1M: performance gains slow down between 1M and 5M tokens.
The computation allocation model boosts performance by choosing the best settings (number of documents, in-context examples, and iterations) for the available context length. It works best with contexts under 1M tokens and generalizes well, but its accuracy drops at 5M tokens.
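Here is one way such an allocation model could be used at inference time. Everything in this sketch (the candidate grid, the per-document and per-example token costs, and the `predict_performance` callable standing in for the paper's fitted predictor) is an illustrative assumption:

```python
# Sketch: pick RAG settings under a token budget using a fitted
# performance predictor. All constants here are assumptions.
from itertools import product

DOC_TOKENS, EXAMPLE_TOKENS = 150, 300   # assumed average lengths

def choose_config(budget_tokens, predict_performance):
    best, best_score = None, float("-inf")
    for docs, shots, iters in product([10, 20, 50], [1, 2, 4, 8], [1, 2, 4]):
        # Rough cost model: every iteration re-reads the documents
        # and the in-context examples.
        cost = iters * (docs * DOC_TOKENS + shots * EXAMPLE_TOKENS)
        if cost > budget_tokens:
            continue                     # configuration over budget
        score = predict_performance(docs, shots, iters)
        if score > best_score:
            best, best_score = (docs, shots, iters), score
    return best  # (documents, examples, iterations) to run with
```

The point of the paper's model is exactly this kind of decision: given a fixed context budget, spend it on the mix of documents, demonstrations, and iterations that the predictor expects to pay off most.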
Original paper: https://arxiv.org/pdf/2410.04343