Google DeepMind investigated inference scaling for long-context RAG
TuringPost
Google DeepMind explored how to scale inference in RAG effectively:
- They introduced new DRAG and IterDRAG strategies
- Discovered “inference scaling laws” for RAG
- Developed a model that predicts the optimal RAG settings for a given compute budget
Here are the details:
In DRAG (demonstration-based RAG), the input is expanded by adding in-context examples and relevant documents to the prompt. It retrieves top-ranked documents (e.g., from Wikipedia) and orders them by relevance, giving the model rich context for generating the answer in a single step.
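To make the idea concrete, here is a minimal sketch of how a DRAG-style prompt could be assembled. The `retrieve` callable, the example format, and the prompt layout are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of DRAG-style prompt assembly (illustrative layout,
# not the paper's exact format).

def build_drag_prompt(query, retrieve, examples, k=10):
    """Assemble one long prompt: in-context demonstrations first,
    then the top-k retrieved documents, then the test question."""
    parts = []
    for ex in examples:
        # Each demonstration carries its own documents, question,
        # and gold answer, mirroring the test-time layout.
        parts.append("\n".join(ex["documents"]))
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
    docs = retrieve(query, k)  # e.g. top-k Wikipedia passages
    parts.append("\n".join(docs))
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)  # the model answers in one step
```

Scaling the input up then means retrieving more documents (larger `k`) and adding more demonstrations.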
IterDRAG (iterative demonstration-based RAG) is used for questions that need multiple steps to answer. It breaks a complex query into manageable sub-queries: the model is prompted to generate the decomposition itself, retrieving additional documents and producing interim answers as it works through each sub-query.
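Conceptually, the loop interleaves retrieval and generation. Below is a rough sketch assuming hypothetical `llm(prompt) -> str` and `retrieve(query, k) -> list[str]` callables and a self-ask-style prompt layout; the paper's actual prompting format differs in detail:

```python
def format_prompt(query, docs, trace):
    # Hypothetical self-ask-style layout: documents, then the
    # sub-query/answer trace so far, then the current question.
    steps = "\n".join(f"Follow up: {q}\nIntermediate answer: {a}"
                      for q, a in trace)
    return "\n\n".join(["\n".join(docs), steps, f"Q: {query}\nA:"])

def iterdrag_answer(query, llm, retrieve, k=5, max_iters=5):
    context = list(retrieve(query, k))   # documents gathered so far
    trace = []                           # (sub-query, answer) pairs
    for _ in range(max_iters):
        step = llm(format_prompt(query, context, trace)).strip()
        if step.lower().startswith("so the final answer is:"):
            return step.split(":", 1)[1].strip()
        # Otherwise treat the output as the next sub-query:
        # retrieve fresh documents for it, answer it, and keep both.
        context += retrieve(step, k)
        answer = llm(format_prompt(step, context, trace)).strip()
        trace.append((step, answer))
    return llm(format_prompt(query, context, trace))  # budget exhausted
```

Each iteration adds documents and an intermediate answer to the context, which is why IterDRAG can keep absorbing compute long after a single-step prompt has saturated.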
Scaling advantage:
While original RAG improves only up to about 128k tokens and then levels off, DRAG keeps improving up to 1M tokens, and IterDRAG up to 5M tokens.
DRAG performs better with shorter budgets (16k and 32k), while IterDRAG is more effective at larger scales (128k and beyond).
- Near-linear growth: under optimal settings, RAG performance improves almost linearly as test-time compute increases.
- For budgets over 100k tokens, IterDRAG keeps improving steadily, putting context beyond 128k tokens to effective use.
- Diminishing returns beyond 1M: performance gains slow down between 1M and 5M tokens.
The computation allocation model boosts performance by choosing the best settings (number of documents, in-context examples, and iterations) for the available context length. It works best with contexts under 1M tokens and generalizes well, but its accuracy drops at 5M tokens.
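Here is one way such an allocation model could be used at inference time. Everything in this sketch (the candidate grid, the per-document and per-example token costs, and the `predict_performance` callable standing in for the paper's fitted predictor) is an illustrative assumption:

```python
# Sketch: pick RAG settings under a token budget using a fitted
# performance predictor. All constants here are assumptions.
from itertools import product

DOC_TOKENS, EXAMPLE_TOKENS = 150, 300   # assumed average lengths

def choose_config(budget_tokens, predict_performance):
    best, best_score = None, float("-inf")
    for docs, shots, iters in product([10, 20, 50], [1, 2, 4, 8], [1, 2, 4]):
        # Rough cost model: every iteration re-reads the documents
        # and the in-context examples.
        cost = iters * (docs * DOC_TOKENS + shots * EXAMPLE_TOKENS)
        if cost > budget_tokens:
            continue                     # configuration over budget
        score = predict_performance(docs, shots, iters)
        if score > best_score:
            best, best_score = (docs, shots, iters), score
    return best  # (documents, examples, iterations) to run with
```

The point of the paper's model is exactly this kind of decision: given a fixed context budget, spend it on the mix of documents, demonstrations, and iterations that the predictor expects to pay off most.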
Original paper: https://arxiv.org/pdf/2410.04343