Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Today's paper introduces FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a new evaluation dataset for testing retrieval-augmented generation (RAG) systems. FRAMES provides a unified framework to assess RAG systems' ability to retrieve relevant information, reason across multiple documents, and generate factual responses. The dataset comprises challenging multi-hop questions that require integrating information from multiple sources.
Overview
FRAMES consists of 824 challenging questions that require information from 2-15 Wikipedia articles to answer correctly. The questions are designed to test multiple aspects of RAG systems, including factual accuracy, retrieval capabilities, and complex reasoning.
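For readers who want to poke at the data, here is a minimal loading sketch. It assumes the dataset is published on Hugging Face under the id google/frames-benchmark with Prompt, Answer, and wiki_links fields; these names are assumptions, so check the dataset card for the actual schema.

```python
# Minimal sketch: load FRAMES and inspect one example.
# The dataset id and field names are assumptions; verify them
# against the dataset card before relying on this.
from datasets import load_dataset

frames = load_dataset("google/frames-benchmark", split="test")
print(len(frames))  # should report 824 questions

row = frames[0]
print(row["Prompt"])      # the multi-hop question
print(row["Answer"])      # the gold answer
print(row["wiki_links"])  # the 2-15 source Wikipedia articles
```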
The dataset was created through a combination of synthetic data generation experiments and human annotation. The authors initially used large language models to generate questions, but this approach produced a high proportion of hallucinated content, so they pivoted to human annotators, who wrote high-quality questions grounded in multiple Wikipedia articles.
The questions in FRAMES cover various reasoning types, including numerical reasoning, tabular reasoning, multiple constraints, temporal reasoning, and post-processing. Each question is carefully crafted to require the integration of information from multiple sources, simulating real-world query scenarios.
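As a quick sanity check on that taxonomy, a short script can tally how often each reasoning type appears. This sketch assumes a reasoning_types field with pipe-separated tags; both the field name and the separator are assumptions about the released schema rather than facts from the paper.

```python
# Hypothetical sketch: count reasoning-type tags across the dataset.
# The "reasoning_types" field name and the "|" separator are assumptions.
from collections import Counter

from datasets import load_dataset

frames = load_dataset("google/frames-benchmark", split="test")
counts: Counter[str] = Counter()
for row in frames:
    # A single question may carry several reasoning-type tags.
    for rtype in str(row["reasoning_types"]).split("|"):
        counts[rtype.strip()] += 1
print(counts.most_common())  # e.g. numerical, temporal, tabular, ...
```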
To ensure the dataset's quality and effectiveness, the authors implemented several quality checks: verifying the correctness of answers, adding temporal disambiguation so that answers do not drift as facts change over time, ensuring a large output space to discourage guessing, and mitigating potential training-data contamination by designing questions that require reasoning beyond simple fact retrieval.
Results
The paper presents baseline results using state-of-the-art language models such as Gemini-Pro and Gemma. In single-step evaluations, models achieved relatively low accuracy (around 40-47%) when given the question alone or with limited retrieved context. Performance improved substantially, to roughly 73% accuracy, when models were given all relevant Wikipedia articles (the oracle setting).
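The two ends of that single-step spectrum are easy to reproduce in outline. The sketch below contrasts a naive prompt (question only) with an oracle prompt (all gold articles in context); generate() is a hypothetical stand-in for whichever model is being evaluated, not the paper's actual harness.

```python
# Sketch of the two single-step settings: naive (question only) vs.
# oracle (all gold Wikipedia articles in context). generate() is a
# placeholder for a call to the model under test.
def generate(prompt: str) -> str:
    raise NotImplementedError("wire this to your model / API of choice")

def naive_prompt(question: str) -> str:
    # Measures what the model can answer from parametric memory alone.
    return f"Answer the following question:\n{question}"

def oracle_prompt(question: str, articles: list[str]) -> str:
    # Puts every gold article in context, isolating reasoning from retrieval.
    context = "\n\n".join(articles)
    return (
        "Using only the context below, answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```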
Multi-step evaluations, in which the model iteratively retrieves and reasons over several rounds, showed promising results. With a search planning strategy, accuracy reached 66%, approaching the 73% oracle performance. This demonstrates the potential for improving RAG systems through more sophisticated retrieval and reasoning strategies.
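In the same spirit, here is one way such an iterative loop can be structured. The search() and generate() helpers are hypothetical stand-ins, and the loop is a sketch of the general retrieve-plan-reason pattern, not the paper's exact search-planning implementation.

```python
# Sketch of a multi-step retrieve-and-reason loop. search() and generate()
# are hypothetical helpers; the ANSWER/SEARCH protocol is an assumption.
def search(query: str, k: int = 4) -> list[str]:
    raise NotImplementedError("wire this to your retriever")

def generate(prompt: str) -> str:
    raise NotImplementedError("wire this to your model")

def multi_step_answer(question: str, max_steps: int = 5) -> str:
    notes: list[str] = []  # accumulated retrieved passages
    query = question
    for _ in range(max_steps):
        notes.extend(search(query, k=4))
        context = "\n\n".join(notes)
        step = generate(
            f"Context:\n{context}\n\nQuestion: {question}\n"
            "If the context suffices, reply 'ANSWER: <answer>'. "
            "Otherwise reply 'SEARCH: <next search query>'."
        )
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        # Let the model plan the next retrieval query.
        query = step.removeprefix("SEARCH:").strip()
    # Out of budget: answer with whatever has been gathered so far.
    context = "\n\n".join(notes)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```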
Conclusion
FRAMES provides a unified framework that tests factuality, retrieval, and reasoning capabilities simultaneously. The challenging nature of the dataset highlights current limitations in state-of-the-art models. For more information, please consult the full paper.
Congrats to the authors for their work!
Krishna, Satyapriya, et al. "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation." arXiv preprint arXiv:2409.12941 (2024).