Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Credit: https://arxiv.org/pdf/2409.12941

Today's paper introduces FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a new evaluation dataset for testing retrieval-augmented generation (RAG) systems. FRAMES provides a unified framework to assess RAG systems' ability to retrieve relevant information, reason across multiple documents, and generate factual responses. The dataset comprises challenging multi-hop questions that require integrating information from multiple sources.

Overview

FRAMES consists of 824 challenging questions that require information from 2-15 Wikipedia articles to answer correctly. The questions are designed to test multiple aspects of RAG systems, including factual accuracy, retrieval capabilities, and complex reasoning.
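
For readers who want to inspect the data directly, here is a minimal loading sketch. The Hugging Face dataset id `google/frames-benchmark`, the split name, and the printed fields are assumptions about the public release rather than details from this summary; check the dataset card for the exact schema.

```python
# A minimal sketch of loading FRAMES for inspection. The dataset id
# and split name are assumptions about the public release.
from datasets import load_dataset

frames = load_dataset("google/frames-benchmark", split="test")

print(f"{len(frames)} questions")  # expected: 824
example = frames[0]
for key, value in example.items():  # print whatever fields the release defines
    print(f"{key}: {str(value)[:80]}")
```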

The dataset was created through a combination of synthetic data generation attempts and human annotation. The authors initially experimented with using large language models to generate questions, but this approach produced a high proportion of hallucinated content, so they pivoted to human annotators, who wrote high-quality questions grounded in information from multiple Wikipedia articles.

The questions in FRAMES cover various reasoning types, including numerical reasoning, tabular reasoning, multiple constraints, temporal reasoning, and post-processing. Each question is carefully crafted to require the integration of information from multiple sources, simulating real-world query scenarios.

To ensure the dataset's quality and effectiveness, the authors implemented several quality checks: verifying the correctness of answers, adding temporal disambiguation so that answers do not become ambiguous as facts change, ensuring a large output space to discourage guessing, and mitigating potential contamination by designing questions that require reasoning beyond simple fact retrieval.

Results

The paper presents baseline results using state-of-the-art language models like Gemini-Pro and Gemma. In single-step evaluations, models achieved relatively low accuracy (around 40-47%) when given the question alone or with limited retrieved context. However, performance improved significantly (up to 72% accuracy) when provided with all relevant Wikipedia articles.
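
To make the single-step setup concrete, the sketch below contrasts the two conditions reported above: answering from the question alone versus with all relevant articles prepended ("oracle" retrieval). The `generate` and `judge` callables are placeholders for an LLM call and an answer-matching autorater; this is an illustrative harness, not the paper's exact one.

```python
# A hedged sketch of the two single-step conditions: question alone
# vs. oracle retrieval with all relevant articles in the prompt.

def build_prompt(question: str, articles: list[str] | None = None) -> str:
    """Assemble a single-step prompt, optionally with retrieved context."""
    if not articles:
        return f"Answer the question.\n\nQuestion: {question}\nAnswer:"
    context = "\n\n".join(articles)
    return (
        "Answer the question using only the articles below.\n\n"
        f"Articles:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def evaluate_single_step(dataset, generate, judge) -> float:
    """Fraction of questions answered correctly in a single model call."""
    correct = 0
    for ex in dataset:
        prediction = generate(build_prompt(ex["question"], ex.get("articles")))
        correct += judge(prediction, ex["answer"])  # True/False per question
    return correct / len(dataset)
```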

Multi-step evaluations, where models iteratively retrieve and reason across multiple steps, showed promising results. By implementing a search planning strategy, model performance reached 66% accuracy, approaching the oracle performance of 73%. This demonstrates the potential for improving RAG systems through more sophisticated retrieval and reasoning strategies.
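
The sketch below illustrates one way such an iterative loop could look: at each step the model either emits a search query or commits to an answer. The prompt wording, the ANSWER/SEARCH protocol, and the step cap are illustrative assumptions, not the paper's exact search planning strategy; `generate` and `search` stand in for an LLM call and a retrieval backend.

```python
# A minimal sketch of a multi-step retrieve-and-reason loop. All
# prompt wording and the control protocol are assumptions.

MAX_STEPS = 5  # illustrative cap on retrieval rounds

def multi_step_answer(question: str, generate, search) -> str:
    notes: list[str] = []
    for _ in range(MAX_STEPS):
        plan = generate(
            f"Question: {question}\n"
            f"Notes so far:\n{chr(10).join(notes) or '(none)'}\n"
            "If you can answer, reply 'ANSWER: <answer>'. Otherwise reply "
            "'SEARCH: <query>' for the next fact you need."
        )
        if plan.startswith("ANSWER:"):
            return plan.removeprefix("ANSWER:").strip()
        query = plan.removeprefix("SEARCH:").strip()
        notes.extend(search(query))  # top retrieved passages for the query
    # Fall back to a best-effort answer from the accumulated notes.
    return generate(f"Question: {question}\nNotes:\n{chr(10).join(notes)}\nAnswer:")
```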

Conclusion

FRAMES provides a unified framework that tests factuality, retrieval, and reasoning capabilities simultaneously. The challenging nature of the dataset highlights current limitations in state-of-the-art models. For more information, please consult the full paper.

Congrats to the authors for their work!

Krishna, Satyapriya, et al. "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation." arXiv preprint arXiv:2409.12941 (2024).
