Compilation of RAG Benchmarks with examples

Let's explore practical examples for a few of the key RAG evaluation metrics and how they might be applied in real-world scenarios.


1. RAGAS (Retrieval Augmented Generation Assessment)

Scenario: Medical Information Retrieval System

Imagine a system where a user queries about a specific medical condition, such as "What are the common symptoms of Lyme disease?". The RAGAS framework would evaluate the LLM’s response by breaking it down into individual statements:

- Response: "Lyme disease commonly presents with fever, headache, fatigue, and a characteristic skin rash called erythema migrans."

- RAGAS Breakdown: The LLM splits the response into smaller parts:

1. "Lyme disease commonly presents with fever."

2. "Lyme disease commonly presents with a headache."

3. "Lyme disease commonly presents with fatigue."

4. "Lyme disease commonly presents with erythema migrans."

- Verification: Each statement is checked against the retrieved medical literature:

- If the literature confirms these symptoms, the statements score 1; if not, they score 0. The overall faithfulness score is the average of these individual scores.

This method ensures that the response is consistent with the provided context, though it might miss nuances like the interrelationship between symptoms.
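
As a rough sketch of how this faithfulness score could be computed, the snippet below splits the answer into statements, asks a judge model to verify each one against the retrieved context, and averages the verdicts. The `llm()` helper is an assumed placeholder for any chat-model call, and the function names are illustrative rather than the actual ragas library API.

```python
# Sketch of RAGAS-style faithfulness scoring (illustrative; not the ragas library API).

def llm(prompt: str) -> str:
    """Assumed placeholder: send a prompt to any chat model and return its reply."""
    raise NotImplementedError("Wire this up to your LLM provider.")

def split_into_statements(answer: str) -> list[str]:
    # Ask the judge model to decompose the answer into short, self-contained statements.
    prompt = (
        "Break the following answer into short, self-contained statements, "
        f"one per line:\n\n{answer}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def faithfulness(answer: str, context: str) -> float:
    statements = split_into_statements(answer)
    verdicts = []
    for statement in statements:
        prompt = (
            "Answer strictly Yes or No: can the statement be inferred from the "
            f"context?\n\nContext:\n{context}\n\nStatement:\n{statement}"
        )
        verdicts.append(1 if llm(prompt).strip().lower().startswith("yes") else 0)
    # Faithfulness = supported statements / total statements.
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```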

Reference:

- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv preprint arXiv:2309.15217. [RAGAS arXiv 2023](https://arxiv.org/abs/2309.15217)


2. TruLens Groundedness

Scenario: Legal Document Analysis

Consider a system where a user asks, "What does the contract say about early termination clauses?" The LLM generates a response based on the retrieved sections of a contract:

- Response: "The contract allows early termination with a 30-day notice if either party breaches any material term."

- TruLens Groundedness Check:

- The response is broken into sentences, and each is compared to the relevant parts of the contract:

1. "The contract allows early termination with a 30-day notice" is grounded in the clause stating notice periods.

2. "If either party breaches any material term" is grounded in the conditions for termination.

- The LLM scores the overlap between these statements and the contract on a 0-to-10 scale, which is then normalized. This method helps assess how well the response aligns with the specific language and conditions in the contract.
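
A minimal sketch of that sentence-level groundedness check is shown below. The `llm()` helper is an assumed stand-in for any judge model, and the function names are illustrative rather than the actual TruLens feedback-function API.

```python
# Illustrative groundedness check in the spirit of TruLens (not its actual API).
import re

def llm(prompt: str) -> str:
    """Assumed placeholder: send a prompt to any chat model and return its reply."""
    raise NotImplementedError("Wire this up to your LLM provider.")

def groundedness(response: str, source: str) -> float:
    # Split the response into sentences and score each against the source text.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    scores = []
    for sentence in sentences:
        prompt = (
            "On a scale of 0 to 10, how well is the statement supported by the source? "
            f"Reply with a single number.\n\nSource:\n{source}\n\nStatement:\n{sentence}"
        )
        match = re.search(r"\d+", llm(prompt))
        scores.append(min(int(match.group()), 10) if match else 0)
    # Normalize the 0-10 judge scores to a 0-1 groundedness score.
    return sum(scores) / (10 * len(scores)) if scores else 0.0
```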

Reference:

- Petroni, F., et al. (2021). "KILT: A Benchmark for Knowledge Intensive Language Tasks." In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021). [KILT NAACL 2021](https://aclanthology.org/2021.naacl-main.200/)


3. ChainPoll - Context Adherence

Scenario: Financial Report Summarization

In this scenario, a user asks a system, "What were the key financial highlights of Company X in Q2 2024?" The LLM produces a summary:

- Response: "Company X reported a 15% increase in revenue, a net profit margin of 12%, and a 5% year-over-year increase in operating expenses."

- ChainPoll Process:

1. CoT Prompting: The LLM explains its reasoning step-by-step:

- "The revenue data shows a 15% increase compared to last quarter."

- "The profit margin is calculated based on net income and revenue."

- "Operating expenses increased due to expanded operations."

2. Polling: The same judgment prompt is issued multiple times, and the verdicts are averaged; consistent agreement across the polls increases confidence in the summary.

This method filters out potential errors and provides a confidence score, ensuring that the summary accurately reflects the financial report's content.
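
A hedged sketch of the polling step is below: the same chain-of-thought judgment prompt is sent to a judge model several times, and the fraction of "Yes" verdicts becomes the context-adherence score. The `llm()` helper is an assumed placeholder, not Galileo's actual ChainPoll implementation.

```python
# Sketch of the ChainPoll idea: poll a judge LLM several times with a
# chain-of-thought prompt and average the Yes/No verdicts into a score.

def llm(prompt: str) -> str:
    """Assumed placeholder: send a prompt to any chat model and return its reply."""
    raise NotImplementedError("Wire this up to your LLM provider.")

def chainpoll_adherence(response: str, context: str, n_polls: int = 5) -> float:
    prompt = (
        "Think step by step, then on the final line answer only Yes or No: "
        "is every claim in the response supported by the context?\n\n"
        f"Context:\n{context}\n\nResponse:\n{response}"
    )
    votes = []
    for _ in range(n_polls):
        lines = llm(prompt).strip().splitlines()
        verdict = lines[-1].lower() if lines else ""
        votes.append(1 if verdict.startswith("yes") else 0)
    # Context adherence = fraction of polls whose final verdict was Yes.
    return sum(votes) / n_polls
```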

Reference:

- Kwiatkowski, T., et al. (2019). "Natural Questions: A Benchmark for Question Answering Research." In Transactions of the Association for Computational Linguistics (TACL 2019). [Natural Questions TACL 2019](https://transacl.org/ojs/index.php/tacl/article/view/1455)


4. Galileo Luna

Scenario: Product Recommendation System

Imagine a scenario where a user asks, "Which laptop is best for graphic design?" The LLM retrieves several product descriptions and generates a recommendation:

- Response: "The MacBook Pro with M2 chip is the best for graphic design due to its high-resolution display, powerful GPU, and optimized software."

- Luna Evaluation:

- Intelligent Chunking: The response is split into the claims "high-resolution display," "powerful GPU," and "optimized software," and each claim is checked against the retrieved product descriptions.

- Multi-task Training: Luna evaluates these aspects concurrently, ensuring that all relevant data points are considered.

- Token-Level Evaluation: Each word in the response is validated against the context, ensuring precision in the recommendation.

Performance: The evaluation reveals that all claims are consistent with the retrieved context, confirming the recommendation's reliability. Luna's ability to process long descriptions quickly and accurately makes it well-suited for product recommendations where many features need to be considered.
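
Luna itself is a fine-tuned evaluation model, so its internals cannot be reproduced in a few lines. As a crude, fully hedged stand-in, the sketch below chunks the response into claims and measures how much of each claim's vocabulary is covered by the retrieved product description; Luna replaces this overlap heuristic with learned, token-level support judgments. The description and answer strings are invented purely for illustration.

```python
# Crude stand-in for Luna's chunk-and-verify flow: split the response into claims
# and measure how much of each claim's vocabulary the retrieved context covers.
# Luna replaces this overlap heuristic with a fine-tuned, token-level evaluator.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "with", "and", "its", "to", "for", "of", "due", "best"}

def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z0-9\-]+", text.lower()) if w not in STOPWORDS}

def claim_support(response: str, context: str) -> dict[str, float]:
    context_vocab = content_words(context)
    # Split the response into rough claims on commas, "and", and sentence ends.
    claims = [c.strip() for c in re.split(r",|\band\b|\.", response) if c.strip()]
    return {
        claim: len(content_words(claim) & context_vocab) / max(len(content_words(claim)), 1)
        for claim in claims
    }

# Hypothetical product description and response for the laptop scenario above.
description = ("MacBook Pro with M2 chip: high-resolution Liquid Retina XDR display, "
               "powerful 10-core GPU, and software optimized for creative work.")
answer = ("The MacBook Pro with M2 chip is the best for graphic design due to its "
          "high-resolution display, powerful GPU, and optimized software.")
print(claim_support(answer, description))
```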

Reference:

- Nguyen, T., et al. (2016). "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." In Proceedings of the NIPS 2016 Competition Track. [MS MARCO NIPS 2016](https://arxiv.org/abs/1611.09268)


5. ChatRAG-Bench

Scenario: Academic Research Assistance

Suppose a researcher asks, "What are the most recent findings in AI ethics?" The LLM retrieves and synthesizes information from multiple academic papers:

- Response: "Recent findings highlight concerns about bias in AI algorithms, the need for transparency, and the importance of ethical guidelines in AI development."

- ChatRAG-Bench Evaluation:

- The response is evaluated against long and short academic documents in the dataset.

- The benchmark assesses how well the LLM can synthesize complex information and identify key points relevant to AI ethics.

Dataset Performance: The LLM might excel in summarizing structured data from short papers but struggle with more complex, long-form documents. This feedback helps improve the LLM's ability to handle complex academic research tasks.
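
Benchmarks of this kind typically score free-form answers by lexical overlap with a reference answer. The snippet below is a self-contained token-level F1 computation of the sort ChatRAG-Bench-style harnesses report; the reference answer is invented purely for illustration, and the real benchmark's per-dataset metrics may differ.

```python
# Token-level F1 between a generated answer and a reference answer: the kind of
# lexical-overlap scoring that conversational-QA benchmarks commonly report.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count how many tokens the prediction and reference share (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer, used only to illustrate the scoring.
reference = ("Recent work highlights bias in AI algorithms, the need for transparency, "
             "and ethical guidelines for AI development.")
prediction = ("Recent findings highlight concerns about bias in AI algorithms, the need "
              "for transparency, and the importance of ethical guidelines in AI development.")
print(round(token_f1(prediction, reference), 3))
```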

Reference:

- Thakur, N., et al. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Datasets and Benchmarks Track. [BEIR NeurIPS 2021](https://arxiv.org/abs/2104.08663)
