Compilation of RAG Benchmarks with Examples
Let's explore practical examples for a few of the key RAG evaluation metrics and how they might be applied in real-world scenarios.
1. RAGAS (Retrieval Augmented Generation Assessment)
Scenario: Medical Information Retrieval System
Imagine a system where a user queries about a specific medical condition, such as "What are the common symptoms of Lyme disease?". The RAGAS framework would evaluate the LLM’s response by breaking it down into individual statements:
- Response: "Lyme disease commonly presents with fever, headache, fatigue, and a characteristic skin rash called erythema migrans."
- RAGAS Breakdown: The LLM splits the response into smaller parts:
1. "Lyme disease commonly presents with fever."
2. "Lyme disease commonly presents with a headache."
3. "Lyme disease commonly presents with fatigue."
4. "Lyme disease commonly presents with erythema migrans."
- Verification: Each statement is checked against the retrieved medical literature:
- Each statement the literature confirms scores 1; any statement it does not confirm scores 0. The overall faithfulness score is the average of these individual scores.
This method ensures that the response is consistent with the provided context, though it might miss nuances like the interrelationship between symptoms.
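As a rough illustration, here is a minimal Python sketch of that faithfulness calculation. The `keyword_judge` function is only a toy stand-in for the LLM verification step, and the context snippet is invented for the example; the real RAGAS framework performs both the statement splitting and the verification with an LLM.

```python
# Faithfulness sketch: fraction of response statements supported by the retrieved context.
# In RAGAS the statement splitting and verification are both done by an LLM;
# here the statements are hard-coded and a keyword check stands in for the judge.

def faithfulness_score(statements: list[str], context: str, is_supported) -> float:
    """Average of per-statement support scores (1 if supported, else 0)."""
    if not statements:
        return 0.0
    return sum(1.0 if is_supported(s, context) else 0.0 for s in statements) / len(statements)

def keyword_judge(statement: str, context: str) -> bool:
    """Toy judge: a statement counts as supported if any of its content words appear in the context."""
    stopwords = {"lyme", "disease", "commonly", "presents", "with", "a"}
    words = statement.lower().rstrip(".").split()
    return any(w in context.lower() for w in words if w not in stopwords)

context = ("Typical early symptoms of Lyme disease include fever, headache, "
           "fatigue, and the erythema migrans skin rash.")
statements = [
    "Lyme disease commonly presents with fever.",
    "Lyme disease commonly presents with a headache.",
    "Lyme disease commonly presents with fatigue.",
    "Lyme disease commonly presents with erythema migrans.",
]
print(faithfulness_score(statements, context, keyword_judge))  # 1.0: all four statements supported
```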
Reference:
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217. [RAGAS (arXiv 2023)](https://arxiv.org/abs/2309.15217)
2. TruLens Groundedness
Scenario: Legal Document Analysis
Consider a system where a user asks, "What does the contract say about early termination clauses?" The LLM generates a response based on the retrieved sections of a contract:
- Response: "The contract allows early termination with a 30-day notice if either party breaches any material term."
- TruLens Groundedness Check:
- The response is broken into sentences, and each is compared to the relevant parts of the contract:
1. "The contract allows early termination with a 30-day notice" is grounded in the clause stating notice periods.
2. "If either party breaches any material term" is grounded in the conditions for termination.
- The LLM scores the overlap between each statement and the contract on a 0-to-10 scale, which is then normalized to a score between 0 and 1.
This method helps assess how well the response aligns with the specific language and conditions in the contract.
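A simplified sketch of that scoring-and-normalizing step is shown below; `stub_rate` is only a crude placeholder for the LLM rating call, and the contract excerpt is invented for illustration (TruLens exposes groundedness through its feedback functions).

```python
# Groundedness sketch: rate each response sentence 0-10 against the source text,
# then normalize the average rating to a 0-1 groundedness score.
from typing import Callable

def groundedness(sentences: list[str], source: str, rate: Callable[[str, str], int]) -> float:
    """Average per-sentence rating (0-10) normalized to [0, 1]."""
    if not sentences:
        return 0.0
    ratings = [rate(s, source) for s in sentences]
    return sum(ratings) / (10 * len(ratings))

def stub_rate(sentence: str, source: str) -> int:
    """Crude word-overlap heuristic standing in for an LLM that rates support on a 0-10 scale."""
    sent_words = set(sentence.lower().replace(",", "").replace(".", "").split())
    src_words = set(source.lower().replace(",", "").replace(".", "").split())
    return round(10 * len(sent_words & src_words) / max(len(sent_words), 1))

contract_excerpt = ("Either party may terminate this agreement upon thirty (30) days' written "
                    "notice in the event of a material breach by the other party.")
response_sentences = [
    "The contract allows early termination with a 30-day notice",
    "if either party breaches any material term.",
]
print(f"groundedness: {groundedness(response_sentences, contract_excerpt, stub_rate):.2f}")
```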
Reference:
- Petroni, F., et al. (2021). "KILT: A Benchmark for Knowledge Intensive Language Tasks." In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021). [KILT (NAACL 2021)](https://arxiv.org/abs/2009.02252)
3. ChainPoll - Context Adherence
Scenario: Financial Report Summarization
In this scenario, a user asks a system, "What were the key financial highlights of Company X in Q2 2024?" The LLM produces a summary:
- Response: "Company X reported a revenue increase of 15%, a net profit margin of 12%, and a year-over-year increase in operating expenses by 5%."
- ChainPoll Process:
1. CoT Prompting: The LLM explains its reasoning step-by-step:
- "The revenue data shows a 15% increase compared to last quarter."
- "The profit margin is calculated based on net income and revenue."
- "Operating expenses increased due to expanded operations."
2. Polling: The same judgment prompt is run multiple times and the verdicts are averaged into a score. When the runs consistently agree, confidence in the summary increases.
This method filters out potential errors and provides a confidence score, ensuring that the summary accurately reflects the financial report's content.
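A bare-bones version of the polling step could look like the sketch below; `noisy_judge` is a hypothetical stand-in for the chain-of-thought LLM call, and the prompt wording is illustrative rather than ChainPoll's actual prompt.

```python
# ChainPoll-style sketch: poll a chain-of-thought judge several times and
# average the binary verdicts into a context-adherence confidence score.
import random
from typing import Callable

JUDGE_PROMPT = (
    "Think step by step: does the summary below make only claims that are supported "
    "by the retrieved financial report?\n\nReport:\n{context}\n\nSummary:\n{summary}\n\n"
    "After your reasoning, answer 'yes' or 'no'."
)

def chainpoll_score(context: str, summary: str,
                    ask_judge: Callable[[str], bool], n_polls: int = 5) -> float:
    """Fraction of independent chain-of-thought judgments that find the summary adherent."""
    prompt = JUDGE_PROMPT.format(context=context, summary=summary)
    return sum(ask_judge(prompt) for _ in range(n_polls)) / n_polls

def noisy_judge(prompt: str) -> bool:
    """Simulates a mostly-positive but noisy LLM verdict; replace with a real LLM call."""
    return random.random() < 0.8

report = ("Q2 2024: revenue up 15% quarter over quarter, net profit margin 12%, "
          "operating expenses up 5% year over year.")
summary = ("Company X reported a revenue increase of 15%, a net profit margin of 12%, "
           "and a 5% year-over-year increase in operating expenses.")
print(chainpoll_score(report, summary, noisy_judge))  # e.g. 0.8 when 4 of 5 polls answer 'yes'
```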
Reference:
- Kwiatkowski, T., et al. (2019). "Natural Questions: A Benchmark for Question Answering Research." In Transactions of the Association for Computational Linguistics (TACL 2019). [Natural Questions TACL 2019](https://transacl.org/ojs/index.php/tacl/article/view/1455)
4. Galileo Luna
Scenario: Product Recommendation System
Imagine a scenario where a user asks, "Which laptop is best for graphic design?" The LLM retrieves several product descriptions and generates a recommendation:
- Response: "The MacBook Pro with M2 chip is the best for graphic design due to its high-resolution display, powerful GPU, and optimized software."
- Luna Evaluation:
- Intelligent Chunking: The LLM splits the response into "high-resolution display," "powerful GPU," and "optimized software" and checks these claims against the product descriptions.
- Multi-task Training: Luna evaluates these aspects concurrently, ensuring that all relevant data points are considered.
- Token-Level Evaluation: Each word in the response is validated against the context, ensuring precision in the recommendation.
Performance: The evaluation reveals that all claims are consistent with the retrieved context, confirming the recommendation's reliability. Luna's ability to process long descriptions quickly and accurately makes it well-suited for product recommendations where many features need to be considered.
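Luna is offered as a purpose-built evaluation model rather than an open library, so the snippet below is only a toy sketch of the chunk-and-verify idea described above: a keyword check stands in for the trained evaluator, and the claim chunks and product text are invented for the example.

```python
# Toy illustration of chunk-level adherence checking (not Galileo's actual Luna model):
# each claim chunk from the recommendation is verified against the retrieved context.

def supports(claim: str, context: str) -> bool:
    """Stand-in adherence check: a claim counts as supported if all of its words
    appear in the retrieved context. A trained evaluator would replace this."""
    return all(word in context.lower() for word in claim.lower().split())

context = ("MacBook Pro (M2): high-resolution Liquid Retina XDR display, "
           "powerful 10-core GPU, optimized software for creative workloads.")
# Claim chunks as described above; a production system would extract these automatically.
claims = ["high-resolution display", "powerful GPU", "optimized software"]

results = {claim: supports(claim, context) for claim in claims}
adherence = sum(results.values()) / len(results)  # fraction of supported claim chunks
print(results, f"adherence={adherence:.2f}")
```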
Reference:
- Nguyen, T., et al. (2016). "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." In Proceedings of the NIPS 2016 Competition Track. [MS MARCO NIPS 2016](https://arxiv.org/abs/1611.09268)
5. ChatRAG-Bench
Scenario: Academic Research Assistance
Suppose a researcher asks, "What are the most recent findings in AI ethics?" The LLM retrieves and synthesizes information from multiple academic papers:
- Response: "Recent findings highlight concerns about bias in AI algorithms, the need for transparency, and the importance of ethical guidelines in AI development."
- ChatRAG-Bench Evaluation:
- The response is evaluated against long and short academic documents in the dataset.
- The benchmark assesses how well the LLM can synthesize complex information and identify key points relevant to AI ethics.
Dataset Performance: The LLM might excel in summarizing structured data from short papers but struggle with more complex, long-form documents. This feedback helps improve the LLM's ability to handle complex academic research tasks.
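Benchmarks of this kind typically score generated answers against reference answers with token-level overlap metrics such as F1. The sketch below shows that calculation with an invented reference answer; it is not ChatRAG-Bench's official scoring script.

```python
# Token-level F1 between a generated answer and a reference answer, the kind of
# overlap metric commonly reported by QA-style RAG benchmarks (a sketch, not the
# official ChatRAG-Bench scorer).
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and strip basic punctuation before tokenizing."""
    return text.lower().replace(",", " ").replace(".", " ").split()

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = ("Recent work highlights algorithmic bias, the need for transparency, "
             "and ethical guidelines for AI development.")
prediction = ("Recent findings highlight concerns about bias in AI algorithms, the need for "
              "transparency, and the importance of ethical guidelines in AI development.")
print(f"token F1: {token_f1(prediction, reference):.2f}")
```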
Reference:
- Thakur, N., et al. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Datasets and Benchmarks Track. [BEIR (NeurIPS 2021)](https://arxiv.org/abs/2104.08663)