Use-case based evaluation of LLMs
Fig: Enterprise LLM use-cases evaluation strategy

Introduction

We are at a critical juncture in the Generative AI adoption journey, where we have started hearing conflicting views regarding the transformative potential of Gen AI.

Large Language Model (LLM) providers, e.g., OpenAI, Mistral, Google, Meta, etc., are rolling out one LLM after another, with every iteration smaller and more efficient than the previous one. But these are generic pre-trained LLMs without a clear business use-case in mind; the business-specific use-cases still need to be developed on top of these foundational LLMs. So these LLMs are only an enabler, not a measure of business impact by any means. We do, of course, have the hyperscalers and technology vendors touting the hundreds (or thousands) of LLM-based use-cases that they have already implemented with quantified business value.

On the other hand, we are seeing enterprises / experts start to take a more “pessimistic” view of Gen AI. For example, the recent report by Goldman Sachs is a case in point. The title, Gen AI: Too Much Spend, Too Little Benefit?, is self-explanatory and I won’t go into details; suffice it to say that while nobody is dismissing the future potential of Gen AI, they are not seeing Gen AI (as of now) solve any complex strategic business problems.

One of the problems here is clearly that there is a lot of exploration / PoCs happening, without the PoCs moving into production. According to some studies (e.g., Forbes, Everest), the percentage of Gen AI PoCs failing is as high as 80%–90%. TruEra also highlighted this aspect in a recent study, where they surmised that "only 11% of enterprises had moved more than 25% of their GenAI initiatives into production." They noted the need for continuous and programmatic LLM evaluation (and LLM observability) as enterprises seek to move more of their LLM use-cases into production.

We argue that one of the key reasons for this failure is the lack of a comprehensive LLM evaluation strategy for the PoCs, with targeted success metrics specific to the use-cases.

The situation seems very similar to that of the seminal MLOps paper Hidden Technical Debt in Machine Learning Systems, where researchers highlighted that training ML models forms only a small part of the overall ML training-to-deployment lifecycle. In the same way, assessing the capabilities of foundational LLMs is only a small part of use-case specific LLM evaluation for enterprise use-cases.

Fig: Enterprise use-case vs. foundational LLM evaluation

In this article, we take the first steps towards defining a comprehensive LLM evaluation strategy focused on enterprise use-cases. It is a multi-faceted problem, requiring the design of use-case specific validation tests covering both functional and non-functional metrics, taking into account the underlying LLM, solution architecture (RAG, fine-tuning), applicable regulations, and enterprise Responsible AI guidelines / policies.

LLM Evaluation Strategy

A comprehensive LLM evaluation strategy is key to moving the developed solution from PoC to Production. It consists of the following four overlapping (and sometimes conflicting) evaluation criteria:

  • Response accuracy and relevance
  • User experience: improving user satisfaction
  • Cost containment and Energy efficiency
  • Adherence to Responsible AI guidelines and Regulatory compliance
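To make these criteria actionable, they can be tracked per use-case in a simple scorecard. The sketch below is purely illustrative: the `EvalScorecard` class, its fields, and the weights are assumptions, not a standard API.

```python
from dataclasses import dataclass

@dataclass
class EvalScorecard:
    """Per-use-case scorecard over the four evaluation criteria (scores in [0, 1])."""
    accuracy: float          # response accuracy and relevance
    user_experience: float   # e.g., derived from user satisfaction surveys
    cost_efficiency: float   # adherence to cost / energy budgets
    compliance: float        # Responsible AI / regulatory checks passed

    def weighted_score(self, weights=(0.4, 0.2, 0.2, 0.2)) -> float:
        # Weights are hypothetical; each use-case would tune its own trade-offs.
        parts = (self.accuracy, self.user_experience,
                 self.cost_efficiency, self.compliance)
        return sum(w * p for w, p in zip(weights, parts))

card = EvalScorecard(accuracy=0.9, user_experience=0.8,
                     cost_efficiency=0.7, compliance=1.0)
print(round(card.weighted_score(), 2))  # 0.86
```

The weighting makes the "overlapping and conflicting" nature explicit: raising one criterion (e.g., accuracy via a larger model) may lower another (cost efficiency).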

Fig: LLM evaluation criteria

(Current) LLM Evaluation Methodologies

There are primarily 3 types of LLM evaluation methodologies prevalent today:

  • Generic benchmarks and datasets
  • LLM-as-a-Judge
  • Manual evaluation

Let us first consider publicly available LLM leaderboards, e.g., the Hugging Face Open LLM Leaderboard. While useful, they primarily focus on testing pre-trained LLMs on generic NLP tasks (e.g., Q&A, reasoning, sentence completion) using public datasets, e.g.:

  • SQuAD 2.0: Q&A
  • AlpacaEval: Instruction following
  • GLUE: Natural Language Understanding (NLU) tasks
  • MMLU: Multi-task Language Understanding
  • DecodingTrust: Responsible AI dimensions; the framework underlying Hugging Face’s LLM Safety Leaderboard

The key limitation here is that these leaderboards focus on assessing foundational (pre-trained) LLMs on generic NLP tasks. Enterprise use-case contextualization entails further adapting the pre-trained LLMs to enterprise data via RAG or fine-tuning. Hence, these generic benchmarking results are insufficient and cannot be applied as-is to perform use-case specific LLM evaluation.
Fig: Enterprise LLM contextualization

The LLM-as-a-Judge method uses an “evaluation” LLM (another pre-trained LLM) to evaluate the quality of the target LLM’s responses, scoring them using methods like LangChain’s CriteriaEvalChain. Unfortunately, the use-case specific limitations persist in this case as well. It does have the advantage of accelerating the LLM evaluation process, though (in most cases) at a higher cost, given the use of a second LLM.
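As an illustration, a minimal LLM-as-a-Judge loop can be sketched as below. The rubric prompt, the `judge_response` helper, and the stubbed judge are all hypothetical; in practice the stub would be replaced with a call to a real evaluation LLM.

```python
import re

RUBRIC = """You are an impartial evaluator. Rate the ANSWER to the QUESTION
on a 1-5 scale for factual correctness. Reply exactly as 'Score: <n>'.

QUESTION: {question}
ANSWER: {answer}"""

def judge_response(question: str, answer: str, judge_llm) -> int:
    """Ask an evaluation LLM to grade a target LLM's answer; parse the score."""
    reply = judge_llm(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-5])", reply)
    if not match:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1))

# Stub standing in for a real judge model (e.g., a GPT-4-class API call).
def stub_judge(prompt: str) -> str:
    return "Score: 4"

print(judge_response("What is the capital of France?", "Paris", stub_judge))  # 4
```

Note that the scoring rubric itself must be use-case specific; a generic "correctness" rubric inherits the same limitations as generic benchmarks.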

The last alternative is (outsourced) manual validation with the help of (enterprise) SMEs. While it can work as a fallback option, it has (high) cost and effort implications, needs to be planned taking SME availability into account, and must be performed in a standardized manner to mitigate human bias.

Enterprise Use-case specific LLM Evaluation Strategy

In this section, we focus on the use-case specific LLM evaluation strategy. The point is that if the enterprise use-case is related to Finance, Legal, HR, etc., we need to design an evaluation strategy taking into account the underlying domain data, (sub-)topics, user queries, performance metrics, regulatory requirements, etc. of the respective use-case.

For example, in a Contact Center context (one of the areas with the highest Gen AI adoption today),

  • summarization use-cases can vary widely, from condensing customer complaints, to outlining the outcomes of sales calls, to extracting the values of subjects mentioned in the call.
  • Call Center transcripts also suffer from incomplete calls and conversations spanning multiple topics. In calls discussing multiple subjects, LLMs may unintentionally leave out important information, impacting the completeness of the summary.
  • From a conversation perspective, summarizing a technical support call requires a different understanding and focus, as compared to summarizing a product inquiry call.

Given this, there is a need to design a Contact Center use-case specific LLM evaluation strategy taking into account the semantic context and distribution of the generated responses.
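One practical consequence is that evaluation scores should be aggregated per call type rather than as a single global average, so that a strong overall number cannot hide a weak category (e.g., technical support summaries). A minimal sketch, where the record schema and scores are made up for illustration:

```python
from collections import defaultdict

def per_call_type_scores(eval_records):
    """Aggregate summary-quality scores separately for each call type,
    so a high overall average cannot hide a weak category."""
    buckets = defaultdict(list)
    for rec in eval_records:
        buckets[rec["call_type"]].append(rec["score"])
    return {ct: round(sum(s) / len(s), 2) for ct, s in buckets.items()}

records = [
    {"call_type": "tech_support",    "score": 0.6},
    {"call_type": "tech_support",    "score": 0.7},
    {"call_type": "product_inquiry", "score": 0.9},
]
print(per_call_type_scores(records))
# {'tech_support': 0.65, 'product_inquiry': 0.9}
```

The same stratification applies to other slices mentioned above, e.g., complete vs. incomplete calls, or single-topic vs. multi-topic conversations.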

We start with a focussed discussion on use-case specific evaluation of LLM Accuracy.

LLM use-case accuracy (with respect to Hallucination)

LLM hallucination has (rightfully) been one of the most significant roadblocks to LLM productization. Many statistical metrics have been proposed to quantify the model accuracy:

  • Perplexity: This quantifies how well a model predicts a sample of text; the lower the score, the better the model.
  • BLEU (Bilingual Evaluation Understudy): BLEU is a metric commonly used in machine translation tasks. It compares the generated output with one or more reference translations and measures the similarity between them. The higher the score, the better the model.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a set of metrics used for evaluating the quality of summaries. It compares the generated summary with one or more reference summaries and calculates precision, recall, and F1-score. The higher the F1-score, the better the model.
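To make the metrics concrete, here is a toy, self-contained sketch of perplexity (from per-token log-probabilities), clipped unigram precision (the core of BLEU-1, with the brevity penalty omitted), and ROUGE-1 F1. These are simplified illustrations, not the full reference implementations.

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: exp of the negative mean."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the core of BLEU-1 (brevity penalty omitted)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

def rouge1_f1(candidate, reference):
    """ROUGE-1: unigram precision/recall against the reference, combined as F1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    return 2 * p * r / (p + r) if p + r else 0.0

reference = "the campaign increased chocolate sales"
candidate = "the campaign increased sales"
print(round(unigram_precision(candidate, reference), 2))  # 1.0
print(round(rouge1_f1(candidate, reference), 2))          # 0.89
print(round(perplexity([-0.1, -0.2, -0.3]), 2))           # 1.22
```

Note that BLEU and ROUGE both presuppose reference texts, which is exactly why they transfer poorly to open-ended enterprise use-cases without a curated evaluation set.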

Use-case

Let us consider an AI product that summarizes the user response to a social media campaign, based on the input provided by the Marketing manager.

User Input: Can you please provide the insights where we can do better for the upcoming campaign, especially, for chocolates?

LLM Output: For your chocolate product campaign, highlight its unique flavor profiles through visually appealing and mouth-watering imagery on social media platforms. Consider partnering with popular food bloggers or influencers to create enticing content and reach a wider audience. Offer limited-time promotions or exclusive discounts to create a sense of urgency and drive sales.

For the above example, measured against a suitable reference response, Perplexity will be low, and BLEU and ROUGE will be high.

Correctness vs. Groundedness

Broadly, the LLM use-case accuracy can be measured in terms of:

  • Correctness: refers to the factual accuracy of the LLM’s response
  • Groundedness: refers to the relationship between the LLM’s response and its underlying knowledge base (KB).

Studies have shown how a response can be correct, but still improperly grounded. This might happen when the retrieval results are irrelevant, yet the solution somehow manages to produce the correct answer, falsely asserting that an unrelated document supports its conclusion.
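A crude groundedness check can be sketched as below, flagging responses whose tokens are not sufficiently covered by any retrieved document. The token-overlap heuristic and the 0.5 threshold are illustrative assumptions; production systems typically use entailment models or judge LLMs instead.

```python
def token_overlap(response, doc):
    """Fraction of the response's (lowercased) tokens that appear in the document."""
    resp_tokens = set(response.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(resp_tokens & doc_tokens) / max(len(resp_tokens), 1)

def is_grounded(response, retrieved_docs, threshold=0.5):
    """A response counts as grounded if some retrieved document covers
    at least `threshold` of its tokens. Threshold is a made-up default."""
    return any(token_overlap(response, doc) >= threshold for doc in retrieved_docs)

docs = ["refund requests are processed within 14 days of purchase"]
print(is_grounded("refund requests are processed within 14 days", docs))  # True
print(is_grounded("refunds are instant for all customers", docs))         # False
```

The second response illustrates the correctness/groundedness split: even if it happened to be factually true, it would still fail the groundedness check, since no retrieved document supports it.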

Let’s face it: Hallucinations are an inherent characteristic of any text generation system. Without generation (and by that logic hallucination), an LLM will simply become a retrieval system without the ability to generate any ‘new’ text.

Limitations of RAG with respect to resolving LLM Hallucinations

RAG has been widely promoted as a solution to hallucinations. Unfortunately, while RAG can limit hallucinations, it cannot eliminate them entirely.

In a recent study, Manning et al. highlighted the limitations of RAG for legal use-cases. In a legal setting, there are primarily three ways a model can hallucinate:

  • it can be unfaithful to its training data,
  • unfaithful to its prompt input,
  • or unfaithful to the true facts of the world.

They focus on factual hallucination and highlight several retrieval challenges specific to the legal domain, e.g.:

  • Legal queries often do not have a single, clear-cut answer; the response is spread over multiple documents across time and location.
  • Document relevance in the legal context is not based on text similarity alone. In different jurisdictions and in different time periods, the applicable rule or the relevant jurisprudence may differ.

In short, they show that while RAG can help in reducing the hallucinations of state-of-the-art pre-trained GPT models, the resulting systems still hallucinate between 17% and 33% of the time.
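Measuring such a hallucination rate on your own use-case requires an annotated evaluation set (e.g., responses flagged by SMEs or a judge LLM); the computation itself is straightforward. The record schema below is a hypothetical example.

```python
def hallucination_rate(eval_set):
    """Fraction of responses annotated as hallucinated in an evaluation set."""
    flagged = sum(1 for rec in eval_set if rec["hallucinated"])
    return flagged / len(eval_set)

# Hypothetical annotated evaluation records.
annotated = [
    {"query": "q1", "hallucinated": False},
    {"query": "q2", "hallucinated": True},
    {"query": "q3", "hallucinated": False},
    {"query": "q4", "hallucinated": False},
]
print(hallucination_rate(annotated))  # 0.25
```

Tracking this rate continuously, rather than once at PoC sign-off, is what turns it from a benchmark number into an observability metric.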

Conclusion

In this article, we showed that a use-case based evaluation of LLMs is critical to LLM productionization in enterprises. We summarized current LLM evaluation strategies that primarily focus on benchmarking pre-trained LLMs on generic NLP tasks. We then outlined a comprehensive evaluation strategy that builds on this foundational LLM evaluation, taking into account the data and conversation related requirements / distribution of the underlying use-cases.

We primarily focused on LLM accuracy evaluation with respect to hallucinations in this article. The plan is to extend this into a series of articles covering other LLM evaluation dimensions in the future, starting with responsible AI metrics, e.g., toxicity, fairness, privacy.
