Use-case based evaluation of LLMs
Fig: Enterprise LLM use-cases evaluation strategy

Introduction

We are at a critical juncture in the Generative AI adoption journey, where we have started hearing conflicting views regarding the transformative potential of Gen AI.

Large Language Model (LLM) providers, e.g., OpenAI, Mistral, Google, Meta, etc., are rolling out one LLM after another, with every iteration smaller and more efficient than the previous one. But these are generic pre-trained LLMs without a clear business use-case in mind; the business-specific use-cases still need to be developed on top of these foundational LLMs. So these LLMs are only an enabler, not a measure of business impact by any means. We do, of course, have the hyperscalers and technology vendors touting the hundreds (or thousands) of LLM-based use-cases that they have already implemented with quantified business value.

On the other hand, we are seeing enterprises / experts start to take a more “pessimistic” view of Gen AI. For example, the recent report by Goldman Sachs is a case in point. The title, Gen AI: Too Much Spend, Too Little Benefit?, is self-explanatory and I won’t go into details; suffice it to say that while nobody is dismissing the future potential of Gen AI, they are not seeing Gen AI (as of now) solve any complex strategic business problems.

One of the problems here is clearly that there is a lot of exploration / PoCs happening, without the PoCs moving into production. According to some studies (e.g., Forbes, Everest), the percentage of Gen AI PoCs failing is as high as 80%–90%. TruEra also highlighted this aspect in a recent study, where they surmised that "only 11% of enterprises had moved more than 25% of their GenAI initiatives into production." They noted the need for continuous and programmatic LLM evaluation (and LLM observability) as enterprises seek to move more of their LLM use-cases into production.

We argue that one of the key reasons for this failure is the lack of a comprehensive LLM evaluation strategy for the PoCs, with targeted success metrics specific to the use-cases.

The situation seems very similar to that of the seminal MLOps paper Hidden Technical Debt in Machine Learning Systems, where researchers highlighted that training ML models forms only a small part of the overall ML training-to-deployment lifecycle. In the same way, assessing the capabilities of foundational LLMs is only a small part of use-case specific LLM evaluation for enterprise use-cases.

Fig: Enterprise use-case vs. foundational LLM evaluation

In this article, we take the first steps towards defining a comprehensive LLM evaluation strategy focused on enterprise use-cases. It is a multi-faceted problem, requiring the design of use-case specific validation tests covering both functional and non-functional metrics, taking into account the underlying LLM, solution architecture (RAG, fine-tuning), applicable regulations, and enterprise Responsible AI guidelines / policies.

LLM Evaluation Strategy

A comprehensive LLM evaluation strategy is key to moving the developed solution from PoC to Production. It consists of the following four overlapping (and sometimes conflicting) evaluation criteria:

  • Response accuracy and relevance
  • User experience: improving user satisfaction
  • Cost containment and Energy efficiency
  • Adherence to Responsible AI guidelines and Regulatory compliance
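To make these criteria actionable, they can be tracked per use-case in a simple scorecard. The sketch below is purely illustrative: the `EvalScorecard` class, its fields, and the weights are assumptions, not a standard API.

```python
from dataclasses import dataclass

@dataclass
class EvalScorecard:
    """Per-use-case scorecard over the four evaluation criteria (scores in [0, 1])."""
    accuracy: float          # response accuracy and relevance
    user_experience: float   # e.g., derived from user satisfaction surveys
    cost_efficiency: float   # adherence to cost / energy budgets
    compliance: float        # Responsible AI / regulatory checks passed

    def weighted_score(self, weights=(0.4, 0.2, 0.2, 0.2)) -> float:
        # Weights are hypothetical; each use-case would tune its own trade-offs.
        parts = (self.accuracy, self.user_experience,
                 self.cost_efficiency, self.compliance)
        return sum(w * p for w, p in zip(weights, parts))

card = EvalScorecard(accuracy=0.9, user_experience=0.8,
                     cost_efficiency=0.7, compliance=1.0)
print(round(card.weighted_score(), 2))  # 0.86
```

The weighting makes the "overlapping and conflicting" nature explicit: raising one criterion (e.g., accuracy via a larger model) may lower another (cost efficiency).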

Fig: LLM evaluation criteria

(Current) LLM Evaluation Methodologies

There are primarily 3 types of LLM evaluation methodologies prevalent today:

  • Generic benchmarks and datasets
  • LLM-as-a-Judge
  • Manual evaluation

Let us first consider publicly available LLM leaderboards, e.g., the Hugging Face Open LLM Leaderboard. While useful, they primarily focus on testing pre-trained LLMs on generic NLP tasks (e.g., Q&A, reasoning, sentence completion) using public datasets, e.g.:

  • SQuAD 2.0: Q&A
  • AlpacaEval: Instruction following
  • GLUE: Natural Language Understanding (NLU) tasks
  • MMLU: Multi-task Language Understanding
  • DecodingTrust: Responsible AI dimensions; the framework underlying Hugging Face’s LLM Safety Leaderboard

The key limitation here is that these leaderboards focus on assessing foundational (pre-trained) LLMs on generic NLP tasks. Enterprise use-case contextualization entails further adapting the pre-trained LLMs to enterprise data via RAG or fine-tuning. Hence, these generic benchmarking results are insufficient and cannot be applied as-is to perform use-case specific LLM evaluation.
Fig: Enterprise LLM contextualization

The LLM-as-a-Judge method uses an “evaluation” LLM (another pre-trained LLM) to evaluate the quality of the target LLM’s responses, scoring them using methods like LangChain’s CriteriaEvalChain. Unfortunately, the use-case specific limitations persist in this case as well. It does have the advantage of accelerating the LLM evaluation process, though (in most cases) at a higher cost, given the use of a second LLM.
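As an illustration, a minimal LLM-as-a-Judge loop can be sketched as below. The rubric prompt, the `judge_response` helper, and the stubbed judge are all hypothetical; in practice the stub would be replaced with a call to a real evaluation LLM.

```python
import re

RUBRIC = """You are an impartial evaluator. Rate the ANSWER to the QUESTION
on a 1-5 scale for factual correctness. Reply exactly as 'Score: <n>'.

QUESTION: {question}
ANSWER: {answer}"""

def judge_response(question: str, answer: str, judge_llm) -> int:
    """Ask an evaluation LLM to grade a target LLM's answer; parse the score."""
    reply = judge_llm(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-5])", reply)
    if not match:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1))

# Stub standing in for a real judge model (e.g., a GPT-4-class API call).
def stub_judge(prompt: str) -> str:
    return "Score: 4"

print(judge_response("What is the capital of France?", "Paris", stub_judge))  # 4
```

Note that the scoring rubric itself must be use-case specific; a generic "correctness" rubric inherits the same limitations as generic benchmarks.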

The last alternative is (outsourced) manual validation with the help of (enterprise) SMEs. While it can work as a fallback option, it has (high) cost and effort implications, needs to be planned taking SME availability into account, and must be performed in a standardized manner to mitigate human bias.

Enterprise Use-case specific LLM Evaluation Strategy

In this section, we focus on the use-case specific LLM evaluation strategy. The point is that if the enterprise use-case is related to Finance, Legal, HR, etc., we need to design an evaluation strategy taking into account the underlying domain data, (sub-)topics, user queries, performance metrics, regulatory requirements, etc. of the respective use-case.

For example, in a Contact Center context (one of the areas with the highest Gen AI adoption today),

  • summarization use-cases can vary widely, from condensing customer complaints, to outlining the outcomes of sales calls, to extracting the values of subjects mentioned in the call.
  • Call Center transcripts also suffer from incomplete calls and conversations spanning multiple topics. In calls discussing multiple subjects, LLMs may unintentionally leave out important information, impacting the completeness of the summary.
  • From a conversation perspective, summarizing a technical support call requires a different understanding and focus, as compared to summarizing a product inquiry call.

Given this, there is a need to design a Contact Center use-case specific LLM evaluation strategy taking into account the semantic context and distribution of the generated responses.
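One practical consequence is that evaluation scores should be aggregated per call type rather than as a single global average, so that a strong overall number cannot hide a weak category (e.g., technical support summaries). A minimal sketch, where the record schema and scores are made up for illustration:

```python
from collections import defaultdict

def per_call_type_scores(eval_records):
    """Aggregate summary-quality scores separately for each call type,
    so a high overall average cannot hide a weak category."""
    buckets = defaultdict(list)
    for rec in eval_records:
        buckets[rec["call_type"]].append(rec["score"])
    return {ct: round(sum(s) / len(s), 2) for ct, s in buckets.items()}

records = [
    {"call_type": "tech_support",    "score": 0.6},
    {"call_type": "tech_support",    "score": 0.7},
    {"call_type": "product_inquiry", "score": 0.9},
]
print(per_call_type_scores(records))
# {'tech_support': 0.65, 'product_inquiry': 0.9}
```

The same stratification applies to other slices mentioned above, e.g., complete vs. incomplete calls, or single-topic vs. multi-topic conversations.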

We start with a focussed discussion on use-case specific evaluation of LLM Accuracy.

LLM use-case accuracy (with respect to Hallucination)

LLM hallucination has (rightfully) been one of the most significant roadblocks to LLM productization. Many statistical metrics have been proposed to quantify the model accuracy:

  • Perplexity: This quantifies how well a model predicts a sample of text; the lower the score, the better the model.
  • BLEU (Bilingual Evaluation Understudy): BLEU is a metric commonly used in machine translation tasks. It compares the generated output with one or more reference translations and measures the similarity between them. The higher the score, the better the model.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a set of metrics used for evaluating the quality of summaries. It compares the generated summary with one or more reference summaries and calculates precision, recall, and F1-score. The higher the F1-score, the better the model.
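To make the metrics concrete, here is a toy, self-contained sketch of perplexity (from per-token log-probabilities), clipped unigram precision (the core of BLEU-1, with the brevity penalty omitted), and ROUGE-1 F1. These are simplified illustrations, not the full reference implementations.

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: exp of the negative mean."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the core of BLEU-1 (brevity penalty omitted)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

def rouge1_f1(candidate, reference):
    """ROUGE-1: unigram precision/recall against the reference, combined as F1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    return 2 * p * r / (p + r) if p + r else 0.0

reference = "the campaign increased chocolate sales"
candidate = "the campaign increased sales"
print(round(unigram_precision(candidate, reference), 2))  # 1.0
print(round(rouge1_f1(candidate, reference), 2))          # 0.89
print(round(perplexity([-0.1, -0.2, -0.3]), 2))           # 1.22
```

Note that BLEU and ROUGE both presuppose reference texts, which is exactly why they transfer poorly to open-ended enterprise use-cases without a curated evaluation set.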

Use-case

Let us consider an AI product that summarizes the user response to a social media campaign, based on the input provided by the Marketing manager.

User Input: Can you please provide the insights where we can do better for the upcoming campaign, especially, for chocolates?

LLM Output: For your chocolate product campaign, highlight its unique flavor profiles through visually appealing and mouth-watering imagery on social media platforms. Consider partnering with popular food bloggers or influencers to create enticing content and reach a wider audience. Offer limited-time promotions or exclusive discounts to create a sense of urgency and drive sales.

For the above example, measured against a suitable reference response, Perplexity will be low, and BLEU and ROUGE will be high.

Correctness vs. Groundedness

Broadly, the LLM use-case accuracy can be measured in terms of:

  • Correctness: refers to the factual accuracy of the LLM’s response
  • Groundedness: refers to the relationship between the LLM’s response and its underlying knowledge base (KB).

Studies have shown how a response can be correct, but still improperly grounded. This might happen when the retrieval results are irrelevant, yet the solution somehow manages to produce the correct answer, falsely asserting that an unrelated document supports its conclusion.
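A crude groundedness check can be sketched as below, flagging responses whose tokens are not sufficiently covered by any retrieved document. The token-overlap heuristic and the 0.5 threshold are illustrative assumptions; production systems typically use entailment models or judge LLMs instead.

```python
def token_overlap(response, doc):
    """Fraction of the response's (lowercased) tokens that appear in the document."""
    resp_tokens = set(response.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(resp_tokens & doc_tokens) / max(len(resp_tokens), 1)

def is_grounded(response, retrieved_docs, threshold=0.5):
    """A response counts as grounded if some retrieved document covers
    at least `threshold` of its tokens. Threshold is a made-up default."""
    return any(token_overlap(response, doc) >= threshold for doc in retrieved_docs)

docs = ["refund requests are processed within 14 days of purchase"]
print(is_grounded("refund requests are processed within 14 days", docs))  # True
print(is_grounded("refunds are instant for all customers", docs))         # False
```

The second response illustrates the correctness/groundedness split: even if it happened to be factually true, it would still fail the groundedness check, since no retrieved document supports it.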

Let’s face it: Hallucinations are an inherent characteristic of any text generation system. Without generation (and by that logic hallucination), an LLM will simply become a retrieval system without the ability to generate any ‘new’ text.

Limitations of RAG with respect to resolving LLM Hallucinations

RAG has been widely promoted as a solution to hallucinations. Unfortunately, while RAG can limit hallucinations, it cannot eliminate them entirely.

In a recent study, Manning et al. highlighted the limitations of RAG for legal use-cases. In a legal setting, there are primarily three ways a model can hallucinate:

  • it can be unfaithful to its training data,
  • unfaithful to its prompt input,
  • or unfaithful to the true facts of the world.

They focus on factual hallucination and highlight several retrieval challenges specific to the legal domain, e.g.:

  • Legal queries often do not have a single, clear-cut answer; the response is spread over multiple documents across time and location.
  • Document relevance in the legal context is not based on text similarity alone. In different jurisdictions and in different time periods, the applicable rule or the relevant jurisprudence may differ.

In short, they show that while RAG can help in reducing the hallucinations of state-of-the-art pre-trained GPT models, the resulting systems still hallucinate between 17% and 33% of the time.
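Measuring such a hallucination rate on your own use-case requires an annotated evaluation set (e.g., responses flagged by SMEs or a judge LLM); the computation itself is straightforward. The record schema below is a hypothetical example.

```python
def hallucination_rate(eval_set):
    """Fraction of responses annotated as hallucinated in an evaluation set."""
    flagged = sum(1 for rec in eval_set if rec["hallucinated"])
    return flagged / len(eval_set)

# Hypothetical annotated evaluation records.
annotated = [
    {"query": "q1", "hallucinated": False},
    {"query": "q2", "hallucinated": True},
    {"query": "q3", "hallucinated": False},
    {"query": "q4", "hallucinated": False},
]
print(hallucination_rate(annotated))  # 0.25
```

Tracking this rate continuously, rather than once at PoC sign-off, is what turns it from a benchmark number into an observability metric.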

Conclusion

In this article, we showed that a use-case based evaluation of LLMs is critical to LLM productionization in enterprises. We summarized current LLM evaluation strategies that primarily focus on benchmarking pre-trained LLMs on generic NLP tasks. We then outlined a comprehensive evaluation strategy that builds on this foundational LLM evaluation, taking into account the data and conversation related requirements / distribution of the underlying use-cases.

We primarily focused on LLM accuracy evaluation with respect to hallucinations in this article. The plan is to extend this into a series of articles covering other LLM evaluation dimensions in the future, starting with responsible AI metrics, e.g., toxicity, fairness, privacy.
