Use-case based evaluation of LLMs
Debmalya Biswas
AI/Analytics @ Wipro | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA
Introduction
We are at a critical juncture in the Generative AI adoption journey, where we have started hearing conflicting views regarding the transformative potential of Gen AI.
Large Language Model (LLM) providers, e.g., OpenAI, Mistral, Google, and Meta, are rolling out one LLM after another, with every iteration smaller and more efficient than the previous one. But these are generic pre-trained LLMs without a clear business use-case in mind; the business-specific use-cases still need to be developed on top of these foundational LLMs. So the LLMs themselves are only an enabler and not a measure of business impact by any means. We do, of course, have the hyperscalers and technology vendors touting the hundreds (or thousands) of LLM-based use-cases that they have already implemented with quantified business value.
On the other hand, we are seeing enterprises and experts start to take a more “pessimistic” view of Gen AI. The recent report by Goldman Sachs, Gen AI: Too Much Spend, Too Little Benefit?, is a case in point. The title is self-explanatory and I won’t go into details; suffice it to say that while nobody is dismissing the future potential of Gen AI, they are not seeing Gen AI (as of now) solve any complex, strategic business problems.
One of the problems here is clearly that there is a lot of exploration / PoCs happening, without the PoCs moving into Production. According to some studies (e.g., Forbes, Everest), the percentage of Gen AI PoCs failing is as high as 80–90%. TruEra also highlighted this aspect in a recent study, where they surmised that "Only 11% of enterprises had moved more than 25% of their GenAI initiatives into production." They noted the need for continuous and programmatic LLM evaluation (and LLM Observability) as enterprises seek to move more of their LLM use-cases into production.
We argue that one of the key reasons for this failure is the lack of a comprehensive LLM evaluation strategy for the PoCs, with targeted success metrics specific to the use-cases.
The situation seems very similar to that of the seminal MLOps paper Hidden Technical Debt in Machine Learning Systems, where the researchers highlighted that training ML models forms only a small part of the overall ML training-to-deployment lifecycle. In the same way, assessing the capabilities of the foundational LLMs is only a small part of performing use-case specific LLM evaluation for enterprise use-cases.
In this article, we take the first steps towards defining a comprehensive LLM evaluation strategy focused on enterprise use-cases. It is a multi-faceted problem that requires designing use-case specific validation tests covering both functional and non-functional metrics, taking into account the underlying LLM, the solution architecture (RAG, fine-tuning), applicable regulations, and enterprise Responsible AI guidelines / policies.
LLM Evaluation Strategy
A comprehensive LLM evaluation strategy is key to moving the developed solution from PoC to Production. It consists of the following four overlapping (and sometimes conflicting) evaluation criteria:
(Current) LLM Evaluation Methodologies
There are primarily 3 types of LLM evaluation methodologies prevalent today:
Let us first consider publicly available LLM leaderboards, e.g., the Hugging Face Open LLM Leaderboard. While useful, they primarily focus on testing pre-trained LLMs on generic NLP tasks (e.g., Q&A, reasoning, sentence completion) using public benchmark datasets.
The key limitation here is that these leaderboards focus on assessing foundational (pre-trained) LLMs on generic NLP tasks. Enterprise use-case contextualization entails further adapting the pre-trained LLMs to enterprise data via RAG or fine-tuning. Hence, these generic benchmarking results cannot be applied as-is (they are insufficient) to perform use-case specific LLM evaluation.
The LLM-as-a-Judge method uses an “evaluation” LLM (another pre-trained LLM) to evaluate the quality of the target LLM’s responses, scoring them using methods like LangChain’s CriteriaEvalChain. It has the advantage of accelerating the LLM evaluation process, though (in most cases) at a higher cost given the use of a second LLM. Unfortunately, the use-case specific limitations persist here as well.
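To make this concrete, below is a minimal LLM-as-a-judge sketch using the OpenAI Python client. The judge model name (gpt-4o-mini), the evaluation criteria, and the 1–5 scoring scale are illustrative assumptions, not a prescribed setup; the same pattern applies to LangChain's evaluators or any other judge implementation.

```python
# Minimal LLM-as-a-judge sketch (model name, criteria and scale are illustrative assumptions).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Candidate answer: {answer}
Rate the answer on a scale of 1-5 for each criterion: relevance, correctness, conciseness.
Return the three scores as comma-separated integers only."""

def judge(question: str, answer: str, judge_model: str = "gpt-4o-mini") -> list[int]:
    """Ask a second ('judge') LLM to score the target LLM's answer."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic scoring, so evaluation runs are comparable
    )
    return [int(s) for s in response.choices[0].message.content.split(",")]

# Example usage:
# scores = judge("What is our refund policy?", "Refunds are issued within 30 days of purchase.")
```

The key design choice is a fixed, low-temperature judge prompt so that scores remain reproducible across evaluation runs.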
The last alternative is (outsourced) manual validation with the help of (enterprise) SMEs. While it can work as a fallback option, it has (high) cost and effort implications, needs to be planned taking into account SME availability, and must be performed in a standardized manner to account for human bias.
Enterprise Use-case specific LLM Evaluation Strategy
In this section, we focus on the use-case specific LLM evaluation strategy. The point is that if the enterprise use-case is related to Finance, Legal, HR, etc., we need to design an evaluation strategy that takes into account the underlying domain data, (sub-)topics, user queries, performance metrics, and regulatory requirements of the respective use-case.
For example, in a Contact Center context (one of the areas with the highest Gen AI adoption today), user queries, intents, and expected responses follow a domain- and enterprise-specific distribution that generic benchmarks do not capture.
Given this, there is a need to design a Contact Center use-case specific LLM evaluation strategy taking into account the semantic context and distribution of the generated responses.
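As a rough sketch of what such a distribution-aware evaluation could look like, the snippet below clusters historical contact-center queries by topic using sentence embeddings, so that the evaluation set can be sampled in proportion to real traffic. The embedding model name, cluster count, and sample queries are illustrative assumptions.

```python
# Sketch: build a topic-stratified evaluation set from historical queries
# (embedding model, number of clusters and queries are illustrative assumptions).
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

queries = [
    "How do I reset my router?",
    "I want to cancel my subscription",
    "My bill looks too high this month",
    # ... in practice, thousands of real contact-center queries
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(queries)

kmeans = KMeans(n_clusters=3, random_state=0).fit(embeddings)

# Topic distribution of real traffic -> sample the eval set with the same proportions
topic_counts = Counter(kmeans.labels_)
total = len(queries)
eval_quota = {topic: round(100 * count / total) for topic, count in topic_counts.items()}
print(eval_quota)  # test queries per topic for a 100-sample evaluation set
```

Stratifying the test set this way helps ensure that frequent intents (e.g., billing queries) are not under-represented relative to long-tail topics when measuring accuracy.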
We start with a focussed discussion on use-case specific evaluation of LLM Accuracy.
LLM use-case accuracy (with respect to Hallucination)
LLM hallucination has (rightfully) been one of the most significant roadblocks to LLM productization. Many statistical metrics have been proposed to quantify model accuracy, e.g., Perplexity, BLEU, and ROUGE.
Use-case
Let us consider an AI product that provides a summary of the user response to a social media campaign, based on the input provided by the Marketing manager.
User Input: Can you please provide the insights where we can do better for the upcoming campaign, especially, for chocolates?
LLM Output: For your chocolate product campaign, highlight its unique flavor profiles through visually appealing and mouth-watering imagery on social media platforms. Consider partnering with popular food bloggers or influencers to create enticing content and reach a wider audience. Offer limited-time promotions or exclusive discounts to create a sense of urgency and drive sales.
For the above example, Perplexity will be low, and BLEU and ROUGE will be high, even though the response offers generic recommendations rather than insights grounded in the actual campaign data the user asked about.
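For reference, here is a minimal sketch of how BLEU and ROUGE could be computed with the nltk and rouge-score packages; the reference text is a hypothetical "ideal" answer invented purely for illustration.

```python
# Sketch: computing BLEU and ROUGE for a generated response against a reference
# (the reference below is a hypothetical 'ideal' answer, invented for illustration).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = ("Engagement on the chocolate campaign was strongest with visual posts; "
             "influencer partnerships and limited-time discounts drove the most conversions.")
candidate = ("For your chocolate product campaign, highlight its unique flavor profiles "
             "through visually appealing imagery and partner with popular food influencers.")

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-L: longest-common-subsequence overlap, commonly used for summaries
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)

print(f"BLEU: {bleu:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
# Note: neither metric checks whether the response is grounded in the actual campaign data.
```

As the final comment notes, such surface-overlap metrics say nothing about whether the response is grounded in the campaign data, which motivates the correctness vs. groundedness discussion below.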
Correctness vs. Groundedness
Broadly, the LLM use-case accuracy can be measured in terms of correctness (does the response answer the user's question accurately?) and groundedness (is the response actually supported by the retrieved / provided context?).
Studies have shown that a response can be correct, but still improperly grounded.
This might happen when the retrieval results are irrelevant, yet the solution somehow manages to produce the correct answer, falsely asserting that an unrelated document supports its conclusion.
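One way to operationalize this distinction is to check, for each generated answer, whether the retrieved context actually entails it. The sketch below uses an off-the-shelf NLI model from Hugging Face for this check; the model name and entailment threshold are assumptions, and an LLM-as-a-judge prompt could be used instead.

```python
# Sketch: groundedness check via natural language inference (NLI).
# The model name and entailment threshold are illustrative assumptions.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def is_grounded(retrieved_context: str, answer: str, threshold: float = 0.7) -> bool:
    """Return True if the retrieved context entails the generated answer."""
    result = nli({"text": retrieved_context, "text_pair": answer})[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold

# A correct but ungrounded answer: the context says nothing about refunds,
# yet the model may have answered the refund question correctly from its own weights.
context = "Our support team is available 24/7 via chat and phone."
answer = "Refunds are processed within 30 days."
print(is_grounded(context, answer))  # expected: False, regardless of whether the answer is correct
```

Correctness would still have to be judged separately (e.g., against SME-provided reference answers); the point is that the two checks can disagree.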
Let’s face it: hallucinations are an inherent characteristic of any text generation system. Without generation (and by that logic, hallucination), an LLM would simply become a retrieval system, without the ability to generate any ‘new’ text.
Limitations of RAGs with respect to resolving LLM Hallucinations
RAGs have been widely promoted as a solution to hallucinations. Unfortunately, while RAGs can limit hallucinations, they cannot totally eliminate them.
In a recent study, Manning et al. highlighted the limitations of RAGs for Legal use-cases, noting that in a legal setting there are primarily 3 ways a model can hallucinate.
They focus on factual hallucinations and highlight several retrieval challenges specific to the legal domain.
In short, they show that while RAGs can help reduce the hallucinations of state-of-the-art pre-trained GPT models, the evaluated systems still hallucinate between 17% and 33% of the time.
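For completeness, a hallucination rate like the ones reported above can be estimated by running correctness and groundedness checks (SME labels, the NLI check sketched earlier, or an LLM judge) over a use-case specific test set and aggregating. The record structure and label fields below are hypothetical.

```python
# Sketch: aggregate hallucination rate over a labeled RAG test set.
# The `is_correct` / `is_grounded` labels are hypothetical (SME labels, NLI, or an LLM judge).
from dataclasses import dataclass

@dataclass
class EvalRecord:
    question: str
    retrieved_context: str
    answer: str
    is_correct: bool    # does the answer state the facts accurately?
    is_grounded: bool   # is the answer actually supported by the retrieved context?

def hallucination_rate(records: list[EvalRecord]) -> float:
    """Fraction of responses that are either incorrect or ungrounded."""
    hallucinated = sum(1 for r in records if not (r.is_correct and r.is_grounded))
    return hallucinated / len(records) if records else 0.0
```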
Conclusion
In this article, we showed that a use-case based evaluation of LLMs is critical to LLM productionization in enterprises. We summarized current LLM evaluation strategies, which primarily focus on benchmarking pre-trained LLMs on generic NLP tasks. We then outlined a comprehensive evaluation strategy that builds on this foundational LLM evaluation, taking into account the data and conversation related requirements / distribution of the underlying use-cases.
We primarily focussed on LLM accuracy evaluation with respect to hallucinations in this article. The plan is to extend this into a series of articles covering the other LLM evaluation dimensions, starting with the responsible AI metrics, e.g., toxicity, fairness, and privacy.