Mastering the Art of Evaluation: Key to Success in Generative AI
In 2023, generative AI emerged as a transformative force for many businesses, demonstrating its potential for significant impact. Adoption of the technology is accelerating, with usage nearly doubling in 2024, according to a recent McKinsey survey. As companies weigh their strategic options (implement an off-the-shelf solution, customize such a solution with proprietary data, or develop one from the ground up), they face a common challenge: how to effectively evaluate a large language model (LLM) for a specific real-world application.
The Challenges of Evaluating LLM Responses
Evaluation is fundamental to any AI project, ensuring confidence in real-world performance and setting appropriate user expectations. For traditional machine learning models, metrics such as precision, recall, and accuracy are effectively measured through back-testing on historical data. However, generative AI presents unique challenges. Unlike traditional models that typically yield a single correct answer, large language models (LLMs) can produce a wide range of valid responses. The generated content is often subjective; what appears creative to one person might seem nonsensical to another. Additionally, LLMs are highly sensitive to the way they are prompted and the context provided. They can also produce "hallucinations"—responses that, while coherent and grammatically correct, are factually incorrect or misleading. Such errors can have serious consequences depending on the application's context.
Public Benchmarks and Why They Fall Short
LLM benchmark leaderboards compare model performances and are increasingly cited not only in academic research but also in popular media. Benchmarks such as MMLU, HellaSwag, and WinoGrande are standardized tests used to evaluate AI models' language understanding, reasoning and comprehension capabilities. Typically, a benchmark includes a dataset, a set of questions, and a scoring method.
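To make the "dataset, questions, and scoring method" structure concrete, here is a minimal sketch of how a multiple-choice benchmark in the style of MMLU is typically scored. The `ask_model` helper, the `BenchmarkItem` shape, and the crude letter-matching rule are illustrative assumptions, not the implementation of any particular benchmark.

```python
# Minimal sketch of scoring a multiple-choice benchmark (MMLU-style).
# `ask_model` is a hypothetical placeholder for whatever client you use
# to query the LLM under test.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str
    choices: list[str]   # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    answer: str          # gold label, e.g. "B"

def ask_model(prompt: str) -> str:
    """Placeholder: send the prompt to the model and return its raw text reply."""
    raise NotImplementedError

def score_benchmark(items: list[BenchmarkItem]) -> float:
    """Return accuracy: the fraction of items where the model picks the gold choice."""
    correct = 0
    for item in items:
        prompt = (
            f"{item.question}\n" + "\n".join(item.choices)
            + "\nAnswer with the letter of the best choice."
        )
        reply = ask_model(prompt).strip().upper()
        if reply.startswith(item.answer):   # crude letter-match scoring
            correct += 1
    return correct / len(items)
```

The scoring rule is what makes such benchmarks attractive: it is cheap, deterministic, and easy to run at scale, which is exactly why leaderboards lean on them.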
However, these public benchmarks often do not align well with the specific domain and context in which the model will ultimately be used, rendering the results less reliable. Additionally, these benchmarks are publicly available on the internet; consequently, when a model is trained on newer web crawls (such as CommonCrawl), it may inadvertently be exposed to the benchmark data during pre-training unless specific measures are taken to filter it out. Moreover, models can become overfitted to benchmarks through further fine-tuning, exacerbating this issue.
Use-Case-Specific Benchmarks
It is crucial to benchmark models with use-case-specific datasets, ensuring the following considerations are met:
- The test set must be representative of real-world use cases.
- The dataset should be sufficiently large to ensure statistically significant results and to capture variations in use case scenarios.
- The dataset must be updated periodically to prevent it from becoming stale and to mitigate the risk of data leaks and model overfitting.
- Having a golden label or answer that defines what constitutes a good response is essential for accurately rating the model's performance.
While techniques exist to generate both the test data and golden answers using a model, these can introduce biases and may not be truly representative. Therefore, it is imperative to involve human experts who understand the domain and application context in the creation of the test set.
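For illustration, here is a sketch of scoring model outputs against expert-written golden answers for a use-case-specific test set. The `TestCase` shape, the `normalize` helper, and the lenient exact-match rule are assumptions for the sketch; in practice, responses that do not match exactly are routed on to a human or judge-model evaluation rather than simply marked wrong.

```python
# Sketch: scoring model outputs against expert-written golden answers.
import re
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str          # representative real-world input
    golden_answer: str   # expert-defined "good response"
    added_on: str        # date added, so stale cases can be rotated out

def normalize(text: str) -> str:
    """Lowercase and strip punctuation for a lenient comparison."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def evaluate(cases: list[TestCase], model_outputs: list[str]) -> dict:
    """Split cases into exact matches and ones needing human/judge review."""
    matched, needs_review = 0, []
    for case, output in zip(cases, model_outputs):
        if normalize(output) == normalize(case.golden_answer):
            matched += 1
        else:
            needs_review.append(case.prompt)  # free-form answers rarely match exactly
    return {"exact_match_rate": matched / len(cases), "needs_review": needs_review}
```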
Human Evaluation vs. Judge Model Evaluation
Once a robust test set is established, model responses can be evaluated in two ways: by employing human evaluators to rate the responses or by using another model as a judge.
Best Practices for Human and Judge Model Evaluation
Human Evaluation
- Clear Guidelines: Provide evaluators with detailed criteria for rating responses.
- Diverse Evaluators: Use a diverse group of evaluators to minimize individual biases, ensuring they possess the necessary domain and application knowledge.
- Pilot Testing: Conduct pilot tests to refine evaluation criteria.
Judge Model Evaluation
- Regular Updates: Continuously train the judge model with new data.
- Bias Checks: Regularly audit and mitigate biases.
- Evaluate the Judge: Use independent datasets to assess and fine-tune the judge model itself.
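To make the judge-model option concrete, here is a minimal LLM-as-judge sketch. The `call_judge_model` placeholder, the rubric wording, and the 1-5 scale are illustrative choices, not a standard; real judge prompts should encode the same detailed criteria given to human evaluators.

```python
# Minimal LLM-as-judge sketch: grade a candidate answer against a golden answer.
JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Reference (golden) answer: {golden}
Assistant's answer: {candidate}

Rate the assistant's answer from 1 (unusable) to 5 (as good as the reference),
considering factual accuracy, completeness, and tone. Reply with the number only."""

def call_judge_model(prompt: str) -> str:
    """Placeholder: send the grading prompt to the judge LLM and return its reply."""
    raise NotImplementedError

def judge_score(question: str, golden: str, candidate: str) -> int:
    reply = call_judge_model(
        JUDGE_PROMPT.format(question=question, golden=golden, candidate=candidate)
    )
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0   # 0 signals an unparseable judgment
```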
In conclusion, human evaluation is superior for scenarios requiring subjective judgment, creativity, and nuanced understanding, while judge model evaluation is optimal for large-scale, objective, and repetitive tasks. Adopting a hybrid approach—where model checkpoints are regularly tested through automated judge evaluations, supplemented by human evaluations to spot-check results and certify performance on new models or use cases—can be highly effective.
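One way to operationalize that hybrid loop is to score every response with the judge model and route a random sample to human reviewers, certifying a checkpoint only when the two agree. The sketch below assumes a 10% sample, a 1-point tolerance, and a hypothetical `human_rate` callback; all three are arbitrary choices for illustration.

```python
# Sketch: spot-checking judge-model scores with human reviewers.
import random
from typing import Callable, Dict

def spot_check_agreement(judge_scores: Dict[str, int],
                         human_rate: Callable[[str], int],
                         sample_frac: float = 0.10,
                         max_gap: int = 1) -> float:
    """Return the fraction of sampled items where the human score is within
    `max_gap` points of the judge score; low agreement is a signal to re-audit
    the judge before trusting its scores at scale."""
    items = list(judge_scores.items())
    sample = random.sample(items, max(1, int(len(items) * sample_frac)))
    agree = sum(1 for item_id, score in sample
                if abs(human_rate(item_id) - score) <= max_gap)
    return agree / len(sample)
```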
Mastering the art of evaluation is essential for unlocking the full potential of generative AI. By understanding the challenges and implementing best practices, you can ensure your AI models consistently deliver high-quality, user-centric outputs. Embrace these strategies to stay ahead in the ever-evolving landscape of Generative AI.