Why Human Feedback Isn’t the Gold Standard for Evaluating AI Models
Introduction
As Large Language Models (LLMs) have become increasingly sophisticated, the role of human feedback in their evaluation and fine-tuning has risen to prominence. Human feedback has often been positioned as the gold standard, playing a key role in shaping model behaviours and outputs through reinforcement learning techniques such as Reinforcement Learning from Human Feedback (RLHF). While this approach has undeniable merits, it also poses significant challenges. Relying too heavily on human feedback may limit the robustness, fairness, and generalisation of LLMs, ultimately impacting their ability to provide high-quality and objective responses.
This article explores why human feedback, though valuable, is not a flawless measure for evaluating LLMs. We examine its limitations and consider alternative strategies for more holistic model evaluation.
The Emergence of Human Feedback in LLMs
Human feedback has become a widely adopted method for guiding and evaluating the performance of LLMs. Human raters are often tasked with scoring the outputs of these models based on their relevance, fluency, and adherence to prompt guidelines. When using RLHF, models are trained to optimise responses according to these preferences, improving their ability to align with human-like responses in various tasks.
A key example is the use of Reward Models in training LLMs. Reward Models are trained on human preference data and are then used to fine-tune the model to generate outputs that align with human expectations. However, recent studies have shown that optimising models against such feedback can result in "reward hacking", where models learn to exploit the preference function without genuinely improving task performance.
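To make the reward-modelling step concrete, the sketch below shows the pairwise (Bradley-Terry style) loss commonly used to fit a reward model to human preference data. The model, feature dimensions, and data are hypothetical placeholders rather than any specific production setup; this is a minimal sketch, assuming PyTorch is available.

```python
# Minimal sketch of the pairwise loss commonly used to train reward models on
# human preference data (Bradley-Terry style). Everything here is a toy placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy scorer: maps a fixed-size embedding of (prompt, response) to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximise the log-probability that the human-preferred response scores higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random features stand in for embeddings of preferred/rejected responses.
model = TinyRewardModel()
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # gradients push the reward model toward the human labels
```

Because the model is only ever trained to rank the preferred response above the rejected one, it inherits whatever subjectivity and bias the human labels contain, which is exactly the dynamic the rest of this article examines.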
Human feedback is appealing for several reasons:
Subjective Preferences: It allows models to incorporate subjective human preferences, which can guide them toward socially acceptable or contextually relevant outputs.
Interactive Learning: Real-time feedback can facilitate adaptive learning, helping models quickly align with task-specific needs.
Familiar Metric: People are accustomed to using human judgment as a benchmark for quality in many domains, making it an intuitive method for assessing models.
However, human feedback should not be regarded as the ultimate measure of model quality for several reasons.
Limitations of Human Feedback as a Gold Standard
1. Subjectivity and Bias
Human feedback is inherently subjective. Different individuals may interpret the same prompt and output differently based on personal experiences, cultural backgrounds, and even mood. This variation can introduce inconsistencies in model evaluation. For instance, what one person views as a creative or insightful response, another may find irrelevant or overly verbose.
Bias is another critical issue. Human raters may unconsciously favour responses that align with their beliefs or expectations, reinforcing those biases within the model. For instance, a politically neutral prompt might receive varying evaluations depending on the reviewer’s political leanings, affecting the objectivity of the model.
For example, research from Zhou et al. (2023) showed that different individuals often provide conflicting feedback on the same model output. In their study, raters from diverse cultural and linguistic backgrounds assigned varying scores to identical responses, highlighting how subjective interpretation can skew evaluation. This is problematic when evaluating LLMs, which are expected to perform consistently across contexts.
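One simple, widely used way to quantify this kind of disagreement is an inter-rater agreement statistic such as Cohen's kappa. The sketch below uses made-up binary ratings (not data from Zhou et al.) purely to show the calculation, and assumes scikit-learn is installed.

```python
# Illustrative sketch: measuring how much two raters disagree on the same outputs.
# The ratings are invented labels (1 = acceptable, 0 = not acceptable).
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 0, 0, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below 1.0 signal subjective disagreement
```

A kappa value well below 1.0 indicates that raters are not interpreting the task the same way, which is worth surfacing before their feedback is used to train or evaluate a model.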
2. Preference Does Not Equal Performance
Human feedback typically uses a "preference score" to rank the outputs of LLMs. However, this score often fails to capture the nuanced properties of the output, such as factual accuracy, logical consistency, and ethical alignment. A high preference score might indicate that a response is likable or easy to read, but it doesn't guarantee that the response is accurate or free from harmful content.
A recent study by Gao et al. (2023) pointed out that high human preference scores do not always correlate with objective correctness. They tested LLMs on factual knowledge tasks and found that responses rated highly by humans were often factually incorrect or misleading. This is particularly concerning in fields like healthcare or law, where accuracy is critical. For instance, a fluently written but incorrect medical diagnosis may receive a high preference score, posing significant risks in real-world applications.
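A quick sanity check in practice is to measure how preference scores track an independent correctness label. The sketch below uses synthetic numbers (not results from Gao et al.) and assumes SciPy is available.

```python
# Hedged illustration: do human preference scores correlate with factual correctness?
# The scores and correctness labels are synthetic, chosen only to show the calculation.
from scipy.stats import spearmanr

preference_scores = [4.8, 4.5, 4.9, 3.2, 4.7, 2.9]   # mean human ratings (1-5 scale)
is_correct        = [0,   1,   0,   1,   0,   1]      # ground-truth factuality check

rho, p_value = spearmanr(preference_scores, is_correct)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A low or negative rho would indicate that well-liked answers are not necessarily correct.
```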
3. Generalisation Issues
Relying on human feedback during model training can make LLMs overly specialised to human preferences in specific contexts. While this might enhance performance on similar tasks, it limits the model’s ability to generalise across diverse scenarios.
For example, a model trained primarily with English-speaking human feedback may struggle to generalise across different languages or cultural contexts, as illustrated by research from Welbl et al. (2022). This can result in performance degradation when the model is deployed in real-world, global settings.
The issue of overfitting to human feedback also arises when models are pushed to optimise for human-preferred responses, which might not necessarily align with ground-truth data. This can result in models that are good at mimicking human-like behaviour but are not necessarily grounded in objective truth.
4. Ethical Implications
Human feedback often reflects existing societal biases, which can be amplified when used to train LLMs. This feedback loop can lead to ethical concerns, where models reinforce harmful stereotypes or exhibit biased behaviour because their training objectives are aligned with biased human judgments. Moreover, if feedback is sourced from only a limited demographic group, the model's outputs will naturally skew toward the preferences and biases of that group, compromising the model's inclusivity.
Research by Bender et al. (2021) supports this concern: feedback that reflects stereotypes, whether held consciously or unconsciously, can be absorbed by LLMs and amplified through repeated interactions, particularly when the feedback is biased against marginalised groups.
Without objective oversight, the use of human feedback in sensitive areas like criminal justice, education, or hiring could lead to unintended consequences, further entrenching systemic biases.
Towards a More Comprehensive Evaluation Framework
Given the limitations of human feedback, it’s clear that relying solely on this method for model evaluation is insufficient. A more comprehensive evaluation framework should include the following:
1. Objective Metrics
Integrating objective, task-specific metrics into the evaluation framework can help mitigate the shortcomings of human preference scores. For example, in tasks that require factual correctness, the use of automated fact-checking systems or ground-truth comparisons can provide a more accurate measure of model performance.
Research from Kiela et al. (2021) recommends integrating objective, task-specific metrics like BLEU or ROUGE scores for linguistic tasks and precision/recall metrics for factual correctness. Objective metrics ensure that models are not just generating outputs that humans like, but that they are also accurate and aligned with truth.
Example: In machine translation tasks, BLEU scores can provide an objective measure of how closely a model’s translations match ground truth references. While human feedback might prefer more creative translations, BLEU captures how faithful the translation is to the original text.
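As a hedged illustration, the snippet below computes a corpus-level BLEU score with the sacrebleu package (assumed installed); the hypothesis and reference sentences are invented examples, not a real evaluation set.

```python
# Sketch of an objective BLEU check that can sit alongside human preference ratings.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he went to the market yesterday"]
references = ["the cat sat on the mat", "yesterday he went to the market"]

# corpus_bleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")  # higher means closer to the reference translations
```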
2. Automated Evaluation Systems
Machine-based evaluation methods, such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), can provide consistent, reproducible measurements of linguistic quality, although these metrics have well-known limitations of their own. When combined with human feedback, automated evaluations offer a more balanced approach, ensuring that models are evaluated on both human-centric and objective criteria.
Research from Mehri and Eskenazi (2020) proposes using automated systems that can provide consistency in evaluations. These systems, such as metric-based evaluations or automated fact-checking tools, can help reduce the subjectivity of human evaluations and offer a more balanced assessment.
Example: Using an automated fact-checking system to assess the truthfulness of generated outputs in a medical question-answering system can prevent cases where human feedback might prefer fluency over factual correctness.
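The toy sketch below shows the shape such a factuality gate could take: generated answers are checked against a small reference knowledge base. The questions, answers, and exact-substring check are deliberately simplistic placeholders; a real system would rely on a curated source and proper claim matching rather than string comparison.

```python
# Toy sketch of an automated factuality gate for a medical Q&A pipeline.
# The knowledge base and answers are hypothetical illustrations only.
REFERENCE_ANSWERS = {
    "What is the normal adult resting heart rate range?": "60 to 100 beats per minute",
}

def factually_consistent(question: str, generated_answer: str) -> bool:
    reference = REFERENCE_ANSWERS.get(question)
    # Crude check: the reference fact must appear verbatim in the generated answer.
    return reference is not None and reference.lower() in generated_answer.lower()

answer = "A typical adult resting heart rate is 60 to 100 beats per minute."
print(factually_consistent("What is the normal adult resting heart rate range?", answer))
```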
3. Diverse Human Feedback Pools
Expanding the diversity of human evaluators is essential to reduce bias. A study by Clark et al. (2021) suggests sourcing feedback from a wide range of demographic groups, cultural backgrounds, and expertise levels so that models are aligned with a more inclusive, global representation of human preferences.
Example: Including domain experts in the feedback loop is crucial for high-stakes applications, ensuring that models are evaluated by individuals who can accurately assess factual correctness and adherence to professional standards.
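One lightweight way to keep an over-represented rater group from dominating the aggregate signal is to average scores within each demographic group before averaging across groups. The sketch below uses invented group labels and scores purely for illustration.

```python
# Sketch of aggregating preference scores so that no single demographic group
# dominates: each group's mean gets equal weight in the final score.
from collections import defaultdict
from statistics import mean

ratings = [
    ("group_a", 4.5), ("group_a", 4.7), ("group_a", 4.6),   # over-represented group
    ("group_b", 3.1),                                        # under-represented group
    ("group_c", 3.8), ("group_c", 4.0),
]

by_group = defaultdict(list)
for group, score in ratings:
    by_group[group].append(score)

balanced_score = mean(mean(scores) for scores in by_group.values())
naive_score = mean(score for _, score in ratings)
print(f"naive mean: {naive_score:.2f}, group-balanced mean: {balanced_score:.2f}")
```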
4. Robustness Testing
A model’s robustness should be tested across a wide variety of conditions, such as adversarial prompts, multilingual settings, and real-world scenarios where human preferences may vary widely. This helps prevent overfitting to specific types of feedback and ensures that models generalise beyond the preferences they were trained on. Adversarial testing, multilingual evaluations, and testing under realistic deployment conditions are crucial, as demonstrated by Gururangan et al. (2020).
Example: A model might be tested for robustness by generating outputs for adversarially phrased questions designed to trick it. This ensures that the model does not just optimise for human preferences but also produces factually correct and logically coherent responses.
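A minimal harness for this kind of adversarial probing might look like the sketch below. The generate callable is a hypothetical stand-in for whatever model interface is being evaluated, and the prompts, expected facts, and substring check are illustrative only.

```python
# Minimal harness for probing robustness with adversarially phrased prompts.
from typing import Callable

ADVERSARIAL_CASES = [
    # (prompt built on a false premise, substring the answer should still contain)
    ("Since the Great Wall of China is visible from the Moon, how long is it?",
     "not visible from the moon"),
    ("Everyone agrees lightning never strikes the same place twice, right?",
     "can strike the same place"),
]

def robustness_report(generate: Callable[[str], str]) -> float:
    passed = 0
    for prompt, expected_fact in ADVERSARIAL_CASES:
        answer = generate(prompt).lower()
        if expected_fact in answer:   # crude check: the model resists the false premise
            passed += 1
        else:
            print(f"FAILED: {prompt}")
    return passed / len(ADVERSARIAL_CASES)

# Usage: robustness_report(my_model.generate)  # my_model is a hypothetical LLM wrapper
```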
Conclusion
While human feedback plays a vital role in shaping the behaviours and outputs of LLMs, it is not a perfect or comprehensive evaluation metric. The subjectivity, bias, and limitations inherent in human feedback make it clear that this method should be supplemented with more objective measures to ensure that models are robust, fair, and aligned with factual truths.
As LLMs become more integrated into real-world applications, moving beyond human feedback as the gold standard will be essential for building models that not only align with human preferences but also meet objective criteria for accuracy, consistency, and ethical behaviour. A multi-dimensional approach to model evaluation is necessary to achieve the full potential of AI in addressing complex, real-world challenges.
References
1. Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1706.03741
2. Zhou, K., et al. (2023). Assessing the Subjectivity of Human Evaluations. arXiv preprint. https://arxiv.org/abs/2303.00753
3. Gao, X., et al. (2023). Challenges in Reinforcement Learning with Human Feedback. arXiv preprint. https://arxiv.org/abs/2302.05688
4. Welbl, J., et al. (2022). Human feedback misalignment in multilingual LLMs. International Conference on Learning Representations. https://arxiv.org/abs/2202.06439
5. Bender, E. M., et al. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://dl.acm.org/doi/10.1145/3442188.3445922
6. Zhao, J., et al. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. arXiv preprint. https://arxiv.org/abs/1707.09457
7. Kiela, D., et al. (2021). The Importance of Objective Metrics in Large Language Model Evaluation. NeurIPS. https://arxiv.org/abs/2108.07258
8. Mehri, S., & Eskenazi, M. (2020). USR: An Unsupervised and Reference-free Evaluation Metric for Dialog Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/abs/2005.00462
9. Clark, C., et al. (2021). Diverse Human Feedback for Robust LLM Evaluation. arXiv preprint. https://arxiv.org/abs/2109.07341
10. Gururangan, S., et al. (2020). Adversarial Testing in LLMs. Association for Computational Linguistics. https://arxiv.org/abs/2007.01969
Disclaimer: The opinions and perspectives presented in this article are solely based on my independent research and analysis. They do not reflect or represent the official strategies, views, or internal policies of any organisation or company with which I am or have been affiliated.