Why Human Feedback Isn’t the Gold Standard for Evaluating AI Models
Introduction
As Large Language Models (LLMs) have become increasingly sophisticated, the role of human feedback in their evaluation and fine-tuning has risen to prominence. Human feedback has often been positioned as the gold standard, playing a key role in shaping model behaviours and outputs through reinforcement learning techniques such as Reinforcement Learning from Human Feedback (RLHF). While this approach has undeniable merits, it also poses significant challenges. Relying too heavily on human feedback may limit the robustness, fairness, and generalisation of LLMs, ultimately impacting their ability to provide high-quality and objective responses.
This article explores why human feedback, though valuable, is not a flawless measure for evaluating LLMs. We examine its limitations and consider alternative strategies for more holistic model evaluation.
The Emergence of Human Feedback in LLMs
Human feedback has become a widely adopted method for guiding and evaluating the performance of LLMs. Human raters are often tasked with scoring the outputs of these models based on their relevance, fluency, and adherence to prompt guidelines. When using RLHF, models are trained to optimise responses according to these preferences, improving their ability to align with human-like responses in various tasks.
A key example is the use of Reward Models in training LLMs. Reward Models are trained on human preference data and are then used to fine-tune the model to generate outputs that align with human expectations. However, recent studies have shown that optimising models against such feedback can result in "reward hacking", where models learn to exploit the preference function without genuinely improving task performance.
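To make the reward-modelling step concrete, the sketch below shows the pairwise (Bradley-Terry style) loss commonly used to fit a reward model to human preference data. The model, feature dimensions, and data are hypothetical placeholders rather than any specific production setup; this is a minimal sketch, assuming PyTorch is available.

```python
# Minimal sketch of the pairwise loss commonly used to train reward models on
# human preference data (Bradley-Terry style). Everything here is a toy placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy scorer: maps a fixed-size embedding of (prompt, response) to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximise the log-probability that the human-preferred response scores higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random features stand in for embeddings of preferred/rejected responses.
model = TinyRewardModel()
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # gradients push the reward model toward the human labels
```

Because the model is only ever trained to rank the preferred response above the rejected one, it inherits whatever subjectivity and bias the human labels contain, which is exactly the dynamic the rest of this article examines.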
Human feedback is appealing for several reasons:
Subjective Preferences: It allows models to incorporate subjective human preferences, which can guide them toward socially acceptable or contextually relevant outputs.
Interactive Learning: Real-time feedback can facilitate adaptive learning, helping models quickly align with task-specific needs.
Familiar Metric: People are accustomed to using human judgment as a benchmark for quality in many domains, making it an intuitive method for assessing models.
However, human feedback should not be regarded as the ultimate measure of model quality for several reasons.
Limitations of Human Feedback as a Gold Standard
1. Subjectivity and Bias
Human feedback is inherently subjective. Different individuals may interpret the same prompt and output differently based on personal experiences, cultural backgrounds, and even mood. This variation can introduce inconsistencies in model evaluation. For instance, what one person views as a creative or insightful response, another may find irrelevant or overly verbose.
Bias is another critical issue. Human raters may unconsciously favour responses that align with their beliefs or expectations, reinforcing those biases within the model. For instance, a politically neutral prompt might receive varying evaluations depending on the reviewer’s political leanings, affecting the objectivity of the model.
For example, research from Zhou et al. (2023) showed that different individuals often provide conflicting feedback on the same model output. In their study, raters from diverse cultural and linguistic backgrounds assigned varying scores to identical responses, highlighting how subjective interpretation can skew evaluation. This is problematic when evaluating LLMs, which are expected to perform consistently across contexts.
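One simple, widely used way to quantify this kind of disagreement is an inter-rater agreement statistic such as Cohen's kappa. The sketch below uses made-up binary ratings (not data from Zhou et al.) purely to show the calculation, and assumes scikit-learn is installed.

```python
# Illustrative sketch: measuring how much two raters disagree on the same outputs.
# The ratings are invented labels (1 = acceptable, 0 = not acceptable).
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 0, 0, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below 1.0 signal subjective disagreement
```

A kappa value well below 1.0 indicates that raters are not interpreting the task the same way, which is worth surfacing before their feedback is used to train or evaluate a model.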
2. Preference Does Not Equal Performance
Human feedback typically uses a "preference score" to rank the outputs of LLMs. However, this score often fails to capture the nuanced properties of the output, such as factual accuracy, logical consistency, and ethical alignment. A high preference score might indicate that a response is likable or easy to read, but it doesn't guarantee that the response is accurate or free from harmful content.
A recent study by Gao et al. (2023) pointed out that high human preference scores do not always correlate with objective correctness. They tested LLMs on factual knowledge tasks and found that responses rated highly by humans were often factually incorrect or misleading. This is particularly concerning in fields like healthcare or law, where accuracy is critical. For instance, a fluently written but incorrect medical diagnosis may receive a high preference score, posing significant risks in real-world applications.
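A quick sanity check in practice is to measure how preference scores track an independent correctness label. The sketch below uses synthetic numbers (not results from Gao et al.) and assumes SciPy is available.

```python
# Hedged illustration: do human preference scores correlate with factual correctness?
# The scores and correctness labels are synthetic, chosen only to show the calculation.
from scipy.stats import spearmanr

preference_scores = [4.8, 4.5, 4.9, 3.2, 4.7, 2.9]   # mean human ratings (1-5 scale)
is_correct        = [0,   1,   0,   1,   0,   1]      # ground-truth factuality check

rho, p_value = spearmanr(preference_scores, is_correct)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A low or negative rho would indicate that well-liked answers are not necessarily correct.
```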
3. Generalisation Issues
Relying on human feedback during model training can make LLMs overly specialised to human preferences in specific contexts. While this might enhance performance on similar tasks, it limits the model’s ability to generalise across diverse scenarios.
For example, a model trained primarily with English-speaking human feedback may struggle to generalise across different languages or cultural contexts, as illustrated by research from Welbl et al. (2022). This can result in performance degradation when the model is deployed in real-world, global settings.
The issue of overfitting to human feedback also arises when models are pushed to optimise for human-preferred responses, which might not necessarily align with ground-truth data. This can result in models that are good at mimicking human-like behaviour but are not necessarily grounded in objective truth.
4. Ethical Implications
Human feedback often reflects existing societal biases, which can be amplified when used to train LLMs. This feedback loop can lead to ethical concerns, where models reinforce harmful stereotypes or exhibit biased behaviour because their training objectives are aligned with biased human judgments. Moreover, if feedback is sourced from only a limited demographic group, the model's outputs will naturally skew toward the preferences and biases of that group, compromising the model's inclusivity.
Research by Bender et al. (2021) supports this concern: feedback that reflects stereotypes, whether held consciously or unconsciously, can be absorbed by LLMs and amplified through repeated interactions, particularly when the feedback is biased against marginalised groups.
Without objective oversight, the use of human feedback in sensitive areas like criminal justice, education, or hiring could lead to unintended consequences, further entrenching systemic biases.
Towards a More Comprehensive Evaluation Framework
Given the limitations of human feedback, it’s clear that relying solely on this method for model evaluation is insufficient. A more comprehensive evaluation framework should include the following:
1. Objective Metrics
Integrating objective, task-specific metrics into the evaluation framework can help mitigate the shortcomings of human preference scores. For example, in tasks that require factual correctness, the use of automated fact-checking systems or ground-truth comparisons can provide a more accurate measure of model performance.
Research from Kiela et al. (2021) recommends integrating objective, task-specific metrics like BLEU or ROUGE scores for linguistic tasks and precision/recall metrics for factual correctness. Objective metrics ensure that models are not just generating outputs that humans like, but that they are also accurate and aligned with truth.
Example: In machine translation tasks, BLEU scores can provide an objective measure of how closely a model’s translations match ground truth references. While human feedback might prefer more creative translations, BLEU captures how faithful the translation is to the original text.
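As a hedged illustration, the snippet below computes a corpus-level BLEU score with the sacrebleu package (assumed installed); the hypothesis and reference sentences are invented examples, not a real evaluation set.

```python
# Sketch of an objective BLEU check that can sit alongside human preference ratings.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he went to the market yesterday"]
references = ["the cat sat on the mat", "yesterday he went to the market"]

# corpus_bleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")  # higher means closer to the reference translations
```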
2. Automated Evaluation Systems
Machine-based evaluation methods, such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), can provide consistent, reproducible measurements of linguistic quality, although these metrics have well-known limitations of their own. When combined with human feedback, automated evaluations offer a more balanced approach, ensuring that models are evaluated on both human-centric and objective criteria.
Research from Mehri and Eskenazi (2020) proposes using automated systems that can provide consistency in evaluations. These systems, such as metric-based evaluations or automated fact-checking tools, can help reduce the subjectivity of human evaluations and offer a more balanced assessment.
Example: Using an automated fact-checking system to assess the truthfulness of generated outputs in a medical question-answering system can prevent cases where human feedback might prefer fluency over factual correctness.
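The toy sketch below shows the shape such a factuality gate could take: generated answers are checked against a small reference knowledge base. The questions, answers, and exact-substring check are deliberately simplistic placeholders; a real system would rely on a curated source and proper claim matching rather than string comparison.

```python
# Toy sketch of an automated factuality gate for a medical Q&A pipeline.
# The knowledge base and answers are hypothetical illustrations only.
REFERENCE_ANSWERS = {
    "What is the normal adult resting heart rate range?": "60 to 100 beats per minute",
}

def factually_consistent(question: str, generated_answer: str) -> bool:
    reference = REFERENCE_ANSWERS.get(question)
    # Crude check: the reference fact must appear verbatim in the generated answer.
    return reference is not None and reference.lower() in generated_answer.lower()

answer = "A typical adult resting heart rate is 60 to 100 beats per minute."
print(factually_consistent("What is the normal adult resting heart rate range?", answer))
```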
3. Diverse Human Feedback Pools
Expanding the diversity of human evaluators is essential to reduce bias. A study by Clark et al. (2021) suggests sourcing feedback from a wide range of demographic groups, cultural backgrounds, and expertise levels so that models are aligned with a more inclusive, global representation of human preferences.
Example: Including domain experts in the feedback loop is crucial for high-stakes applications, ensuring that models are evaluated by individuals who can accurately assess factual correctness and adherence to professional standards.
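One lightweight way to keep an over-represented rater group from dominating the aggregate signal is to average scores within each demographic group before averaging across groups. The sketch below uses invented group labels and scores purely for illustration.

```python
# Sketch of aggregating preference scores so that no single demographic group
# dominates: each group's mean gets equal weight in the final score.
from collections import defaultdict
from statistics import mean

ratings = [
    ("group_a", 4.5), ("group_a", 4.7), ("group_a", 4.6),   # over-represented group
    ("group_b", 3.1),                                        # under-represented group
    ("group_c", 3.8), ("group_c", 4.0),
]

by_group = defaultdict(list)
for group, score in ratings:
    by_group[group].append(score)

balanced_score = mean(mean(scores) for scores in by_group.values())
naive_score = mean(score for _, score in ratings)
print(f"naive mean: {naive_score:.2f}, group-balanced mean: {balanced_score:.2f}")
```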
4. Robustness Testing
A model’s robustness should be tested across a wide variety of conditions, such as adversarial prompts, multilingual settings, and real-world scenarios where human preferences may vary widely. This helps prevent overfitting to specific types of feedback and ensures that models generalise beyond the preferences they were trained on. Adversarial testing, multilingual evaluations, and testing under realistic deployment conditions are crucial, as demonstrated by Gururangan et al. (2020).
Example: A model might be tested for robustness by generating outputs for adversarially phrased questions designed to trick it. This ensures that the model does not just optimise for human preferences but also produces factually correct and logically coherent responses.
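A minimal harness for this kind of adversarial probing might look like the sketch below. The generate callable is a hypothetical stand-in for whatever model interface is being evaluated, and the prompts, expected facts, and substring check are illustrative only.

```python
# Minimal harness for probing robustness with adversarially phrased prompts.
from typing import Callable

ADVERSARIAL_CASES = [
    # (prompt built on a false premise, substring the answer should still contain)
    ("Since the Great Wall of China is visible from the Moon, how long is it?",
     "not visible from the moon"),
    ("Everyone agrees lightning never strikes the same place twice, right?",
     "can strike the same place"),
]

def robustness_report(generate: Callable[[str], str]) -> float:
    passed = 0
    for prompt, expected_fact in ADVERSARIAL_CASES:
        answer = generate(prompt).lower()
        if expected_fact in answer:   # crude check: the model resists the false premise
            passed += 1
        else:
            print(f"FAILED: {prompt}")
    return passed / len(ADVERSARIAL_CASES)

# Usage: robustness_report(my_model.generate)  # my_model is a hypothetical LLM wrapper
```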
Conclusion
While human feedback plays a vital role in shaping the behaviours and outputs of LLMs, it is not a perfect or comprehensive evaluation metric. The subjectivity, bias, and limitations inherent in human feedback make it clear that this method should be supplemented with more objective measures to ensure that models are robust, fair, and aligned with factual truths.
As LLMs become more integrated into real-world applications, moving beyond human feedback as the gold standard will be essential for building models that not only align with human preferences but also meet objective criteria for accuracy, consistency, and ethical behaviour. A multi-dimensional approach to model evaluation is necessary to achieve the full potential of AI in addressing complex, real-world challenges.
References
1. Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1706.03741
2. Zhou, K., et al. (2023). Assessing the Subjectivity of Human Evaluations. arXiv preprint. https://arxiv.org/abs/2303.00753
3. Gao, X., et al. (2023). Challenges in Reinforcement Learning with Human Feedback. arXiv preprint. https://arxiv.org/abs/2302.05688
4. Welbl, J., et al. (2022). Human feedback misalignment in multilingual LLMs. International Conference on Learning Representations. https://arxiv.org/abs/2202.06439
5. Bender, E. M., et al. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://dl.acm.org/doi/10.1145/3442188.3445922
6. Zhao, J., et al. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. arXiv preprint. https://arxiv.org/abs/1707.09457
7. Kiela, D., et al. (2021). The Importance of Objective Metrics in Large Language Model Evaluation. NeurIPS. https://arxiv.org/abs/2108.07258
8. Mehri, S., & Eskenazi, M. (2020). USR: An Unsupervised and Reference-free Evaluation Metric for Dialog Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/abs/2005.00462
9. Clark, C., et al. (2021). Diverse Human Feedback for Robust LLM Evaluation. arXiv preprint. https://arxiv.org/abs/2109.07341
10. Gururangan, S., et al. (2020). Adversarial Testing in LLMs. Association for Computational Linguistics. https://arxiv.org/abs/2007.01969
Disclaimer: The opinions and perspectives presented in this article are solely based on my independent research and analysis. They do not reflect or represent the official strategies, views, or internal policies of any organisation or company with which I am or have been affiliated.