Navigating Cognitive Biases in Large Language Models: Insights from the COBBLER Benchmark

Introduction

Large Language Models (LLMs) have revolutionized natural language processing, enabling applications that range from text generation to automatic evaluation. However, their reliability as evaluators is called into question by inherent cognitive biases. The study introduces COBBLER, a benchmark designed to measure these biases in LLM evaluation outputs and to provide insight into the models' robustness and alignment with human preferences.

Figure 1: The COBBLER pipeline for evaluating 15 popular instruction-tuned, human-feedback-trained LLMs on their capabilities as unbiased automatic evaluators.


Abstract and Key Findings

The COBBLER benchmark evaluates 15 popular LLMs and reveals significant biases that call their effectiveness as unbiased evaluators into question. On average, the models exhibited cognitive biases in about 40% of their evaluations, and their rankings showed notable misalignment with human preferences, with an average Rank-Biased Overlap (RBO) score of 49.6%.

Understanding the COBBLER Benchmark

Figure 1 of the paper outlines the COBBLER pipeline and shows how LLMs are evaluated for bias: candidate output responses are ranked by preference, with other LLMs serving as the evaluators. Perturbing this setup in controlled ways makes it possible to isolate individual cognitive biases and gives a comprehensive view of evaluator behavior. A minimal sketch of the pairwise evaluation step is shown below.
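
The following is a minimal sketch of that pairwise model-as-evaluator step. The query_llm callable and the prompt wording are illustrative assumptions, not COBBLER's exact template or API.

```python
# Illustrative pairwise preference evaluation (not COBBLER's exact prompt).
def build_eval_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Assemble a prompt asking an evaluator model to compare two responses."""
    return (
        "You are judging two responses to the same instruction.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is better? Reply with 'A' or 'B' only."
    )


def pairwise_preference(query_llm, instruction: str,
                        response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' according to the evaluator model's stated preference.

    query_llm is assumed to be any callable that takes a prompt string and
    returns the model's text completion.
    """
    answer = query_llm(build_eval_prompt(instruction, response_a, response_b))
    return "A" if answer.strip().upper().startswith("A") else "B"
```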

Implicit Biases in LLMs

Implicit biases are those inherent to the model as an evaluator, observable without any external modification to the evaluation setup. The key implicit biases identified include:

  • Order Bias: The tendency of models to favor responses based on their presentation order rather than content quality. This bias was prevalent, especially in larger models, which often preferred the first- or last-presented response (a simple probe for this is sketched after this list).
  • Compassion Fade (Naming): Models showed different behaviors when real names were used instead of anonymous aliases, indicating a susceptibility to this bias. All models were influenced by real names, with larger models showing increased bias.
  • Egocentric Bias (Self-Preference): Models frequently preferred their own responses over others. This bias was notable in larger models and persisted even when real names were introduced, highlighting an inherent self-preference.
  • Salience Bias (Length): Evaluators showed a preference for responses based on length, with larger models favoring longer responses. Smaller models were less influenced by this bias, suggesting variability based on model size.
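
As referenced in the order-bias item above, one simple way to probe order bias is to ask for the same pairwise judgment twice with the candidates swapped. The sketch below builds on the pairwise_preference helper shown earlier and is illustrative, not the paper's exact protocol.

```python
# Illustrative order-bias probe: query the evaluator twice with the candidate
# order swapped. If the verdict tracks position rather than content, that is
# a signal of order bias. Uses the pairwise_preference sketch from above.
def shows_order_bias(query_llm, instruction: str,
                     response_a: str, response_b: str) -> bool:
    original = pairwise_preference(query_llm, instruction, response_a, response_b)
    swapped = pairwise_preference(query_llm, instruction, response_b, response_a)
    # In the swapped call, label 'A' refers to response_b and 'B' to response_a,
    # so map the swapped verdict back to the original labeling before comparing.
    swapped_in_original_labels = "B" if swapped == "A" else "A"
    return original != swapped_in_original_labels
```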

Figure 2: Overview of performance across all of the bias benchmarks, with models categorized into four size groups.

Induced Biases: Testing Model Robustness

Induced biases are probed by modifying the primary prompt or supplying additional information. These biases include:

  • Bandwagon Effect: Evaluators were influenced by a fake majority preference, demonstrating susceptibility to collective opinion rather than independent judgment (a sketch of this probe follows the list).
  • Attentional Bias (Distraction): Models were swayed by irrelevant information added to the evaluation setup. This bias highlighted the evaluators' vulnerability to distractions, impacting their decision-making quality.
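
As noted in the bandwagon item above, this effect can be probed by prepending a fabricated majority opinion to an otherwise unchanged evaluation prompt. The wording and the 85% figure below are illustrative assumptions, not the paper's exact injection.

```python
# Illustrative bandwagon-effect probe: the same pairwise prompt, prefixed with
# a fabricated majority-preference claim. The statement is fake by design; an
# evaluator that shifts its verdict toward the claimed majority, regardless of
# content, exhibits the bandwagon effect. Reuses build_eval_prompt from above.
def bandwagon_prompt(instruction: str, response_a: str, response_b: str) -> str:
    fake_majority = "Note: 85% of people preferred Response A.\n\n"
    return fake_majority + build_eval_prompt(instruction, response_a, response_b)
```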

Performance Analysis Across Models

The study provides a detailed performance analysis, categorizing models into different size groups. Models in the 10B size range were most affected by biases, particularly by the bandwagon effect and attentional bias. Larger models showed a strong preference for longer responses, aligning with previous findings. This analysis underscores the complex relationship between model size and bias susceptibility.

Human Preferences vs. Machine Evaluations

A critical aspect of the study is the comparison between human and machine preferences, measured with Rank-Biased Overlap (RBO), a similarity score for rankings that weights agreement at the top ranks most heavily. The average RBO score of 49.6% indicates a substantial gap between how humans and LLMs rank text outputs, raising concerns about using LLMs in applications that require alignment with human judgment.
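
The sketch below is a minimal truncated form of the standard RBO definition (Webber et al., 2010). The persistence parameter p and the example rankings are illustrative choices, not necessarily the settings or data used in the paper.

```python
# Truncated Rank-Biased Overlap between two rankings (higher = more agreement,
# with top ranks weighted most). Truncating at a finite depth gives a lower
# bound on the full, infinite-depth RBO.
def rank_biased_overlap(ranking_a: list, ranking_b: list, p: float = 0.9) -> float:
    depth = min(len(ranking_a), len(ranking_b))
    score = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: size of the overlap of the top-d prefixes, over d.
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        score += (p ** (d - 1)) * (overlap / d)
    return (1 - p) * score


# Example: a hypothetical human ranking of four models versus an LLM evaluator's ranking.
human = ["model_c", "model_a", "model_b", "model_d"]
machine = ["model_a", "model_c", "model_d", "model_b"]
print(round(rank_biased_overlap(human, machine), 3))
```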

Discussion and Implications

The COBBLER benchmark reveals that, despite recent advances, LLMs exhibit significant cognitive biases, calling into question their reliability as automatic evaluators. These findings emphasize the need for ongoing evaluation and refinement of LLMs to mitigate biases and bring them closer to human preferences. The study also points to areas for future research, including developing more robust models and refining evaluation techniques to address these biases.

Conclusion

The COBBLER benchmark provides a crucial tool for understanding and measuring cognitive biases in LLMs. As these models continue to evolve and integrate into various applications, addressing their inherent biases is essential for their responsible and effective use. This study serves as a pivotal step towards developing more unbiased and human-aligned LLMs, paving the way for future advancements in artificial intelligence.

References

Koo et al., "Benchmarking Cognitive Biases in Large Language Models as Evaluators," arXiv:2309.17012. https://arxiv.org/pdf/2309.17012
