RULER: What's the Real Context Size of Your Long-Context Language Models?
This paper proposes RULER, a new benchmark to comprehensively evaluate the long-context modeling capabilities of large language models (LLMs). Existing benchmarks focus mainly on retrieval tasks, failing to test other important aspects of long-context understanding.
Method Overview
RULER contains four categories of tasks with flexible configurations:
1. Retrieval: This extends the needle-in-a-haystack (NIAH) test to evaluate retrieval with diverse types of "needles" (words, numbers, UUIDs), a varying number of needles/values to retrieve, and distractors that must be ignored (see the generation sketch after this list).
2. Multi-hop Tracing: The variable tracking task tests the ability to trace co-referring entities across long contexts by following chains of variable bindings (e.g. X1=12345, X2=X1, X3=X2; sketched below).
3. Aggregation: The common words extraction (CWE) and frequent words extraction (FWE) tasks require aggregating information spread across the long input to identify the most common or most frequent words (sketched below).
4. Question Answering: Existing QA datasets are extended by embedding the paragraphs that contain the answer within long distracting contexts, testing question answering over long inputs.
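To make the retrieval category concrete, here is a minimal sketch of how a NIAH example with UUID needles might be generated. The function name, the filler text, and the prompt template are hypothetical illustrations, not the paper's actual generation code.

```python
import random
import uuid

# Hypothetical filler sentences standing in for the paper's haystack text.
FILLER = "The grass is green. The sky is blue. The sun is bright. "

def make_niah_example(context_len_words: int, num_needles: int = 1, seed: int = 0):
    """Build a haystack of filler words with key/value 'needles' inserted at
    random positions, and a question asking for one needle's value."""
    rng = random.Random(seed)
    words = (FILLER * (context_len_words // len(FILLER.split()) + 1)).split()
    words = words[:context_len_words]

    # Each needle is a key bound to a UUID value hidden in the haystack.
    needles = {f"key-{i}": str(uuid.uuid4()) for i in range(num_needles)}
    for key, value in needles.items():
        pos = rng.randrange(len(words) + 1)
        words.insert(pos, f"The special magic value for {key} is {value}.")

    query_key = rng.choice(sorted(needles))
    prompt = " ".join(words) + f"\nWhat is the special magic value for {query_key}?"
    return prompt, needles[query_key]
```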
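The multi-hop tracing task can be sketched the same way. Again, the function and naming scheme are illustrative assumptions; in RULER the binding statements are interleaved with long stretches of filler text.

```python
import random

def make_variable_chain(num_hops: int, seed: int = 0):
    """Generate a binding chain X1 = <value>, X2 = X1, ..., and return the
    statements plus the value and the variables that resolve to it."""
    rng = random.Random(seed)
    value = str(rng.randrange(10_000, 100_000))
    names = [f"X{i}" for i in range(1, num_hops + 2)]

    statements = [f"VAR {names[0]} = {value}"]
    statements += [f"VAR {names[i]} = {names[i - 1]}" for i in range(1, len(names))]
    rng.shuffle(statements)  # order no longer reveals the chain structure

    # Expected answer: every variable in the chain resolves to `value`.
    return statements, value, names
```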
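And a sketch of the aggregation idea behind CWE: a handful of "common" words repeat many times while noise words appear once, so the answer can only be recovered by counting over the whole input. The word lists and frequencies here are made up for illustration.

```python
import random

def make_cwe_example(num_common: int = 10, common_freq: int = 30,
                     num_noise: int = 400, seed: int = 0):
    """Common words extraction: return a shuffled word sequence and the
    common words a model should identify by (implicitly) counting."""
    rng = random.Random(seed)
    common = [f"common-word-{i}" for i in range(num_common)]
    noise = [f"noise-word-{i}" for i in range(num_noise)]

    words = common * common_freq + noise  # common words dominate by count
    rng.shuffle(words)
    return " ".join(words), common
```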
The synthetic nature of RULER allows sequence length and task complexity to be controlled flexibly by adjusting parameters such as the number of needles, the number of hops in a variable chain, and the word frequency distribution, as illustrated by the configuration sketch below.
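A configuration for such a generator might look like the following. All field names and defaults are assumptions for illustration; the paper exposes analogous knobs, but not necessarily under these names.

```python
from dataclasses import dataclass

@dataclass
class RulerTaskConfig:
    """Hypothetical knobs controlling length and difficulty of one task."""
    seq_length: int = 32_000   # target context length in tokens
    num_needles: int = 4       # keys/values to retrieve (NIAH)
    num_distractors: int = 8   # hard-negative needles to ignore
    num_hops: int = 3          # binding-chain length for variable tracking
    freq_alpha: float = 2.0    # skew of the word-frequency distribution (FWE)
```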
Results
Ten LLMs were evaluated on RULER at context lengths from 4K to 128K tokens. Despite achieving near-perfect accuracy on the vanilla NIAH test, all models exhibited severe performance degradation on the other RULER tasks as context length increased. Although all claim context lengths of 32K or more, only four models maintained performance above the paper's threshold (a Llama2-7B baseline at 4K) when tested at 32K.
Weighted-average scores were used to rank the models, with GPT-4, Command-R, Yi-34B, and Mixtral the top performers under different assumed distributions over sequence length (see the sketch below).
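A minimal sketch of such a weighted average, assuming per-length accuracies and one weight per context length: the paper uses weights that increase or decrease with length to model different usage scenarios, while the numbers below are purely illustrative.

```python
def weighted_average(scores: dict[int, float], weights: dict[int, float]) -> float:
    """Aggregate per-length accuracies into a single score under an assumed
    distribution over context lengths."""
    total = sum(weights[n] for n in scores)
    return sum(scores[n] * weights[n] for n in scores) / total

# Illustrative numbers only; weights grow with length to favor long contexts.
scores = {4_000: 95.0, 32_000: 80.0, 128_000: 60.0}
weights = {4_000: 1.0, 32_000: 2.0, 128_000: 3.0}
print(round(weighted_average(scores, weights), 1))  # 72.5
```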
Conclusion
RULER reveals major shortcomings of existing LLMs in using long context for anything beyond simple retrieval. A detailed analysis of Yi-34B shows the model increasingly failing to ignore distractors, track multi-hop connections, and aggregate accurately as context length and task complexity grow. For more information, please consult the full paper.
Congrats to the authors for their work!
Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv preprint arXiv:2404.06654, 9 Apr. 2024.