RULER: What's the Real Context Size of Your Long-Context Language Models?
This paper proposes RULER, a new benchmark to comprehensively evaluate the long-context modeling capabilities of large language models (LLMs). Existing benchmarks focus mainly on retrieval tasks, failing to test other important aspects of long-context understanding.
Method Overview
RULER contains four categories of tasks with flexible configurations:
1. Retrieval: This extends the needle-in-a-haystack (NIAH) test to evaluate retrieval with diverse types of "needles" (words, numbers, UUIDs), a varying number of needles/values to retrieve, and distractors that must be ignored (see the generation sketch after this list).
2. Multi-hop Tracing: The variable tracking task tests the ability to trace co-referring entities across long contexts by following chains of variable bindings (e.g. X1=12345, X2=X1, X3=X2; sketched below).
3. Aggregation: The common words extraction (CWE) and frequent words extraction (FWE) tasks require aggregating information spread across the long input to identify the most common or most frequent words (sketched below).
4. Question Answering: Existing QA datasets are extended by embedding the paragraphs that contain the answer within long distracting contexts, testing question answering over long inputs.
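To make the retrieval category concrete, here is a minimal sketch of how a NIAH example with UUID needles might be generated. The function name, the filler text, and the prompt template are hypothetical illustrations, not the paper's actual generation code.

```python
import random
import uuid

# Hypothetical filler sentences standing in for the paper's haystack text.
FILLER = "The grass is green. The sky is blue. The sun is bright. "

def make_niah_example(context_len_words: int, num_needles: int = 1, seed: int = 0):
    """Build a haystack of filler words with key/value 'needles' inserted at
    random positions, and a question asking for one needle's value."""
    rng = random.Random(seed)
    words = (FILLER * (context_len_words // len(FILLER.split()) + 1)).split()
    words = words[:context_len_words]

    # Each needle is a key bound to a UUID value hidden in the haystack.
    needles = {f"key-{i}": str(uuid.uuid4()) for i in range(num_needles)}
    for key, value in needles.items():
        pos = rng.randrange(len(words) + 1)
        words.insert(pos, f"The special magic value for {key} is {value}.")

    query_key = rng.choice(sorted(needles))
    prompt = " ".join(words) + f"\nWhat is the special magic value for {query_key}?"
    return prompt, needles[query_key]
```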
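The multi-hop tracing task can be sketched the same way. Again, the function and naming scheme are illustrative assumptions; in RULER the binding statements are interleaved with long stretches of filler text.

```python
import random

def make_variable_chain(num_hops: int, seed: int = 0):
    """Generate a binding chain X1 = <value>, X2 = X1, ..., and return the
    statements plus the value and the variables that resolve to it."""
    rng = random.Random(seed)
    value = str(rng.randrange(10_000, 100_000))
    names = [f"X{i}" for i in range(1, num_hops + 2)]

    statements = [f"VAR {names[0]} = {value}"]
    statements += [f"VAR {names[i]} = {names[i - 1]}" for i in range(1, len(names))]
    rng.shuffle(statements)  # order no longer reveals the chain structure

    # Expected answer: every variable in the chain resolves to `value`.
    return statements, value, names
```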
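And a sketch of the aggregation idea behind CWE: a handful of "common" words repeat many times while noise words appear once, so the answer can only be recovered by counting over the whole input. The word lists and frequencies here are made up for illustration.

```python
import random

def make_cwe_example(num_common: int = 10, common_freq: int = 30,
                     num_noise: int = 400, seed: int = 0):
    """Common words extraction: return a shuffled word sequence and the
    common words a model should identify by (implicitly) counting."""
    rng = random.Random(seed)
    common = [f"common-word-{i}" for i in range(num_common)]
    noise = [f"noise-word-{i}" for i in range(num_noise)]

    words = common * common_freq + noise  # common words dominate by count
    rng.shuffle(words)
    return " ".join(words), common
```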
The synthetic nature of RULER allows sequence length and task complexity to be controlled flexibly by adjusting parameters such as the number of needles, the number of hops in a variable chain, and the word frequency distribution, as illustrated by the configuration sketch below.
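A configuration for such a generator might look like the following. All field names and defaults are assumptions for illustration; the paper exposes analogous knobs, but not necessarily under these names.

```python
from dataclasses import dataclass

@dataclass
class RulerTaskConfig:
    """Hypothetical knobs controlling length and difficulty of one task."""
    seq_length: int = 32_000   # target context length in tokens
    num_needles: int = 4       # keys/values to retrieve (NIAH)
    num_distractors: int = 8   # hard-negative needles to ignore
    num_hops: int = 3          # binding-chain length for variable tracking
    freq_alpha: float = 2.0    # skew of the word-frequency distribution (FWE)
```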
Results
Ten LLMs were evaluated on RULER at context lengths from 4K to 128K tokens. Despite achieving near-perfect accuracy on the vanilla NIAH test, all models exhibited severe performance degradation on the other RULER tasks as context length increased. Although all claim context lengths of 32K or more, only four models maintained performance above the paper's threshold (a Llama2-7B baseline at 4K) when tested at 32K.
Weighted-average scores were used to rank the models, with GPT-4, Command-R, Yi-34B, and Mixtral the top performers under different assumed distributions over sequence length (see the sketch below).
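A minimal sketch of such a weighted average, assuming per-length accuracies and one weight per context length: the paper uses weights that increase or decrease with length to model different usage scenarios, while the numbers below are purely illustrative.

```python
def weighted_average(scores: dict[int, float], weights: dict[int, float]) -> float:
    """Aggregate per-length accuracies into a single score under an assumed
    distribution over context lengths."""
    total = sum(weights[n] for n in scores)
    return sum(scores[n] * weights[n] for n in scores) / total

# Illustrative numbers only; weights grow with length to favor long contexts.
scores = {4_000: 95.0, 32_000: 80.0, 128_000: 60.0}
weights = {4_000: 1.0, 32_000: 2.0, 128_000: 3.0}
print(round(weighted_average(scores, weights), 1))  # 72.5
```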
Conclusion
RULER reveals major shortcomings of existing LLMs in using long context for anything beyond simple retrieval. A detailed analysis of Yi-34B shows the model increasingly failing to ignore distractors, track multi-hop connections, and aggregate accurately as context length and task complexity grow. For more information, please consult the full paper.
Congrats to the authors for their work!
Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv preprint arXiv:2404.06654, 9 Apr. 2024.