Why Infinite Context Is Still Not Enough
LLMs are racing to support ever-larger contexts. Gemini 1.5 Pro leads with a one-million-token window, followed by Claude 2.1, GPT-4, and GPT-4o. The promise of these models lies in their potential to process vast amounts of data, from lengthy videos and audio files to extensive codebases and voluminous novels. This is a truly remarkable achievement, and it rests on an immense amount of research (MoE, Ring Attention, hardware advances). The capability that gets touted, though, is that model ABC can identify a single instance of something in a full video. That is very impressive, but in reality it is much less useful, applying to only a small set of use cases.
Going Beyond the Needle in the Haystack
Despite their advanced capabilities, current LLMs hit a critical stumbling block: they excel at pinpointing precise details within huge inputs (finding the needle in the haystack) yet struggle to generate coherent long-form content from that same extensive context. This gap is a substantial hurdle for applications that demand lengthy, well-articulated, and well-structured outputs.
Many use cases need not just a long context but also a long, coherent output. Here is one.
Simple Case Study:
Let's try a simple task: annotating and searching through an AP Art History syllabus PDF. The PDF contains exactly 250 artworks, each with an image and a description.
Task 1: Identify all the artworks (i.e., all 250 of them).
Gemini begins to list them but comes to a sudden halt after 134. Although the input can take 1M tokens, the maximum output is capped at 8K tokens, and in practice the output is often even shorter due to fine-tuning or a bias toward brief responses. Listing 250 artworks would take no more than 2K tokens, but the model has been trained to stop after a few, as if it gets "tired." GPT-4o is no better: it produces a partial list, stops there, and, when asked to continue, messes up the numbering.
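One possible workaround is a continuation loop: feed each partial answer back and ask the model to resume from the next item number. Below is a minimal sketch, assuming an OpenAI-compatible chat API; the model name and prompts are placeholders, not the exact setup from the test above.

```python
# Sketch: work around the output-token cap with a continuation loop.
# Model name and prompts are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def list_all_items(document_text: str, expected_count: int) -> list[str]:
    """Accumulate a numbered list across multiple truncated responses."""
    messages = [
        {"role": "system",
         "content": "List every artwork in the document as numbered lines."},
        {"role": "user", "content": document_text},
    ]
    items: list[str] = []
    while len(items) < expected_count:
        reply = client.chat.completions.create(
            model="gpt-4o",  # any long-context chat model
            messages=messages,
        ).choices[0].message.content or ""
        new_items = [ln.strip() for ln in reply.splitlines() if ln.strip()]
        if not new_items:  # no progress: stop instead of looping forever
            break
        items.extend(new_items)
        # Feed the partial answer back and restate the next expected item
        # number; this is what keeps the numbering from drifting.
        messages.append({"role": "assistant", "content": reply})
        messages.append({
            "role": "user",
            "content": f"Continue from item {len(items) + 1}. "
                       "Do not repeat earlier items.",
        })
    return items[:expected_count]
```

Even this is brittle: as the GPT-4o test shows, models often re-number or repeat items when asked to continue, so restating the next expected number in the prompt is doing real work here.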
Task 2: Identify all the artworks (out of the 250) that contain a rose or a flower.
Again, the results are incorrect: a mix of hallucinations and some true positives.
The Takeaway:
Long context is still a work in progress for many use cases that go beyond finding a single needle in the haystack. We need better metrics, evaluations, and benchmarks to assess the quality of long-context results. RULER is a good attempt in this direction, but we have to think beyond the needle, or even multiple needles, in the haystack: how can models construct accurate, long, coherent, high-quality answers? This is not an easy problem and will require significant engineering effort across the board. In the meantime, RAG or a divide-and-conquer approach is probably a better bet than a single large-context query.
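To make the divide-and-conquer alternative concrete, here is a minimal sketch: split the text into overlapping chunks, ask the same question of each chunk, and merge the per-chunk answers. The chunk size, overlap, prompts, and model name are assumptions for illustration, not a tested recipe.

```python
# Sketch of divide-and-conquer over a long document instead of one
# giant-context query. Chunking parameters and prompts are assumptions.
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 8000, overlap: int = 500) -> list[str]:
    """Split text into overlapping windows so items on a boundary survive."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def find_flower_artworks(document_text: str) -> list[str]:
    found: list[str] = []
    for piece in chunk(document_text):
        reply = client.chat.completions.create(
            model="gpt-4o",  # any chat model; each call sees one small chunk
            messages=[
                {"role": "system",
                 "content": "From the excerpt, list artworks whose "
                            "description mentions a rose or flower, one per "
                            "line. If none, reply NONE."},
                {"role": "user", "content": piece},
            ],
        ).choices[0].message.content or ""
        if reply.strip().upper() != "NONE":
            found.extend(ln.strip() for ln in reply.splitlines() if ln.strip())
    # Deduplicate while preserving order; overlapping chunks can repeat hits.
    return list(dict.fromkeys(found))
```

Each call stays comfortably within the output cap, and a hallucination or a miss is confined to a single chunk rather than corrupting one monolithic answer.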