Long Context Transfer from Language to Vision
Today's paper presents a way to enable large multimodal models (LMMs) to understand extremely long videos by leveraging the context length capabilities of the underlying language model, an approach the authors call "long context transfer" from text to vision.
Method Overview
The method starts by extending the context length of the language model backbone by continuing its pre-training on long text sequences. This extended language model is then used as the basis for the multimodal model, which is trained on image-text data using a unified encoding scheme called UniRes. UniRes represents both images and videos as a sequence of visual tokens, allowing the model to process videos as extended images during inference. UniRes is depicted below:
The training process involves two main steps. First, the language model is continually pre-trained on long text sequences to increase its context length. This relies on techniques such as raising the base frequency of the rotary position embedding (RoPE), together with optimizations like FlashAttention and Ring Attention, to enable training on sequences of up to 224K tokens. In the second step, the extended language model is aligned with the vision modality using only short image-text data. The UniRes encoding scheme divides images into grids, encodes each grid with a vision transformer, and projects the features into the language model's input dimension.
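To make the first step concrete, here is a minimal sketch of the RoPE base-frequency idea, assuming a standard rotary embedding formulation; the head dimension, the enlarged base of 1,000,000, and the sampled positions are illustrative assumptions rather than values taken from the paper.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-dimension rotation frequencies used by rotary position embeddings (RoPE)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

# Raising the base slows the rotation of every dimension, so positions far
# beyond the original training length still map to well-separated angles.
# The concrete base used for the 224K-token continued pre-training is a
# hyperparameter of the paper; the values below are illustrative only.
short_ctx = rope_frequencies(head_dim=128, base=10_000.0)     # original backbone
long_ctx  = rope_frequencies(head_dim=128, base=1_000_000.0)  # hypothetical enlarged base

positions = torch.arange(0, 224_001, 56_000, dtype=torch.float32)
print(positions * short_ctx[-1])  # slowest-dimension rotation angle with the original base
print(positions * long_ctx[-1])   # the same angle grows far more slowly with the larger base
```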
During inference, videos are treated as extended images with each frame as a grid, allowing the model to process thousands of frames by leveraging the extended context length of the language model.
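A rough sketch of what such a grid-based unified encoder could look like is given below, assuming a generic vision transformer plus a linear projector; the class name UniResStyleEncoder, the toy ViT stand-in, and all dimensions are hypothetical and are not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class UniResStyleEncoder(nn.Module):
    """Conceptual sketch of grid-based unified encoding (not the authors' code).

    Each grid -- an image tile during training, a whole video frame at
    inference -- is encoded independently and projected into the language
    model's embedding space; the per-grid tokens are then concatenated into
    one long sequence that the extended-context LM consumes like text tokens.
    """

    def __init__(self, vision_encoder: nn.Module, vit_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder         # assumed to return per-grid patch features
        self.projector = nn.Linear(vit_dim, lm_dim)  # maps visual features to the LM input dimension

    def forward(self, grids: torch.Tensor) -> torch.Tensor:
        # grids: (num_grids, 3, H, W) -- image tiles or video frames
        feats = self.vision_encoder(grids)           # (num_grids, tokens_per_grid, vit_dim)
        tokens = self.projector(feats)               # (num_grids, tokens_per_grid, lm_dim)
        return tokens.flatten(0, 1)                  # one long visual-token sequence

# Tiny stand-in for a ViT so the sketch runs end to end (all dimensions are toy values).
toy_vit = nn.Sequential(
    nn.Flatten(),                      # (N, 3*64*64)
    nn.Linear(3 * 64 * 64, 16 * 256),  # 16 "patch tokens" of width 256 per grid
    nn.Unflatten(1, (16, 256)),
)
encoder = UniResStyleEncoder(toy_vit, vit_dim=256, lm_dim=512)

video_frames = torch.randn(8, 3, 64, 64)   # 8 frames, each treated as one grid
visual_tokens = encoder(video_frames)
print(visual_tokens.shape)                 # torch.Size([128, 512]); grows linearly with frame count
```

Because the visual token count grows linearly with the number of frames, the practical limit on video length is set almost entirely by the language model's context window, which is exactly what the long context transfer step extends.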
Results
The authors introduce a new benchmark called V-NIAH (Visual Needle-In-A-Haystack) to evaluate the visual context length of LMMs. Their model, called LongVA, achieves near-perfect performance on V-NIAH for up to 2000 frames, demonstrating its ability to retrieve visual information from extremely long contexts.
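The benchmark's construction can be sketched in a few lines: a single "needle" frame, paired with a question only that frame can answer, is spliced into a long haystack of video frames at varying depths and context lengths. The helper names (build_vniah_sample, ask_model) and the frame/depth grids below are illustrative assumptions, not the paper's exact protocol.

```python
def build_vniah_sample(haystack_frames, needle_frame, depth):
    """Insert a single 'needle' frame at a relative depth inside a long frame list.

    depth is a float in [0, 1]: 0 places the needle at the start of the video,
    1 at the very end.
    """
    position = int(round(depth * len(haystack_frames)))
    return haystack_frames[:position] + [needle_frame] + haystack_frames[position:]

def evaluate_vniah(ask_model, haystack_frames, needle_frame, question, answer):
    """Sweep context length and needle depth, recording whether the model answers correctly.

    `ask_model(frames, question) -> str` is a placeholder for querying the LMM;
    the frame counts and depth grid below are illustrative settings.
    """
    results = {}
    for num_frames in (250, 500, 1000, 2000):
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
            frames = build_vniah_sample(haystack_frames[:num_frames], needle_frame, depth)
            prediction = ask_model(frames, question)
            results[(num_frames, depth)] = answer.lower() in prediction.lower()
    return results
```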
On the Video-MME benchmark, LongVA achieves state-of-the-art performance among 7B-scale models, with its performance improving as more video frames are sampled during inference. This highlights the effectiveness of the long context transfer approach.
Conclusion
The paper presents a simple yet effective method to enable LMMs to understand long videos by transferring the context length capabilities of the underlying language model to the vision modality. The proposed LongVA model demonstrates impressive performance on benchmarks involving extremely long videos, paving the way for more capable multimodal assistants. For more information, please consult the full paper.
Congrats to the authors on their work!
Zhang, Peiyuan, et al. "Long Context Transfer from Language to Vision." arXiv preprint arXiv:2406.16852, 2024.