Long Context Transfer from Language to Vision
Credit: https://arxiv.org/pdf/2406.16852

Today's paper presents a method to enable large multimodal models (LMMs) to understand extremely long videos by leveraging the context length capabilities of the underlying language model. This is achieved through a method called "long context transfer" from text to vision.

Method Overview

The method starts by extending the context length of the language model backbone through continued pre-training on long text sequences. This extended language model then serves as the basis for the multimodal model, which is trained on image-text data using a unified encoding scheme called UniRes. UniRes represents both images and videos as sequences of visual tokens, allowing the model to process videos as extended images at inference time.
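To make the unified representation concrete, here is a minimal sketch of how a UniRes-style encoder might tokenize an image (as a set of grids) and a video (each frame as a single grid). The class name, dimensions, and interface are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class UniResEncoderSketch(nn.Module):
    """Illustrative sketch of a UniRes-style unified visual encoder.

    Assumptions (not from the paper's code): `vit` maps a batch of crops to
    patch features of shape (batch, num_patches, vit_dim), and a linear
    projector maps those features into the language model's hidden size.
    """

    def __init__(self, vit: nn.Module, vit_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.vit = vit
        self.projector = nn.Linear(vit_dim, lm_dim)

    def encode_image(self, grids: torch.Tensor) -> torch.Tensor:
        # grids: (num_grids, 3, H, W) -- one high-resolution image split into grids.
        feats = self.vit(grids)            # (num_grids, num_patches, vit_dim)
        tokens = self.projector(feats)     # (num_grids, num_patches, lm_dim)
        return tokens.flatten(0, 1)        # (num_grids * num_patches, lm_dim)

    def encode_video(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W) -- each frame is treated as a single grid,
        # so a video becomes a longer sequence of the same kind of visual tokens.
        return self.encode_image(frames)
```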

The training process involves two main steps. First, the language model is continually pre-trained on long text sequences to increase its context length; increasing the base frequency of the rotary positional encoding (RoPE), together with attention optimizations such as FlashAttention and Ring Attention, enables training on sequences of up to 224K tokens. Second, this extended language model is aligned with the vision modality using only short image-text data. The UniRes encoding scheme divides each image into grids, encodes each grid with a vision transformer, and projects the resulting features into the language model's input dimension.
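The context extension hinges on increasing the RoPE base so that rotations progress more slowly and distant positions remain well separated. Below is a minimal, self-contained sketch of how the rotation angles depend on that base; the base values and head dimension are illustrative assumptions, not the paper's settings.

```python
import torch


def rope_angles(seq_len: int, head_dim: int, base: float) -> torch.Tensor:
    """Rotation angles used by rotary position embeddings (RoPE).

    Returns a (seq_len, head_dim // 2) tensor of position * inverse frequency.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)


# Illustrative comparison (base values assumed, not taken from the paper):
# with a larger base, the slowest-rotating dimension completes far fewer
# rotations across a 224K-token window, so distant positions stay
# distinguishable instead of aliasing within the training range.
angles_small = rope_angles(224_000, head_dim=128, base=10_000.0)
angles_large = rope_angles(224_000, head_dim=128, base=1_000_000.0)
print(angles_small[-1, -1].item())  # ~26 radians at the last position
print(angles_large[-1, -1].item())  # well under one full rotation at the same position
```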

During inference, videos are treated as extended images with each frame as a grid, allowing the model to process thousands of frames by leveraging the extended context length of the language model.
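A rough token-budget calculation illustrates why the extended context matters for frame count. The tokens-per-frame cost and reserved prompt length below are illustrative assumptions (the exact numbers depend on the vision encoder and encoding scheme), but the arithmetic shows how a ~224K-token context translates into a budget on the order of a couple of thousand frames.

```python
def max_frames(context_length: int, tokens_per_frame: int, reserved_text_tokens: int = 1024) -> int:
    """Frames that fit in the LM context, assuming a fixed visual-token cost per frame.

    `tokens_per_frame` and `reserved_text_tokens` are illustrative assumptions.
    """
    return (context_length - reserved_text_tokens) // tokens_per_frame


# Example: a 224K-token context with an assumed ~100 visual tokens per frame.
print(max_frames(context_length=224_000, tokens_per_frame=100))  # -> 2229
```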

Results

The authors introduce a new benchmark called V-NIAH (Visual Needle-In-A-Haystack) to evaluate the visual context length of LMMs. Their model, called LongVA, achieves near-perfect performance on V-NIAH for up to 2000 frames, demonstrating its ability to retrieve visual information from extremely long contexts.
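Conceptually, a needle-in-a-haystack evaluation for video inserts a "needle" frame that answers a question into a long distractor video and checks whether the model still answers correctly as the haystack grows and the insertion depth varies. The sketch below captures only that loop shape; the `model_answer` interface and helper names are hypothetical placeholders, not the actual V-NIAH harness.

```python
import random
from typing import Callable, List, Sequence


def insert_needle(haystack_frames: List, needle_frame, position: int) -> List:
    """Return a copy of the distractor frames with the needle placed at `position`."""
    frames = list(haystack_frames)
    frames.insert(position, needle_frame)
    return frames


def needle_retrieval_accuracy(
    model_answer: Callable[[Sequence, str], str],  # hypothetical: (frames, question) -> answer text
    haystack_frames: List,
    needle_frame,
    question: str,
    expected_answer: str,
    num_positions: int = 5,
) -> float:
    """Accuracy on the needle question across several random insertion depths."""
    correct = 0
    for _ in range(num_positions):
        position = random.randint(0, len(haystack_frames))
        frames = insert_needle(haystack_frames, needle_frame, position)
        prediction = model_answer(frames, question)
        correct += int(expected_answer.lower() in prediction.lower())
    return correct / num_positions
```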

On the Video-MME benchmark, LongVA achieves state-of-the-art performance among 7B-scale models, with its performance improving as more video frames are sampled during inference. This highlights the effectiveness of the long context transfer approach.

Conclusion

The paper presents a simple yet effective method to enable LMMs to understand long videos by transferring the context length capabilities of the underlying language model to the vision modality. The proposed LongVA model demonstrates impressive performance on benchmarks involving extremely long videos, paving the way for more capable multimodal assistants. For more information, please consult the full paper.

Congrats to the authors for their work!

Zhang, Peiyuan, et al. "Long Context Transfer from Language to Vision." arXiv preprint arXiv:2406.16852, 2024.
