Meta Research announces MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

With the success of large language models (LLMs), combining vision models with LLMs to build vision-language foundation models has gained significant interest recently. This combination is paving the way for understanding both textual and visual data in a more integrated and comprehensive manner.

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Meta researchers, in partnership with the University of Maryland, College Park and the University of Central Florida, released a paper that proposes processing videos in an online manner and storing past video information in a memory bank.

This allows their proposed model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits.

MA-LMM Researchers:

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

  • Meta
  • University of Maryland, College Park
  • University of Central Florida

Paper: https://arxiv.org/abs/2404.05726

PDF: https://arxiv.org/pdf/2404.05726.pdf

GitHub: https://github.com/boheumd/MA-LMM

Project: https://boheumd.github.io/MA-LMM/

Hugging Face: https://huggingface.co/papers/2404.05726

"Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner." - MA-LMM Researchers

Figure (from the paper): (a) Unlike previous methods that directly feed the visual encoder's outputs into the querying transformer, the proposed long-term memory bank autoregressively stores and accumulates past video information. (b) GPU memory consumption vs. video frame length of existing multimodal methods and MA-LMM during inference; circle sizes represent the number of text tokens.
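
To make the trend in panel (b) concrete, here is a minimal back-of-the-envelope sketch. The per-frame token count and memory bank size below are illustrative assumptions, not values reported in the paper; the point is only that feeding every frame's tokens to the LLM grows linearly with video length, while a fixed-size memory bank keeps the count bounded.

    # Illustrative only: assumed numbers, not values from the MA-LMM paper.
    TOKENS_PER_FRAME = 32    # assumed visual tokens contributed by each frame
    MEMORY_BANK_SIZE = 20    # assumed fixed number of frame entries kept in the bank

    for num_frames in (16, 100, 1000):
        feed_all_frames = num_frames * TOKENS_PER_FRAME          # grows with video length
        with_memory_bank = MEMORY_BANK_SIZE * TOKENS_PER_FRAME   # bounded by the bank size
        print(f"{num_frames:>4} frames -> {feed_all_frames:>6} visual tokens "
              f"(all frames) vs {with_memory_bank} (memory bank)")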

Existing LLM-based large multimodal models, such as Video-LLaMA and VideoChat, can only process a limited number of frames and are therefore restricted to short video understanding. The research team's approach differs: it processes videos in an online manner and stores past video information in a memory bank, allowing the model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits.

"As opposed to directly feeding visual encoder outputs to the querying transformer, we opt for an online processing approach that takes video frames sequentially and stores the video features in the proposed long-term memory bank. This strategy of sequentially processing video frames and leveraging a memory bank significantly reduces the GPU memory footprint for long video sequences. It also effectively addresses the constraints posed by the limited context length in LLMs. Our design provides a solution for long-term video understanding with large multimodal models with great advantages over prior approaches which consume huge GPU memory and require a large number of input text tokens." - MA-LMM Researchers

Their memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. They conducted extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning. Their model achieved state-of-the-art performance across multiple datasets.

This research represents a significant step forward in the field of video understanding and has potential applications in various domains.

MA-LMM Study Summary:

  • The researchers introduced a Memory-Augmented Large Multimodal Model (MA-LMM) for efficient and effective long-term video modeling.
  • MA-LMM adopts a structure similar to existing large multimodal models, comprising a visual encoder to extract visual features, a querying transformer to align the visual and text embedding spaces, and a large language model (a minimal sketch of how these pieces connect follows this list).
  • MA-LMM can process videos in an online manner and store past video information in a memory bank.
  • This allows the model to reference historical video content for long-term analysis without exceeding LLMs’ context length constraints or GPU memory limits.
  • Their approach achieves new state-of-the-art performance on various downstream video tasks, including long-term video understanding, video question answering, and video captioning.
  • The memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner.
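
As noted in the architecture bullet above, here is a minimal sketch of how the three stages connect. It assumes the memory bank has already been filled during online processing (as in the earlier sketch); the function signature, module names, and the way the LLM consumes embeddings are illustrative placeholders, not the released MA-LMM code.

    import torch
    import torch.nn as nn


    def answer_from_memory(memory_bank_features: torch.Tensor,  # visual-encoder features accumulated online
                           q_former: nn.Module,                 # querying transformer
                           llm_proj: nn.Module,                 # projection into the LLM embedding space
                           llm: nn.Module,                      # large language model
                           prompt_embeds: torch.Tensor) -> torch.Tensor:
        """Sketch of the final stage: the querying transformer attends over the
        accumulated memory bank, its output is projected into the LLM's embedding
        space, and the resulting visual tokens are prepended to the text prompt."""
        query_tokens = q_former(memory_bank_features)   # align visual features with the text space
        visual_embeds = llm_proj(query_tokens)          # map to the LLM's hidden size
        llm_inputs = torch.cat([visual_embeds, prompt_embeds], dim=0)
        return llm(llm_inputs)                          # LLM consumes visual + text embeddings

Because the bank sits between the visual encoder and the querying transformer in this arrangement, it can in principle be dropped into existing multimodal LLMs, which matches the off-the-shelf integration claim above.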

Experiments:

  • The research team conducted extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning.
  • Their model achieves state-of-the-art performance across multiple datasets.


Conclusion:

In this paper, the researchers introduce a long-term memory bank designed to augment current large multimodal models, equipping them with the capabilities to effectively and efficiently model long-term video sequences.

Their approach processes video frames sequentially and stores historical data in the memory bank, addressing the context length limitation of LLMs and the GPU memory constraints posed by long video inputs.

Their long-term memory bank is a plug-and-play module that can be easily integrated into existing large multimodal models in an off-the-shelf manner. Experiments on various tasks have demonstrated the superior advantages of their method.

They believe that MA-LMM offers valuable insights for future research in the long-term video understanding area.

#ai #genai #meta #aivideo
