Meta Research announces MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
David Cronshaw
With the success of large language models (LLMs), combining vision models with LLMs to build vision-language foundation models has attracted considerable interest. This combination is paving the way for understanding textual and visual data in a more integrated and comprehensive manner.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Meta researchers, in partnership with the University of Maryland, College Park and the University of Central Florida, released a paper that proposes processing videos in an online manner and storing past video information in a memory bank.
This allows their proposed model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits.
MA-LMM Researchers:
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim
Project: https://boheumd.github.io/MA-LMM/
Hugging Face: https://huggingface.co/papers/2404.05726
"Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner." - MA-LMM Researchers
Existing LLM-based large multimodal models, such as Video-LLaMA and VideoChat, can only process a limited number of frames and are therefore restricted to short video understanding. The team's approach differs: by processing frames online and accumulating their features in the memory bank, the model can draw on the full video history for long-term analysis without exceeding the LLM's context length or GPU memory limits.
"As opposed to directly feeding visual encoder outputs to the querying transformer, we opt for an online processing approach that takes video frames sequentially and stores the video features in the proposed long-term memory bank. This strategy of sequentially processing video frames and leveraging a memory bank significantly reduces the GPU memory footprint for long video sequences. It also effectively addresses the constraints posed by the limited context length in LLMs. Our design provides a solution for long-term video understanding with large multimodal models with great advantages over prior approaches which consume huge GPU memory and require a large number of input text tokens." - MA-LMM Researchers
Their memory bank can be integrated into current multimodal LLMs in an off-the-shelf manner. The team ran extensive experiments on a range of video understanding tasks, including long-video understanding, video question answering, and video captioning, and the model achieved state-of-the-art performance across multiple datasets.
This research represents a significant step forward in the field of video understanding and has potential applications in various domains.
Conclusion:
In the paper, the researchers introduce a long-term memory bank designed to augment current large multimodal models, equipping them to model long-term video sequences effectively and efficiently.
Their approach processes video frames sequentially and stores historical features in the memory bank, addressing both the LLM's context length limitation and the GPU memory constraints posed by long video inputs.
The long-term memory bank is a plug-and-play module that can be integrated into existing large multimodal models in an off-the-shelf manner, and experiments on various tasks demonstrate the advantages of the method.
They believe MA-LMM offers valuable insights for future research on long-term video understanding.
#ai #genai #meta #aivideo