Meta Research announces MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

With the success of large language models (LLMs), combining vision models with LLMs to build vision-language foundation models has gained significant interest recently. This combination is paving the way for understanding both textual and visual data in a more integrated and comprehensive manner.

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Meta researchers, in partnership with the University of Maryland, College Park and the University of Central Florida, released a paper that proposes processing videos in an online manner and storing past video information in a memory bank.

This allows their proposed model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits.

MA-LMM Researchers:

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

  • Meta
  • University of Maryland, College Park
  • University of Central Florida

Paper: https://arxiv.org/abs/2404.05726

PDF: https://arxiv.org/pdf/2404.05726.pdf

GitHub: https://github.com/boheumd/MA-LMM

Project: https://boheumd.github.io/MA-LMM/

Hugging Face: https://huggingface.co/papers/2404.05726

"Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner." - MA-LMM Researchers

Figure (from the paper): (a) Unlike previous methods that directly feed the visual encoder's outputs into the querying transformer, the proposed long-term memory bank autoregressively stores and accumulates past video information. (b) GPU memory consumption vs. video frame length of existing multimodal methods and MA-LMM during inference; circle sizes represent the number of text tokens.
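
To make the trend in panel (b) concrete, here is a minimal back-of-the-envelope sketch. The per-frame token count and memory bank size below are illustrative assumptions, not values reported in the paper; the point is only that feeding every frame's tokens to the LLM grows linearly with video length, while a fixed-size memory bank keeps the count bounded.

    # Illustrative only: assumed numbers, not values from the MA-LMM paper.
    TOKENS_PER_FRAME = 32    # assumed visual tokens contributed by each frame
    MEMORY_BANK_SIZE = 20    # assumed fixed number of frame entries kept in the bank

    for num_frames in (16, 100, 1000):
        feed_all_frames = num_frames * TOKENS_PER_FRAME          # grows with video length
        with_memory_bank = MEMORY_BANK_SIZE * TOKENS_PER_FRAME   # bounded by the bank size
        print(f"{num_frames:>4} frames -> {feed_all_frames:>6} visual tokens "
              f"(all frames) vs {with_memory_bank} (memory bank)")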

Existing LLM-based large multimodal models, such as Video-LLaMA and VideoChat, can only process a limited number of frames and are therefore restricted to short video understanding. The research team's approach differs: it processes videos in an online manner and stores past video information in a memory bank, allowing the model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits.

"As opposed to directly feeding visual encoder outputs to the querying transformer, we opt for an online processing approach that takes video frames sequentially and stores the video features in the proposed long-term memory bank. This strategy of sequentially processing video frames and leveraging a memory bank significantly reduces the GPU memory footprint for long video sequences. It also effectively addresses the constraints posed by the limited context length in LLMs. Our design provides a solution for long-term video understanding with large multimodal models with great advantages over prior approaches which consume huge GPU memory and require a large number of input text tokens." - MA-LMM Researchers

Their memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. They conducted extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning. Their model achieved state-of-the-art performance across multiple datasets.

This research represents a significant step forward in the field of video understanding and has potential applications in various domains.

MA-LMM Study Summary:

  • The researchers introduced a Memory-Augmented Large Multimodal Model (MA-LMM) for efficient and effective long-term video modeling.
  • MA-LMM adopts a structure similar to existing large multimodal models, comprising a visual encoder to extract visual features, a querying transformer to align the visual and text embedding spaces, and a large language model (a minimal sketch of how these pieces connect follows this list).
  • MA-LMM can process videos in an online manner and store past video information in a memory bank.
  • This allows the model to reference historical video content for long-term analysis without exceeding LLMs’ context length constraints or GPU memory limits.
  • Their approach achieves new state-of-the-art performance on various downstream video tasks, including long-term video understanding, video question answering, and video captioning.
  • The memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner.
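
As noted in the architecture bullet above, here is a minimal sketch of how the three stages connect. It assumes the memory bank has already been filled during online processing (as in the earlier sketch); the function signature, module names, and the way the LLM consumes embeddings are illustrative placeholders, not the released MA-LMM code.

    import torch
    import torch.nn as nn


    def answer_from_memory(memory_bank_features: torch.Tensor,  # visual-encoder features accumulated online
                           q_former: nn.Module,                 # querying transformer
                           llm_proj: nn.Module,                 # projection into the LLM embedding space
                           llm: nn.Module,                      # large language model
                           prompt_embeds: torch.Tensor) -> torch.Tensor:
        """Sketch of the final stage: the querying transformer attends over the
        accumulated memory bank, its output is projected into the LLM's embedding
        space, and the resulting visual tokens are prepended to the text prompt."""
        query_tokens = q_former(memory_bank_features)   # align visual features with the text space
        visual_embeds = llm_proj(query_tokens)          # map to the LLM's hidden size
        llm_inputs = torch.cat([visual_embeds, prompt_embeds], dim=0)
        return llm(llm_inputs)                          # LLM consumes visual + text embeddings

Because the bank sits between the visual encoder and the querying transformer in this arrangement, it can in principle be dropped into existing multimodal LLMs, which matches the off-the-shelf integration claim above.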

Experiments:

  • The research team conducted extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning.
  • Their model achieves state-of-the-art performance across multiple datasets.


Conclusion:

In this paper, the researchers introduce a long-term memory bank designed to augment current large multimodal models, equipping them with the capabilities to effectively and efficiently model long-term video sequences.

Their approach processes video frames sequentially and stores historical data in the memory bank, addressing the context length limitation of LLMs and the GPU memory constraints posed by long video inputs.

Their long-term memory bank is a plug-and-play module that can be easily integrated into existing large multimodal models in an off-the-shelf manner. Experiments on various tasks have demonstrated the superior advantages of their method.

They believe that MA-LMM offers valuable insights for future research in the long-term video understanding area.

#ai #genai #meta #aivideo
