Movie Gen: A Cast of Media Foundation Models
Credit: https://ai.meta.com/static-resource/movie-gen-research-paper

Today's paper introduces Movie Gen, a set of foundation models for generating high-quality videos with synchronized audio. The models can create 1080p HD videos up to 16 seconds long with different aspect ratios, and enable capabilities like video personalization and precise editing. Movie Gen sets a new state-of-the-art on multiple media generation tasks including text-to-video synthesis, video editing, and audio generation.

Method Overview

Movie Gen consists of two main foundation models - Movie Gen Video for generating images and videos, and Movie Gen Audio for generating synchronized audio.

The Movie Gen Video model uses a Transformer architecture trained on a large dataset of image-text and video-text pairs. It generates videos in a compressed latent space learned by a Temporal Autoencoder (TAE), which allows efficient processing of long videos. The model is trained using a technique called Flow Matching, which teaches it to gradually transform random noise into the target video.
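The Flow Matching objective described above can be sketched in a few lines: sample a random time, linearly interpolate between Gaussian noise and the clean TAE latents, and regress the model's output onto the ground-truth velocity. This is a minimal illustration, not the paper's implementation; `model` is a hypothetical velocity-predicting Transformer and the tensor shapes are assumptions.

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """One Flow Matching training step (sketch).

    x1: clean video latents from the TAE, assumed shape (B, C, T, H, W).
    model: hypothetical network predicting velocity from (x_t, t, text_emb).
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)             # pure Gaussian noise sample
    t = torch.rand(b, device=x1.device)   # random time in [0, 1] per example
    t_ = t.view(b, 1, 1, 1, 1)
    xt = t_ * x1 + (1.0 - t_) * x0        # linear interpolation path
    target_v = x1 - x0                    # ground-truth velocity along the path
    pred_v = model(xt, t, text_emb)       # model's velocity prediction
    return torch.nn.functional.mse_loss(pred_v, target_v)
```

Training simply minimizes this loss over batches of latents; at inference time, integrating the learned velocity field carries noise to video latents.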

The training process involves multiple stages, starting with text-to-image generation on low resolution images, then jointly training on images and videos at progressively higher resolutions. The model is further fine-tuned on a curated set of high-quality videos to improve motion and aesthetics.

For inference, the model uses an efficient sampler with a custom timestep schedule that produces high-quality videos in as few as 50 steps. It also employs a text rewriting model that expands short user prompts into more detailed descriptions matching the training data distribution.
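Sampling with a learned flow reduces to integrating an ODE from noise (t = 0) to data (t = 1). The sketch below uses a simple Euler integrator with a quadratic spacing of timesteps as an illustrative stand-in for the paper's custom schedule; `model` is the same hypothetical velocity predictor as during training.

```python
import torch

def euler_sample(model, shape, text_emb, steps=50):
    """Generate latents by Euler integration of the learned flow ODE (sketch).

    A non-uniform (here quadratic) spacing of times is used as a stand-in for
    the paper's custom schedule; it concentrates steps near t = 0, where most
    of the video's structure emerges from noise.
    """
    ts = torch.linspace(0.0, 1.0, steps + 1) ** 2  # illustrative schedule
    x = torch.randn(shape)                         # start from pure noise
    for i in range(steps):
        t = ts[i].expand(shape[0])                 # current time, per example
        dt = (ts[i + 1] - ts[i]).item()            # non-uniform step size
        v = model(x, t, text_emb)                  # predicted velocity dx/dt
        x = x + dt * v                             # Euler step toward data
    return x                                       # latents, decoded by the TAE
```

The returned latents would then be decoded to pixels by the Temporal Autoencoder; the step count and schedule shape trade off quality against inference cost.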

Additional capabilities like video personalization and editing are added through specialized post-training procedures. The personalization allows generating videos featuring a specific person based on a reference image. The editing capability enables precise modifications to videos based on text instructions.

The Movie Gen Audio model generates synchronized audio for videos, including sound effects and background music. It is trained on a large audio dataset and can produce high-quality 48kHz audio matching the visual content and mood of the video.

Results

Movie Gen outperforms prior state-of-the-art methods, including commercial systems, on text-to-video generation quality as judged by human evaluators. It also achieves superior performance on video personalization, editing, and audio generation tasks. The models can generate high-resolution 1080p videos up to 16 seconds long at 16 frames per second, with synchronized audio.

Conclusion

This paper introduces a powerful set of foundation models for high-quality video and audio generation, enabling new capabilities in media synthesis. The Movie Gen models advance the state-of-the-art across multiple tasks and demonstrate the potential of scaling up training data, model size, and compute for media generation. For more information please consult the full paper.

Congrats to the authors for their work!

The Movie Gen team @ Meta. "Movie Gen: A Cast of Media Foundation Models." arXiv, 4 Oct. 2024.
