Movie Gen: A Cast of Media Foundation Models
Credit: https://ai.meta.com/static-resource/movie-gen-research-paper

Today's paper introduces Movie Gen, a set of foundation models for generating high-quality videos with synchronized audio. The models can create 1080p HD videos up to 16 seconds long with different aspect ratios, and enable capabilities like video personalization and precise editing. Movie Gen sets a new state-of-the-art on multiple media generation tasks including text-to-video synthesis, video editing, and audio generation.

Method Overview

Movie Gen consists of two main foundation models - Movie Gen Video for generating images and videos, and Movie Gen Audio for generating synchronized audio.

The Movie Gen Video model uses a Transformer architecture trained on a large dataset of image-text and video-text pairs. It generates videos in a compressed latent space learned by a Temporal Autoencoder (TAE), which allows efficient processing of long videos. The model is trained using a technique called Flow Matching, which teaches it to gradually transform random noise into the target video.
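The Flow Matching objective described above can be sketched in a few lines: sample a random time, linearly interpolate between Gaussian noise and the clean TAE latents, and regress the model's output onto the ground-truth velocity. This is a minimal illustration, not the paper's implementation; `model` is a hypothetical velocity-predicting Transformer and the tensor shapes are assumptions.

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """One Flow Matching training step (sketch).

    x1: clean video latents from the TAE, assumed shape (B, C, T, H, W).
    model: hypothetical network predicting velocity from (x_t, t, text_emb).
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)             # pure Gaussian noise sample
    t = torch.rand(b, device=x1.device)   # random time in [0, 1] per example
    t_ = t.view(b, 1, 1, 1, 1)
    xt = t_ * x1 + (1.0 - t_) * x0        # linear interpolation path
    target_v = x1 - x0                    # ground-truth velocity along the path
    pred_v = model(xt, t, text_emb)       # model's velocity prediction
    return torch.nn.functional.mse_loss(pred_v, target_v)
```

Training simply minimizes this loss over batches of latents; at inference time, integrating the learned velocity field carries noise to video latents.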

The training process involves multiple stages, starting with text-to-image generation on low resolution images, then jointly training on images and videos at progressively higher resolutions. The model is further fine-tuned on a curated set of high-quality videos to improve motion and aesthetics.

For inference, the model uses an efficient sampler with a custom timestep schedule that produces high-quality videos in as few as 50 steps. It also employs a text rewriting model that expands short user prompts into more detailed descriptions matching the training data distribution.
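Sampling with a learned flow reduces to integrating an ODE from noise (t = 0) to data (t = 1). The sketch below uses a simple Euler integrator with a quadratic spacing of timesteps as an illustrative stand-in for the paper's custom schedule; `model` is the same hypothetical velocity predictor as during training.

```python
import torch

def euler_sample(model, shape, text_emb, steps=50):
    """Generate latents by Euler integration of the learned flow ODE (sketch).

    A non-uniform (here quadratic) spacing of times is used as a stand-in for
    the paper's custom schedule; it concentrates steps near t = 0, where most
    of the video's structure emerges from noise.
    """
    ts = torch.linspace(0.0, 1.0, steps + 1) ** 2  # illustrative schedule
    x = torch.randn(shape)                         # start from pure noise
    for i in range(steps):
        t = ts[i].expand(shape[0])                 # current time, per example
        dt = (ts[i + 1] - ts[i]).item()            # non-uniform step size
        v = model(x, t, text_emb)                  # predicted velocity dx/dt
        x = x + dt * v                             # Euler step toward data
    return x                                       # latents, decoded by the TAE
```

The returned latents would then be decoded to pixels by the Temporal Autoencoder; the step count and schedule shape trade off quality against inference cost.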

Additional capabilities like video personalization and editing are added through specialized post-training procedures. The personalization allows generating videos featuring a specific person based on a reference image. The editing capability enables precise modifications to videos based on text instructions.

The Movie Gen Audio model generates synchronized audio for videos, including sound effects and background music. It is trained on a large audio dataset and can produce high-quality 48kHz audio matching the visual content and mood of the video.

Results

Movie Gen outperforms prior state-of-the-art methods, including commercial systems, on text-to-video generation quality as judged by human evaluators. It also achieves superior performance on video personalization, editing, and audio generation tasks. The models can generate high-resolution 1080p videos up to 16 seconds long at 16 frames per second, with synchronized audio.

Conclusion

This paper introduces a powerful set of foundation models for high-quality video and audio generation, enabling new capabilities in media synthesis. The Movie Gen models advance the state-of-the-art across multiple tasks and demonstrate the potential of scaling up training data, model size, and compute for media generation. For more information please consult the full paper.

Congrats to the authors for their work!

The Movie Gen team @ Meta. "Movie Gen: A Cast of Media Foundation Models." arXiv, 4 Oct. 2024.
