Introducing Sora: OpenAI's Groundbreaking Tool Transforms Text into Instant Video Magic
Yaqi Zhang
Founder, Researcher, Author, Strategic Content Development, Member of IAMCR, Speaker at UPenn/PKU/King Saud University/HKU, Global Communication, Super-Connector, Former Investor, Born to be Global
OpenAI's work is inspired by large language models (LLMs), which achieve broad capabilities by training on internet-scale data, using tokens that unify diverse modalities of text. They explore extending this approach to generative models of visual data, introducing Sora, which operates on visual patches rather than the text tokens used by LLMs. Patches, previously shown to be effective representations in visual models, prove highly scalable and efficient for training generative models on diverse videos and images.
Sora is a diffusion model: given noisy input patches (and conditioning information such as text prompts), it is trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer. Transformers have shown remarkable scaling properties across domains such as language modeling, computer vision, and image generation, giving Sora efficient and effective generative capabilities for visual data.
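To make that training objective concrete, the sketch below shows a single denoising step on patch tokens in PyTorch. The toy PatchDenoiser model, its dimensions, and the simple linear noising schedule are illustrative assumptions, not details of Sora's actual architecture.

```python
import torch
import torch.nn as nn

class PatchDenoiser(nn.Module):
    """Toy transformer that maps noisy patch tokens (plus a timestep
    signal) back to an estimate of the clean patch tokens."""
    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.time_embed = nn.Linear(1, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_patches, t):
        # Broadcast a timestep embedding onto every token.
        h = noisy_patches + self.time_embed(t[:, None, None])
        return self.out(self.backbone(h))

model = PatchDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

clean = torch.randn(2, 256, 512)   # (batch, patch tokens, channels): stand-in latents
t = torch.rand(2)                  # diffusion timesteps in [0, 1]
noise = torch.randn_like(clean)
# Simple linear noising schedule, for illustration only.
noisy = (1 - t[:, None, None]) * clean + t[:, None, None] * noise

pred = model(noisy, t)                       # predict the original "clean" patches
loss = nn.functional.mse_loss(pred, clean)
loss.backward()
opt.step()
```

In a real system the conditioning signal (the text prompt) would also be fed to the transformer; it is omitted here to keep the sketch minimal.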
In their research, OpenAI discovers that diffusion transformers effectively scale as video models. They demonstrate this by presenting a comparison of video samples at various stages of training with fixed seeds and inputs, showing a notable improvement in sample quality as the computational resources for training increase.
Sora transforms videos into patches by compressing them into a low-dimensional latent space and decomposing them into spacetime patches. A video compression network is trained to reduce the dimensionality of visual data, enabling Sora to generate videos within this compressed latent space. Additionally, OpenAI develops a decoder model to map generated latents back to pixel space. Spacetime latent patches extracted from compressed videos serve as transformer tokens, allowing Sora to handle videos and images of varying resolutions, durations, and aspect ratios. During inference, the size of generated videos can be controlled by arranging randomly-initialized patches in a grid.
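Assuming the video compression network has already produced a latent of shape (channels, time, height, width), the spacetime patchification step can be sketched as follows; the shapes and patch sizes are illustrative assumptions.

```python
import torch

def to_spacetime_patches(latent, pt=2, ph=4, pw=4):
    """latent: (C, T, H, W) compressed video latent.
    Returns (num_patches, C * pt * ph * pw) flattened spacetime patches."""
    C, T, H, W = latent.shape
    # Cut the latent into non-overlapping pt x ph x pw blocks...
    patches = latent.unfold(1, pt, pt).unfold(2, ph, ph).unfold(3, pw, pw)
    # ...and flatten each block into a single token vector.
    return patches.permute(1, 2, 3, 0, 4, 5, 6).reshape(-1, C * pt * ph * pw)

latent = torch.randn(8, 16, 32, 32)   # stand-in for a compressed video latent
tokens = to_spacetime_patches(latent)
print(tokens.shape)                   # -> torch.Size([512, 256])
```

Each row of the resulting tensor is one spacetime patch token, so longer or larger videos simply produce more tokens rather than requiring a fixed input size.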
Traditionally, approaches to image and video generation involve resizing, cropping, or trimming videos to fit standard sizes. However, OpenAI finds that training on data at its original size offers multiple advantages.
One key advantage is the flexibility in sampling. Their model, Sora, can sample videos of different aspect ratios, including widescreen and vertical formats, allowing for content creation tailored to various devices directly at their native aspect ratios. This also facilitates rapid prototyping at lower resolutions before generating content at full resolution using the same model.
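One way to picture this flexibility, assuming the latent-grid setup described above: the shape of the initial noise grid determines the duration, resolution, and aspect ratio of the output. The downscale factors below are hypothetical, chosen only for illustration.

```python
import torch

def initial_latent(seconds, fps, height, width, channels=8,
                   temporal_stride=8, spatial_stride=8):
    """Build a randomly initialized latent grid whose shape encodes the
    requested duration, resolution, and aspect ratio (hypothetical strides)."""
    frames = seconds * fps
    return torch.randn(channels,
                       frames // temporal_stride,   # latent duration
                       height // spatial_stride,    # latent height
                       width // spatial_stride)     # latent width

widescreen = initial_latent(seconds=4, fps=24, height=1080, width=1920)
vertical = initial_latent(seconds=4, fps=24, height=1920, width=1080)
print(widescreen.shape, vertical.shape)  # different grids, same model
```

The same model can then denoise either grid, which is what allows widescreen, vertical, and quick low-resolution prototype outputs without retraining.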
Additionally, OpenAI empirically demonstrates that training on videos at their native aspect ratios enhances composition and framing. Comparing Sora to a version of their model that crops all training videos to be square, a common practice in training generative models, they observe that videos generated by Sora exhibit improved framing, avoiding instances where subjects are partially out of view.
Moreover, training text-to-video generation systems requires a large volume of videos with corresponding text captions. OpenAI applies the re-captioning technique introduced in DALL·E 3 to videos: they first train a highly descriptive captioner model and then use it to produce text captions for all videos in their training set. OpenAI finds that training on highly descriptive video captions improves text fidelity and overall video quality. As with DALL·E 3, they also leverage GPT to turn short user prompts into detailed captions, enabling Sora to generate high-quality videos that accurately follow user prompts.
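The prompt-expansion step can be approximated with any capable chat model. The sketch below uses the publicly available OpenAI Python client; the system prompt and model choice are assumptions for illustration, not OpenAI's actual production pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(short_prompt: str) -> str:
    """Rewrite a short user idea into a detailed video caption (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model would do for this sketch
        messages=[
            {"role": "system",
             "content": "Rewrite the user's idea as a single, highly detailed "
                        "video caption describing subjects, setting, lighting, "
                        "camera motion, and style."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a dog surfing at sunset"))
```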
Sora possesses the ability to create videos based on provided images and prompts. OpenAI has showcased example videos generated from DALL·E 2 and DALL·E 3 images. These videos depict various scenes, including a Shiba Inu dog sporting a beret and black turtleneck, a flat design style illustration of a diverse family of monsters, a realistic cloud spelling "SORA," and surfers navigating a tidal wave in an ornate historical hall.
Moreover, Sora can extend videos backward or forward in time, which can be used to produce seamless loops. Applying SDEdit, a diffusion-based editing method, Sora can edit videos based on text prompts, transforming styles and environments. It can also interpolate between two input videos, creating smooth transitions, and generate images by arranging patches of Gaussian noise in a spatial grid. Sora's training at scale enables it to simulate various aspects of the physical and digital world, including 3D consistency, long-range coherence, object permanence, interaction with the environment, and simulation of digital worlds such as video games.
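The SDEdit idea behind this kind of editing can be sketched briefly: partially re-noise the latent of an existing video, then denoise it while conditioning on the new prompt. The denoise_step callable and the linear schedule below are stand-ins, since Sora's actual sampler is not published.

```python
import torch

def sdedit(video_latent, prompt_embedding, denoise_step, strength=0.6, steps=30):
    """strength in (0, 1]: how much of the original content is re-noised.
    Higher strength allows larger edits but preserves less of the input."""
    t_start = strength
    # Partially noise the input latent up to t_start (simple linear schedule).
    noise = torch.randn_like(video_latent)
    x = (1 - t_start) * video_latent + t_start * noise
    # Walk the timestep back down to 0, steering with the new text prompt.
    for t in torch.linspace(t_start, 0.0, steps):
        x = denoise_step(x, t, prompt_embedding)
    return x

# Dummy denoiser so the sketch runs end to end: nudges x toward zero.
dummy_step = lambda x, t, p: x * 0.97
edited = sdedit(torch.randn(8, 16, 32, 32), None, dummy_step)
```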
However, Sora still has limitations: it does not accurately model the physics of some basic interactions, such as glass shattering. Despite these limitations, OpenAI believes that Sora's current capabilities point toward a promising path for developing highly capable simulators of both the physical and digital worlds, along with the entities inhabiting them.
To learn more, read OpenAI's technical report on Sora.