Diffusion Transformer and Its Applications, Including OpenAI's Sora
Frank Morales Aguilera, BEng, MEng, SMIEEE
Boeing Associate Technical Fellow / Engineer / Scientist / Inventor / Cloud Solution Architect / Software Developer @ Boeing Global Services
Introduction
Diffusion Transformer (DiT) is a novel class of diffusion models that leverages the transformer architecture[1]. Developed by William Peebles at UC Berkeley and Saining Xie at New York University[2], DiT aims to improve the performance of diffusion models by replacing the commonly used U-Net backbone with a transformer[1]. One of the most notable applications of DiT is OpenAI's Sora, a text-to-video model[3].
Architecture of DiT
The architecture of DiT is similar to that of a standard Vision Transformer (ViT), with a few critical modifications[1]. The model first passes its spatial input through a patchify layer, converting it into a sequence of tokens[1]. Standard ViT-style positional embeddings are applied to all input tokens[1], and the token sequence is then processed by a series of transformer blocks[1].
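To make the input pipeline concrete, here is a minimal PyTorch-style sketch of the patchify-and-embed step described above. The patch size, channel count, and hidden dimension are illustrative assumptions rather than the reference implementation, and a zero tensor stands in for the positional embeddings that would be learned or fixed sin-cos in practice.

```python
# Minimal sketch of the DiT input pipeline: patchify a spatial latent into tokens,
# then add positional embeddings. Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Convert a spatial latent (B, C, H, W) into a sequence of tokens (B, N, D)."""

    def __init__(self, patch_size=2, in_channels=4, hidden_size=1152):
        super().__init__()
        # A strided convolution is a common way to implement patchification.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, N, D), N = (H/p) * (W/p)


# Example: a 32x32x4 latent with patch size 2 yields 256 tokens.
tokens = PatchEmbed()(torch.randn(1, 4, 32, 32))
# Placeholder positional embeddings; a real model would use learned or sin-cos values.
pos_embed = torch.zeros(1, tokens.shape[1], tokens.shape[2])
tokens = tokens + pos_embed
print(tokens.shape)                           # torch.Size([1, 256, 1152])
```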
In addition to the noised image input, diffusion models sometimes process additional conditional information, such as noise timesteps, class labels, and natural-language prompts[1]. DiT explores four variants of the transformer block, each handling these conditional inputs in a different way[1].
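As an illustration of one of these variants, the sketch below implements a single transformer block with adaptive layer norm (adaLN) conditioning, in which a conditioning vector (for example, an embedding of the timestep and class label) regresses per-block scale and shift parameters. The layer sizes and the `AdaLNBlock` name are assumptions for illustration; this is a simplified adaLN block without the gating used in adaLN-Zero, not the paper's exact code.

```python
# A hedged sketch of one transformer block with adaLN-style conditioning.
import torch
import torch.nn as nn


class AdaLNBlock(nn.Module):
    def __init__(self, hidden_size=1152, num_heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_ratio * hidden_size),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_size, hidden_size),
        )
        # Regress per-block scale and shift parameters from the conditioning vector.
        self.ada = nn.Linear(hidden_size, 4 * hidden_size)

    def forward(self, x, c):
        shift1, scale1, shift2, scale2 = self.ada(c).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention branch
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + self.mlp(h)                               # MLP branch


block = AdaLNBlock()
x = torch.randn(1, 256, 1152)   # token sequence from the patchify step
c = torch.randn(1, 1152)        # timestep + class-label conditioning embedding
print(block(x, c).shape)        # torch.Size([1, 256, 1152])
```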
The Diffusion Transformer (DiT) applies the transformer architecture within the diffusion framework. At a high level, the process works as follows:
1. The input image is encoded into a latent representation, and noise is added according to the diffusion schedule.
2. The noisy latent is patchified into a sequence of tokens, and positional embeddings are added.
3. Conditioning information (the noise timestep and, for example, a class label or text) is embedded and injected into the transformer blocks.
4. The token sequence is processed by a stack of transformer blocks.
5. The output tokens are decoded into a noise prediction, which the diffusion sampler uses to iteratively denoise the latent.
This process allows the DiT to model complex data distributions effectively and has led to significant improvements in the performance of diffusion models.
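To connect these steps, here is a hedged sketch of one denoising training step: noise is added to a clean latent according to a simple schedule, and a hypothetical `model(x_t, t, labels)` standing in for a DiT is trained to predict that noise with a mean-squared-error loss. The linear beta schedule and the model signature are assumptions, not the official implementation.

```python
# A minimal sketch of one denoising training step under the assumptions above.
import torch
import torch.nn.functional as F


def training_step(model, x0, labels, num_timesteps=1000):
    """x0: clean latents (B, C, H, W); labels: class labels (B,)."""
    b = x0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)

    # Simple linear beta schedule (assumption) and the usual forward process:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    betas = torch.linspace(1e-4, 0.02, num_timesteps, device=x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)

    eps = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps

    # The transformer predicts the added noise; training minimizes the MSE.
    eps_pred = model(x_t, t, labels)
    return F.mse_loss(eps_pred, eps)

# Usage (illustrative): loss = training_step(dit_model, latents, labels); loss.backward()
```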
Performance and Scalability of DiT
DiT models have demonstrated impressive scalability properties. The scalability of DiT is analyzed through the lens of forward-pass complexity as measured in Gflops[1]. DiTs with higher Gflops, achieved through increased transformer depth/width or an increased number of input tokens, consistently achieve lower Fréchet Inception Distance (FID)[1].
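As a rough illustration of how depth, width, and token count drive this compute budget, the function below estimates forward-pass Gflops with the standard per-layer matrix-multiply approximation, counting one FLOP per multiply-accumulate. This is a back-of-the-envelope formula, not the exact accounting used in the paper.

```python
# Back-of-the-envelope forward-pass compute estimate for a transformer.
def transformer_gflops(depth, width, num_tokens, mlp_ratio=4):
    attn = 4 * num_tokens * width**2 + 2 * num_tokens**2 * width  # QKV/output projections + attention
    mlp = 2 * mlp_ratio * num_tokens * width**2                    # two MLP matmuls
    return depth * (attn + mlp) / 1e9


# Example: depth 28, width 1152, 256 tokens (roughly a DiT-XL/2-sized configuration).
print(f"{transformer_gflops(28, 1152, 256):.0f} Gflops")          # ~118 Gflops
```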
In addition to good scalability properties, DiT models have outperformed all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter[1].
OpenAI's Sora: An Application of DiT
OpenAI's Sora is a generative AI model that can create realistic and imaginative scenes from text instructions[3]. It uses a Diffusion Transformer (DiT) architecture, which combines transformer and diffusion models[3].
Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt[3]. It leverages a transformer architecture that operates on spacetime patches of video and image latent codes[3].
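To illustrate the idea of spacetime patches, the sketch below flattens a video latent into tokens using a 3D patchify layer that spans both time and space. OpenAI has not published Sora's implementation, so the patch sizes, channel counts, and the `SpacetimePatchEmbed` name are purely illustrative assumptions.

```python
# An illustrative sketch of "spacetime patches": a video latent (B, C, T, H, W)
# is cut into patches spanning both time and space and flattened into tokens.
import torch
import torch.nn as nn


class SpacetimePatchEmbed(nn.Module):
    def __init__(self, in_channels=4, hidden_size=1152, t_patch=2, s_patch=2):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, hidden_size,
                              kernel_size=(t_patch, s_patch, s_patch),
                              stride=(t_patch, s_patch, s_patch))

    def forward(self, x):                     # x: (B, C, T, H, W) video latent
        x = self.proj(x)                      # (B, D, T/pt, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) spacetime tokens


# Example: 16 latent frames at 32x32 with 2x2x2 patches -> 2048 tokens.
tokens = SpacetimePatchEmbed()(torch.randn(1, 4, 16, 32, 32))
print(tokens.shape)                           # torch.Size([1, 2048, 1152])
```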
Conclusion
The development of DiT represents a significant advancement in the field of diffusion models. By replacing the U-Net backbone in latent diffusion models (LDMs) with a transformer, DiT has demonstrated improved performance and scalability[1]. This innovative approach opens up new possibilities for applying transformers in diffusion models and beyond[4].
The application of DiT in OpenAI's Sora showcases the potential of this technology in generating high-quality, realistic videos from text instructions[3]. As AI continues to evolve, we can expect to see more innovative applications of DiT and similar technologies in the future.
References
3. Sora (openai.com)