Model weights and Gradio demo app published for SANA, developed by NVIDIA Labs: check out this amazing new text-to-image model by NVIDIA
Furkan Gözükara
PhD. Computer Engineer. Produces Content For FLUX, LoRA, Fine Tuning, Stable Diffusion, SDXL, Training, DreamBooth Training, Deep Fake, Voice Cloning, Text To Speech, Text To Image, Text To Video, Generative AI, LLMs
Official repo : https://github.com/NVlabs/Sana
1-Click Windows, RunPod, Massed Compute installers and free Kaggle notebook : https://www.patreon.com/posts/116474081
You can follow the instructions in the repository to install and use it locally. I tested it on my Windows RTX 3060 and 3090 GPUs.
I also measured speed and VRAM usage.
It uses 9.5 GB of VRAM, but someone reported that it works well on 8 GB GPUs too.
Per-image speeds with default settings are as below
More info : https://nvlabs.github.io/Sana/
It also works great in the cloud on RunPod and Massed Compute
Sana : Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
About Sana (taken from the official repo)
We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include:
Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.
Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment.
Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image. Sana enables content creation at low cost.
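To make the token reduction from the 32× autoencoder concrete, here is a minimal Python sketch that counts latent tokens for a square image. The function name latent_tokens and the patch_size parameter are illustrative and not taken from the Sana code base. The patch sizes used in the example (2 for the 8× autoencoder, 1 for the 32× autoencoder) are assumptions, not something stated in the text above.

```python
def latent_tokens(image_size: int, ae_factor: int, patch_size: int = 1) -> int:
    """Count the tokens a diffusion transformer would see for a square image.

    image_size: side length of the image in pixels.
    ae_factor:  spatial downsampling factor of the autoencoder (e.g. 8 or 32).
    patch_size: side length of the latent patch merged into one token
                (an assumed, illustrative parameter).
    """
    stride = ae_factor * patch_size
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by ae_factor * patch_size")
    side = image_size // stride  # tokens along one side of the image
    return side * side


if __name__ == "__main__":
    for size in (1024, 4096):
        conventional = latent_tokens(size, ae_factor=8, patch_size=2)  # assumed setup
        compressed = latent_tokens(size, ae_factor=32, patch_size=1)   # assumed setup
        print(f"{size}x{size}: 8x AE -> {conventional} tokens, "
              f"32x AE -> {compressed} tokens")
```

Under these assumptions a 1024 × 1024 image drops from 4096 tokens to 1024 tokens, a 4× reduction, which matters because attention cost grows with the number of tokens.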
Several Core Design Details for Efficiency
Efficient Training and Inference Strategy: We propose automatic labeling and training strategies to improve text-image consistency. Multiple VLMs generate diverse re-captions, and a CLIPScore-based strategy selects high-CLIPScore captions to enhance convergence and alignment. Additionally, our Flow-DPM-Solver reduces inference steps from 28–50 to 14–20 compared to the Flow-Euler-Solver, with better performance.
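The caption-selection idea can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: it assumes that "selecting high-CLIPScore captions" means sampling one caption per image with probability given by a temperature softmax over the CLIPScores. The function names, the temperature parameter, and the captions and scores in the example are all hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def caption_probabilities(clip_scores: Sequence[float],
                          temperature: float = 0.1) -> List[float]:
    """Turn CLIPScores into sampling probabilities with a temperature softmax.

    A lower temperature concentrates probability on the best-scoring caption;
    a higher one keeps more caption diversity. (Assumed selection rule.)
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in clip_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def pick_caption(captions: Sequence[str],
                 clip_scores: Sequence[float],
                 temperature: float = 0.1,
                 rng: Optional[random.Random] = None) -> str:
    """Sample one caption, favoring those with a higher CLIPScore."""
    if not captions or len(captions) != len(clip_scores):
        raise ValueError("captions and clip_scores must be non-empty and equal length")
    rng = rng or random.Random()
    probs = caption_probabilities(clip_scores, temperature)
    return rng.choices(list(captions), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical re-captions from several VLMs with made-up CLIPScores.
    captions = ["a cat", "a gray cat on a sofa", "an animal indoors"]
    scores = [0.26, 0.31, 0.22]
    print(caption_probabilities(scores))
    print(pick_caption(captions, scores, rng=random.Random(0)))
```

Sampling rather than always taking the top-scoring caption is one way to favor well-aligned captions while still exposing the model to varied descriptions of the same image.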
Overall Performance
We compare Sana with the most advanced text-to-image diffusion models in Table 1. For 512 × 512 resolution, Sana-0.6B demonstrates a throughput that is 5× faster than PixArt-Σ, which has a similar model size, and significantly outperforms it in FID, CLIP Score, GenEval, and DPG-Bench. For 1024 × 1024 resolution, Sana is considerably stronger than most models with <3B parameters and excels in inference latency. Our models achieve competitive performance even when compared to the most advanced large model FLUX-dev. For instance, while the accuracy on DPG-Bench is equivalent and slightly lower on GenEval, Sana-0.6B’s throughput is 39× faster, and Sana-1.6B is 23× faster.