How far can we go with ImageNet for Text-to-Image generation?
Today's paper challenges the prevailing "bigger is better" paradigm in text-to-image generation by demonstrating that models trained on a small, well-curated dataset with strategic augmentation can match or outperform models trained on massive web-scraped collections. Using only ImageNet, enhanced with carefully designed text and image augmentations, the authors achieve superior performance with roughly 1/10th the parameters and 1/1000th the training images of larger models such as Stable Diffusion XL.
Method Overview
The method introduces a systematic approach to train text-to-image diffusion models using only ImageNet, a well-known dataset with 1.2 million images. The approach has two complementary axes: text-space augmentation and pixel-space augmentation.
For text-space augmentation, the authors convert ImageNet's simple class labels into semantically rich scene descriptions. Instead of using basic captions like "an image of a golden retriever," they use LLaVA (a vision-language model) to generate comprehensive captions that capture scene composition, spatial relationships, background elements, visual attributes, and interactions between elements.
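As a concrete illustration, here is a minimal sketch of how such dense recaptioning could be done with an off-the-shelf LLaVA checkpoint from Hugging Face. The model ID, prompt wording, and generation settings below are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch of dense recaptioning with a LLaVA checkpoint
# (assumed setup and prompt, not the authors' exact pipeline).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint for illustration
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def dense_caption(image_path: str) -> str:
    """Turn a class-labeled ImageNet image into a rich scene description."""
    image = Image.open(image_path).convert("RGB")
    # The prompt asks for composition, spatial relations, background and
    # attributes, mirroring the kind of detail described in the paper.
    prompt = (
        "USER: <image>\nDescribe this image in detail: the main objects, "
        "their spatial arrangement, the background, colors, and any "
        "interactions between elements. ASSISTANT:"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=128)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()
```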
For pixel-space augmentation, the authors introduce a structured CutMix framework that systematically combines concepts while preserving visual coherence. They define four precise augmentation patterns (Half-Mix, Quarter-Mix, Ninth-Mix, and Sixteenth-Mix), each designed to maintain visual coherence while introducing novel concept combinations. For example, in Quarter-Mix, a second image is resized to 50% of its side length and placed at one of four corners of the base image, occupying 25% of the final image. After augmentation, LLaVA is used to caption all generated images, ensuring semantic alignment between visual and textual representations.
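To make the pixel-space augmentation concrete, below is a rough sketch of the Quarter-Mix pattern as described above: a second image is resized to half the base image's side length and pasted into a random corner. The function name and interface are invented for illustration and are not taken from the paper's code.

```python
# Rough sketch of the Quarter-Mix pattern: the second image is resized to 50%
# of the base image's side length and pasted into one of the four corners,
# so it covers 25% of the final image. Names and interface are illustrative.
import random
from PIL import Image

def quarter_mix(base: Image.Image, other: Image.Image) -> Image.Image:
    """Paste `other`, resized to half the base side length, into a random corner."""
    w, h = base.size
    patch = other.resize((w // 2, h // 2))
    corners = [(0, 0), (w - w // 2, 0), (0, h - h // 2), (w - w // 2, h - h // 2)]
    out = base.copy()
    out.paste(patch, random.choice(corners))
    return out

# Usage: mixed = quarter_mix(Image.open("dog.jpg"), Image.open("piano.jpg"))
# The mixed image would then be re-captioned with LLaVA so the text matches it.
```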
During training, augmented images are used only at timesteps where the noisy image is sufficiently degraded that the pasting artifacts no longer matter. This is controlled by a timestep threshold and a probability parameter that determines how frequently augmented images appear in training batches.
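This gating logic can be sketched roughly as follows. The parameter names (`p_aug`, `t_threshold`) and the exact sampling scheme are assumptions for illustration; the paper's precise formulation is not reproduced here.

```python
# Rough sketch of gating augmented images by diffusion timestep. Parameter
# names (p_aug, t_threshold) and the sampling scheme are assumptions.
import torch

def sample_training_batch(clean_imgs, aug_imgs, num_timesteps=1000,
                          p_aug=0.3, t_threshold=600):
    """Pick per-sample images and timesteps for one diffusion training step.

    Augmented (CutMix) images are drawn only at large timesteps, where the
    added noise is strong enough to hide their pasting artifacts.
    """
    b = clean_imgs.shape[0]
    # Decide per-sample whether to use an augmented image.
    use_aug = torch.rand(b) < p_aug
    # Clean images may use any timestep; augmented ones only the noisier range.
    t_clean = torch.randint(0, num_timesteps, (b,))
    t_aug = torch.randint(t_threshold, num_timesteps, (b,))
    t = torch.where(use_aug, t_aug, t_clean)
    imgs = torch.where(use_aug.view(-1, 1, 1, 1), aug_imgs, clean_imgs)
    return imgs, t
```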
The combination of these techniques allows the model to learn complex visual concepts and compositions without requiring massive datasets. The authors train two model architectures (DiT-I and CAD-I) using this approach and compare their performance against state-of-the-art models trained on much larger datasets.
Results
The results show that models trained with the proposed method can match or exceed the performance of models trained on billion-scale datasets, despite using orders of magnitude fewer parameters and training images.
Conclusion
The paper shows that, through careful visual and textual augmentation, models trained on just 1.2M image-text pairs can match or exceed the performance of those trained on thousand-fold larger datasets. For more information, please consult the full paper.
Congrats to the authors for their work!
Degeorge, Lucas, et al. "How far can we go with ImageNet for Text-to-Image generation?" arXiv preprint arXiv:2502.21318 (2025).