How far can we go with ImageNet for Text-to-Image generation?
Today's paper challenges the prevailing "bigger is better" paradigm in text-to-image generation by demonstrating that models trained on a small, well-curated dataset with strategic augmentation can match or outperform models trained on massive web-scraped collections. Using only ImageNet, enhanced with carefully designed text and image augmentations, the authors achieve superior performance with roughly 1/10th the parameters and 1/1000th the training images of larger models such as Stable Diffusion XL.
Method Overview
The method introduces a systematic approach to train text-to-image diffusion models using only ImageNet, a well-known dataset with 1.2 million images. The approach has two complementary axes: text-space augmentation and pixel-space augmentation.
For text-space augmentation, the authors convert ImageNet's simple class labels into semantically rich scene descriptions. Instead of using basic captions like "an image of a golden retriever," they use LLaVA (a vision-language model) to generate comprehensive captions that capture scene composition, spatial relationships, background elements, visual attributes, and interactions between elements.
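As a concrete illustration, here is a minimal sketch of how such dense recaptioning could be done with an off-the-shelf LLaVA checkpoint from Hugging Face. The model ID, prompt wording, and generation settings below are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch of dense recaptioning with a LLaVA checkpoint
# (assumed setup and prompt, not the authors' exact pipeline).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint for illustration
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def dense_caption(image_path: str) -> str:
    """Turn a class-labeled ImageNet image into a rich scene description."""
    image = Image.open(image_path).convert("RGB")
    # The prompt asks for composition, spatial relations, background and
    # attributes, mirroring the kind of detail described in the paper.
    prompt = (
        "USER: <image>\nDescribe this image in detail: the main objects, "
        "their spatial arrangement, the background, colors, and any "
        "interactions between elements. ASSISTANT:"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=128)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()
```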
For pixel-space augmentation, the authors introduce a structured CutMix framework that systematically combines concepts while preserving visual coherence. They define four precise augmentation patterns (Half-Mix, Quarter-Mix, Ninth-Mix, and Sixteenth-Mix), each designed to maintain visual coherence while introducing novel concept combinations. For example, in Quarter-Mix, a second image is resized to 50% of its side length and placed at one of four corners of the base image, occupying 25% of the final image. After augmentation, LLaVA is used to caption all generated images, ensuring semantic alignment between visual and textual representations.
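To make the pixel-space augmentation concrete, below is a rough sketch of the Quarter-Mix pattern as described above: a second image is resized to half the base image's side length and pasted into a random corner. The function name and interface are invented for illustration and are not taken from the paper's code.

```python
# Rough sketch of the Quarter-Mix pattern: the second image is resized to 50%
# of the base image's side length and pasted into one of the four corners,
# so it covers 25% of the final image. Names and interface are illustrative.
import random
from PIL import Image

def quarter_mix(base: Image.Image, other: Image.Image) -> Image.Image:
    """Paste `other`, resized to half the base side length, into a random corner."""
    w, h = base.size
    patch = other.resize((w // 2, h // 2))
    corners = [(0, 0), (w - w // 2, 0), (0, h - h // 2), (w - w // 2, h - h // 2)]
    out = base.copy()
    out.paste(patch, random.choice(corners))
    return out

# Usage: mixed = quarter_mix(Image.open("dog.jpg"), Image.open("piano.jpg"))
# The mixed image would then be re-captioned with LLaVA so the text matches it.
```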
During training, augmented images are used only at timesteps where the noisy image is sufficiently degraded that the pasting artifacts no longer matter. This is controlled by a timestep threshold and a probability parameter that determines how frequently augmented images appear in training batches.
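This gating logic can be sketched roughly as follows. The parameter names (`p_aug`, `t_threshold`) and the exact sampling scheme are assumptions for illustration; the paper's precise formulation is not reproduced here.

```python
# Rough sketch of gating augmented images by diffusion timestep. Parameter
# names (p_aug, t_threshold) and the sampling scheme are assumptions.
import torch

def sample_training_batch(clean_imgs, aug_imgs, num_timesteps=1000,
                          p_aug=0.3, t_threshold=600):
    """Pick per-sample images and timesteps for one diffusion training step.

    Augmented (CutMix) images are drawn only at large timesteps, where the
    added noise is strong enough to hide their pasting artifacts.
    """
    b = clean_imgs.shape[0]
    # Decide per-sample whether to use an augmented image.
    use_aug = torch.rand(b) < p_aug
    # Clean images may use any timestep; augmented ones only the noisier range.
    t_clean = torch.randint(0, num_timesteps, (b,))
    t_aug = torch.randint(t_threshold, num_timesteps, (b,))
    t = torch.where(use_aug, t_aug, t_clean)
    imgs = torch.where(use_aug.view(-1, 1, 1, 1), aug_imgs, clean_imgs)
    return imgs, t
```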
The combination of these techniques allows the model to learn complex visual concepts and compositions without requiring massive datasets. The authors train two model architectures (DiT-I and CAD-I) using this approach and compare their performance against state-of-the-art models trained on much larger datasets.
Results
The results show that models trained with the proposed method can match or exceed the performance of models trained on billion-scale datasets, despite using orders of magnitude fewer parameters and training images.
Conclusion
The paper shows that, through careful visual and textual augmentation, models trained on just 1.2M image-text pairs can match or exceed the performance of those trained on thousand-fold larger datasets. For more information, please consult the full paper.
Congrats to the authors for their work!
Degeorge, Lucas, et al. "How far can we go with ImageNet for Text-to-Image generation?" arXiv preprint arXiv:2502.21318 (2025).