VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
Today's paper introduces VideoGrain, a novel approach to multi-grained video editing that enables precise modifications at the class, instance, and part levels. The method modulates space-time attention in diffusion models to achieve fine-grained control over video content, addressing two key problems that have limited previous approaches: imprecise text-to-region control and feature coupling between regions.
Method Overview
VideoGrain is a zero-shot approach, requiring no training or fine-tuning of the underlying diffusion model. The pipeline begins with DDIM inversion to obtain noisy latent representations of the source video. During denoising, VideoGrain applies Spatial-Temporal Layout-Guided Attention (ST-Layout Attn), which modulates both cross-attention and self-attention to achieve precise text-to-region control while keeping the features of different regions separated.
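To make the first stage concrete, here is a minimal PyTorch sketch of deterministic DDIM inversion; `predict_noise` and the alpha schedule are placeholders standing in for the pretrained video diffusion UNet and its noise scheduler, not VideoGrain's actual implementation.

```python
import torch

def ddim_invert(z0, predict_noise, alphas_cumprod, timesteps):
    """Deterministic DDIM inversion: walk the clean source latents toward noise.

    predict_noise(z, t) stands in for the frozen diffusion UNet (conditioned on
    the source prompt); alphas_cumprod is its cumulative noise schedule.
    """
    z = z0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = predict_noise(z, t)
        # Estimate the clean latent implied by the current noise prediction ...
        z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # ... then re-noise it to the next, noisier timestep on the DDIM trajectory.
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
    return z

# Toy usage with a dummy noise predictor; a real setup would plug in the video
# diffusion model and its scheduler's alphas_cumprod.
alphas = torch.linspace(0.9999, 0.02, 1000)      # stand-in cumulative noise schedule
z0 = torch.randn(1, 4, 8, 64, 64)                # (batch, channels, frames, H, W)
zT = ddim_invert(z0, lambda z, t: torch.randn_like(z), alphas, range(0, 1000, 20))
```

The resulting noisy latents `zT` are then denoised back under the edited prompts, which is where the attention modulation described next comes in.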
In the cross-attention layer, the method modulates attention to amplify each local prompt's focus on its corresponding spatial region while suppressing attention to irrelevant areas. This ensures that textual features are accurately distributed to their intended regions, enabling precise control over which parts of the video are modified by specific text prompts.
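As a rough sketch of how such region-aware cross-attention could look in PyTorch (the region masks, the token-to-region assignment, and the bias strength `lam` are illustrative assumptions rather than the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def region_aware_cross_attention(q, k, v, pixel_region, token_region, lam=2.0):
    """Cross-attention whose logits are biased toward correct region-prompt pairs.

    q: (heads, n_pixels, d)    queries from the video latent pixels
    k, v: (heads, n_tokens, d) keys/values from the text prompt embeddings
    pixel_region: (n_pixels,)  region id of each latent pixel (from layout masks)
    token_region: (n_tokens,)  region id of the local prompt each token belongs to
    """
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5   # (heads, n_pixels, n_tokens)
    # +lam where a pixel and a text token share a region, -lam everywhere else:
    # amplify each local prompt on its own region, suppress it on irrelevant areas.
    match = (pixel_region[:, None] == token_region[None, :]).float()
    bias = lam * (2.0 * match - 1.0)
    return F.softmax(scores + bias, dim=-1) @ v
```

Biasing the logits before the softmax, rather than re-weighting the attention maps afterwards, keeps every row a proper probability distribution.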
In the self-attention layer, VideoGrain modulates attention to enhance intra-region awareness and reduce inter-region interference. This prevents feature coupling, a common issue in diffusion models where pixels in one region attend to other, often visually similar, regions outside it, causing unwanted mixing of features. By ensuring each query attends only to its target region, VideoGrain maintains clear separation between different objects or instances in the video, allowing distinct edits to be applied to each.
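The same masking idea extends to space-time self-attention, where every frame's tokens attend to every other frame's tokens; here is a minimal sketch under the same assumptions as above:

```python
import torch
import torch.nn.functional as F

def st_layout_self_attention(q, k, v, region_ids, lam=2.0):
    """Space-time self-attention biased to keep each query inside its own region.

    q, k, v: (heads, n_frames * n_pixels, d) tokens flattened across all frames
    region_ids: (n_frames * n_pixels,) instance/region id of each space-time token
    """
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    # Positive bias for token pairs inside the same region (in any frame) boosts
    # intra-region awareness; negative bias across regions curbs feature coupling.
    same_region = (region_ids[:, None] == region_ids[None, :]).float()
    bias = lam * (2.0 * same_region - 1.0)
    return F.softmax(scores + bias, dim=-1) @ v
```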
The two modulations follow a unified rule: increase the attention scores of correct pairings (a pixel and its region's prompt in cross-attention, or two pixels of the same region in self-attention) and decrease the scores of incorrect ones. The rule is applied across both spatial and temporal dimensions to keep the edit consistent throughout the video.
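In equation form, both cases reduce to a single pre-softmax bias on the attention logits; the notation below is standard attention notation, and the uniform weight $\lambda$ is a simplification of the paper's weighting rather than its exact definition:

$$
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + \lambda\,\bigl(M_{\text{pos}} - M_{\text{neg}}\bigr)\right)V,
$$

where $M_{\text{pos}}$ marks query-key pairs belonging to the same region (a pixel and its local prompt's tokens in cross-attention, two pixels of the same instance in self-attention) and $M_{\text{neg}}$ marks pairs from different regions.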
Results
VideoGrain demonstrates superior performance compared to existing methods in multi-grained video editing tasks. Qualitative results show that the method can successfully edit videos at class, instance, and part levels, handling complex scenarios such as editing multiple human or animal instances into different characters, modifying backgrounds, and adding specific attributes to individual objects.
In quantitative evaluations, VideoGrain outperforms both Text-to-Image (T2I) and Text-to-Video (T2V) based methods across all metrics. It achieves higher CLIP-T scores (36.56 vs. next best 35.09), indicating better alignment with the target text prompt, and lower Warp-Err (1.42 vs. next best 2.05), showing better temporal consistency. In human evaluations, VideoGrain scores significantly higher in Edit-Accuracy (88.4%), Temporal-Consistency (85.0%), and Overall quality (83.0%) compared to the next best method.
Conclusion
VideoGrain introduces a novel approach to multi-grained video editing by modulating spatial-temporal cross- and self-attention mechanisms in diffusion models. By enhancing text-to-region control and maintaining feature separation between regions, the method enables precise editing at the class, instance, and part levels. For more information, please consult the full paper.
Congrats to the authors for their work!
Yang, Xiangpeng, et al. "VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing." ICLR 2025.