Paper Review: Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
Producing quality segmentation masks is a key challenge in computer vision, especially zero-shot segmentation across varied image styles without task-specific training. The paper addresses this by exploiting the self-attention layers of a pre-trained stable diffusion model, which have already learned to group pixels into objects. The approach iteratively merges the resulting attention maps into valid segmentation masks, using KL divergence to measure their similarity. It requires no additional training or language input and significantly outperforms the previous state-of-the-art unsupervised zero-shot methods in pixel accuracy and mean IoU on the COCO-Stuff-27 dataset.
Method
Stable Diffusion Model Review
The stable diffusion model operates by gradually adding Gaussian noise to images and learning to remove it. It extends traditional diffusion models with an encoder-decoder and a U-Net design, incorporating self-attention and cross-attention mechanisms in its Transformer layers. The model compresses images into a latent space, runs the diffusion process there, and then decompresses the result back into pixel space.
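To make this concrete, below is a minimal sketch of loading such a model with the Huggingface diffusers library and encoding an image into the latent space where diffusion happens (the checkpoint name is an assumption; any Stable Diffusion 1.x checkpoint exposes the same components):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a pre-trained Stable Diffusion pipeline (checkpoint name assumed)
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    # The pipeline exposes the parts described above:
    #   pipe.vae       - encoder-decoder mapping 512x512x3 images to 64x64x4 latents
    #   pipe.unet      - U-Net with ResNet and Transformer (self-/cross-attention) blocks
    #   pipe.scheduler - controls the forward (noising) / reverse (denoising) process
    image = torch.randn(1, 3, 512, 512)  # stand-in for a real image scaled to [-1, 1]
    latents = pipe.vae.encode(image).latent_dist.sample() * pipe.vae.config.scaling_factor
    print(latents.shape)  # torch.Size([1, 4, 64, 64]) -- diffusion runs in this space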
The U-Net architecture consists of modular blocks combining ResNet and Transformer layers. The self-attention layers within these blocks are hypothesized to contain inherent object-grouping information, which can be used to produce segmentation masks without any text input. The model's attention maps capture semantic correlations and focus on object groups, and their resolution determines the granularity of the grouping.
For experimentation, pre-trained models from Huggingface are adapted to extract attention maps for existing images. This involves using an unconditioned latent and running the diffusion process only once with a large time-step value. The goal is to aggregate the maps from different resolutions and develop a method to merge them all into a valid segmentation.
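A hypothetical sketch of this extraction step, reusing pipe and latents from the snippet above: a custom attention processor recomputes and stores the softmax attention probabilities of every self-attention layer during a single U-Net pass at a large time-step (the processor class and the exact time-step value are my assumptions, not the authors' code):

    import torch
    from diffusers.models.attention_processor import AttnProcessor

    class StoreAttnProcessor(AttnProcessor):
        """Recomputes and stores self-attention probabilities (assumed helper)."""
        def __init__(self, store):
            super().__init__()
            self.store = store

        def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
            if encoder_hidden_states is None:  # keep self-attention (attn1) only
                q = attn.head_to_batch_dim(attn.to_q(hidden_states))
                k = attn.head_to_batch_dim(attn.to_k(hidden_states))
                # (heads, h*w, h*w) softmax maps; h*w reveals the layer resolution
                self.store.append(attn.get_attention_scores(q, k).detach().cpu())
            return super().__call__(attn, hidden_states, encoder_hidden_states, **kwargs)

    attention_maps = []
    pipe.unet.set_attn_processor(StoreAttnProcessor(attention_maps))

    t = torch.tensor([300])                 # single large time-step (value assumed)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    empty = pipe.tokenizer("", return_tensors="pt").input_ids
    uncond = pipe.text_encoder(empty)[0]    # unconditioned (empty-prompt) embedding
    _ = pipe.unet(noisy, t, encoder_hidden_states=uncond)  # one U-Net forward pass
    # attention_maps now holds self-attention tensors at the 8/16/32/64 resolutions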
DiffSeg
DiffSeg is a post-processing technique to convert attention tensors from a stable diffusion model into valid segmentation masks. It involves three main components: attention aggregation, iterative attention merging, and non-maximum suppression.
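A rough sketch of the iterative merging component (my own simplified version, not the authors' exact algorithm): the aggregated attention maps are treated as probability distributions over spatial locations, and maps whose symmetric KL divergence falls below a threshold tau are merged into one object proposal:

    import torch

    def symmetric_kl(p, q, eps=1e-8):
        """D(p, q) = KL(p||q) + KL(q||p) between attention distributions."""
        p = p.clamp_min(eps)
        q = q.clamp_min(eps)
        return (p * (p / q).log()).sum(-1) + (q * (q / p).log()).sum(-1)

    def iterative_merge(maps, tau=1.0, n_iters=3):
        """maps: (N, H*W) attention maps, each row normalized to sum to 1."""
        proposals = maps
        for _ in range(n_iters):
            merged = []
            used = torch.zeros(len(proposals), dtype=torch.bool)
            for i in range(len(proposals)):
                if used[i]:
                    continue
                d = symmetric_kl(proposals[i:i + 1], proposals)  # distance to all maps
                group = (d < tau) & ~used                        # similar, unclaimed maps
                used |= group
                merged.append(proposals[group].mean(0))          # average into one proposal
            proposals = torch.stack(merged)
        return proposals  # each row is one object proposal's attention map

The threshold tau controls how aggressively maps are merged and is one of the method's key hyper-parameters.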
Experiments
On the COCO-Stuff-27 benchmark, two k-means baselines were included: K-Means-C (using a constant number of clusters) and K-Means-S (using the ground-truth number of clusters for each image). Both variants outperformed previous methods, highlighting the effectiveness of the self-attention tensors themselves. K-Means-S performed better than K-Means-C, indicating the importance of tuning the number of clusters per image. However, DiffSeg, utilizing the same attention tensors, significantly outperformed both baselines, demonstrating superior segmentation capability without the drawbacks of k-means.
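For reference, a k-means baseline of this kind could look as follows (a sketch under the assumption that clustering runs on per-pixel rows of the aggregated attention tensor; agg is the output of the aggregation step shown later):

    import torch
    from sklearn.cluster import KMeans

    def kmeans_baseline(agg, k=6):
        """agg: (64*64, 64, 64) aggregated attention -> (64, 64) cluster labels.
        K-Means-C fixes k; K-Means-S would pass the ground-truth count instead."""
        feats = agg.reshape(64 * 64, -1).numpy()   # one 4096-dim feature per pixel
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
        return torch.from_numpy(labels).reshape(64, 64)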
DiffSeg also significantly outperformed the previous state-of-the-art zero-shot method, ReCo, on COCO-Stuff-27, showing a 26% improvement in accuracy and 17% in mIoU for both 320 and 512 resolutions. On the Cityscapes self-driving segmentation task, DiffSeg matched prior works at a 320-resolution input and outperformed them at a 512-resolution input in both accuracy and mIoU. The performance on Cityscapes was more affected by input resolution due to the presence of smaller classes like light poles and traffic signs.
DiffSeg achieves this high level of performance in a purely zero-shot manner, without any language dependency or auxiliary images, enabling it to segment any image effectively.
In DiffSeg, several hyper-parameters play crucial roles. One key aspect is the aggregation weights used in the attention aggregation step, where attention maps of four different resolutions are combined. The aggregation weight for each map is proportional to its resolution, meaning higher-resolution maps are given more importance. This approach is based on the observation that higher-resolution maps, having smaller receptive fields relative to the original image, provide more detailed information.
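A sketch of this aggregation step under the assumptions above (the helper name and the nearest-neighbor replication of the query dimension are my choices, not necessarily the authors'):

    import torch
    import torch.nn.functional as F

    def aggregate_attention(maps_by_res, out_res=64):
        """maps_by_res: {res: (res, res, res, res) 4D self-attention tensor}."""
        z = sum(maps_by_res)                       # sum of resolutions, normalizes weights
        agg = torch.zeros(out_res * out_res, out_res, out_res)
        for r, a in maps_by_res.items():
            # bilinearly upsample the key (last two) dimensions to out_res
            a = F.interpolate(a.reshape(r * r, 1, r, r), size=(out_res, out_res),
                              mode="bilinear", align_corners=False).squeeze(1)
            a = a / a.sum(dim=(-2, -1), keepdim=True)   # keep each map a distribution
            # replicate the query (first two) dimensions up to out_res
            a = a.reshape(r, r, out_res, out_res)
            a = a.repeat_interleave(out_res // r, 0).repeat_interleave(out_res // r, 1)
            # weight proportional to resolution: high-res maps contribute more
            agg += (r / z) * a.reshape(out_res * out_res, out_res, out_res)
        return agg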
High-resolution maps (e.g., 64×64) produce detailed but fractured segmentation, while lower-resolution maps (e.g., 32×32) offer more coherent segmentation but may over-segment details, particularly along edges. Very low resolutions result in overly simplified segmentation, merging the entire image into one object with the given hyper-parameter settings. The proportional aggregation strategy effectively balances consistency and detail in the segmentation process.
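The final non-maximum suppression step then reduces the merged proposals to a single valid, non-overlapping mask; a minimal sketch (the function name is mine):

    import torch
    import torch.nn.functional as F

    def masks_from_proposals(proposals, image_size=512):
        """proposals: (K, 64, 64) merged attention maps -> integer label mask."""
        labels = proposals.argmax(dim=0)            # winner-take-all per location
        labels = F.interpolate(labels[None, None].float(),
                               size=(image_size, image_size), mode="nearest")
        return labels[0, 0].long()                  # (image_size, image_size) mask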