Paper Review: Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
Producing quality segmentation masks is a key challenge in computer vision, especially zero-shot segmentation across varied image styles without task-specific training. The paper addresses this by exploiting the self-attention layers of a pre-trained stable diffusion model, which have already learned to group pixels into objects. The approach iteratively merges the resulting attention maps into valid segmentation masks, using KL divergence to measure their similarity. It requires no additional training or language input and significantly outperforms the previous state-of-the-art unsupervised zero-shot methods in pixel accuracy and mean IoU on the COCO-Stuff-27 dataset.
Method
Stable Diffusion Model Review
The stable diffusion model operates by gradually adding Gaussian noise to images and learning to remove it. It extends traditional diffusion models with an encoder-decoder and a U-Net design, incorporating self-attention and cross-attention mechanisms in its Transformer layers. The model compresses images into a latent space, runs the diffusion process there, and then decompresses the result back into pixel space.
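To make this concrete, below is a minimal sketch of loading such a model with the Huggingface diffusers library and encoding an image into the latent space where diffusion happens (the checkpoint name is an assumption; any Stable Diffusion 1.x checkpoint exposes the same components):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a pre-trained Stable Diffusion pipeline (checkpoint name assumed)
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    # The pipeline exposes the parts described above:
    #   pipe.vae       - encoder-decoder mapping 512x512x3 images to 64x64x4 latents
    #   pipe.unet      - U-Net with ResNet and Transformer (self-/cross-attention) blocks
    #   pipe.scheduler - controls the forward (noising) / reverse (denoising) process
    image = torch.randn(1, 3, 512, 512)  # stand-in for a real image scaled to [-1, 1]
    latents = pipe.vae.encode(image).latent_dist.sample() * pipe.vae.config.scaling_factor
    print(latents.shape)  # torch.Size([1, 4, 64, 64]) -- diffusion runs in this space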
The U-Net architecture consists of modular blocks combining ResNet and Transformer layers. The self-attention layers within these blocks are hypothesized to contain inherent object-grouping information, which can be used to produce segmentation masks without any text input. The model's attention maps capture semantic correlations and focus on object groups, and their resolution determines the granularity of the grouping.
For experimentation, pre-trained models from Huggingface are adapted to extract attention maps for existing images. This involves using an unconditioned latent and running the diffusion process only once with a large time-step value. The goal is to aggregate the maps from different resolutions and develop a method to merge them all into a valid segmentation.
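A hypothetical sketch of this extraction step, reusing pipe and latents from the snippet above: a custom attention processor recomputes and stores the softmax attention probabilities of every self-attention layer during a single U-Net pass at a large time-step (the processor class and the exact time-step value are my assumptions, not the authors' code):

    import torch
    from diffusers.models.attention_processor import AttnProcessor

    class StoreAttnProcessor(AttnProcessor):
        """Recomputes and stores self-attention probabilities (assumed helper)."""
        def __init__(self, store):
            super().__init__()
            self.store = store

        def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
            if encoder_hidden_states is None:  # keep self-attention (attn1) only
                q = attn.head_to_batch_dim(attn.to_q(hidden_states))
                k = attn.head_to_batch_dim(attn.to_k(hidden_states))
                # (heads, h*w, h*w) softmax maps; h*w reveals the layer resolution
                self.store.append(attn.get_attention_scores(q, k).detach().cpu())
            return super().__call__(attn, hidden_states, encoder_hidden_states, **kwargs)

    attention_maps = []
    pipe.unet.set_attn_processor(StoreAttnProcessor(attention_maps))

    t = torch.tensor([300])                 # single large time-step (value assumed)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    empty = pipe.tokenizer("", return_tensors="pt").input_ids
    uncond = pipe.text_encoder(empty)[0]    # unconditioned (empty-prompt) embedding
    _ = pipe.unet(noisy, t, encoder_hidden_states=uncond)  # one U-Net forward pass
    # attention_maps now holds self-attention tensors at the 8/16/32/64 resolutions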
DiffSeg
DiffSeg is a post-processing technique to convert attention tensors from a stable diffusion model into valid segmentation masks. It involves three main components: attention aggregation, iterative attention merging, and non-maximum suppression.
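A rough sketch of the iterative merging component (my own simplified version, not the authors' exact algorithm): the aggregated attention maps are treated as probability distributions over spatial locations, and maps whose symmetric KL divergence falls below a threshold tau are merged into one object proposal:

    import torch

    def symmetric_kl(p, q, eps=1e-8):
        """D(p, q) = KL(p||q) + KL(q||p) between attention distributions."""
        p = p.clamp_min(eps)
        q = q.clamp_min(eps)
        return (p * (p / q).log()).sum(-1) + (q * (q / p).log()).sum(-1)

    def iterative_merge(maps, tau=1.0, n_iters=3):
        """maps: (N, H*W) attention maps, each row normalized to sum to 1."""
        proposals = maps
        for _ in range(n_iters):
            merged = []
            used = torch.zeros(len(proposals), dtype=torch.bool)
            for i in range(len(proposals)):
                if used[i]:
                    continue
                d = symmetric_kl(proposals[i:i + 1], proposals)  # distance to all maps
                group = (d < tau) & ~used                        # similar, unclaimed maps
                used |= group
                merged.append(proposals[group].mean(0))          # average into one proposal
            proposals = torch.stack(merged)
        return proposals  # each row is one object proposal's attention map

The threshold tau controls how aggressively maps are merged and is one of the method's key hyper-parameters.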
Experiments
On the COCO-Stuff-27 benchmark, two k-means baselines were included: K-Means-C (using a constant number of clusters) and K-Means-S (using the ground-truth number of clusters for each image). Both variants outperformed previous methods, highlighting the effectiveness of the self-attention tensors themselves. K-Means-S performed better than K-Means-C, indicating the importance of tuning the number of clusters per image. However, DiffSeg, utilizing the same attention tensors, significantly outperformed both baselines, demonstrating superior segmentation capability without the drawbacks of k-means.
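For reference, a k-means baseline of this kind could look as follows (a sketch under the assumption that clustering runs on per-pixel rows of the aggregated attention tensor; agg is the output of the aggregation step shown later):

    import torch
    from sklearn.cluster import KMeans

    def kmeans_baseline(agg, k=6):
        """agg: (64*64, 64, 64) aggregated attention -> (64, 64) cluster labels.
        K-Means-C fixes k; K-Means-S would pass the ground-truth count instead."""
        feats = agg.reshape(64 * 64, -1).numpy()   # one 4096-dim feature per pixel
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
        return torch.from_numpy(labels).reshape(64, 64)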
DiffSeg also significantly outperformed the previous state-of-the-art zero-shot method, ReCo, on COCO-Stuff-27, showing a 26% improvement in accuracy and 17% in mIoU for both 320 and 512 resolutions. On the Cityscapes self-driving segmentation task, DiffSeg matched prior works at a 320-resolution input and outperformed them at a 512-resolution input in both accuracy and mIoU. The performance on Cityscapes was more affected by input resolution due to the presence of smaller classes like light poles and traffic signs.
DiffSeg achieves this high level of performance in a purely zero-shot manner, without any language dependency or auxiliary images, enabling it to segment any image effectively.
In DiffSeg, several hyper-parameters play crucial roles. One key aspect is the aggregation weights used in the attention aggregation step, where attention maps of four different resolutions are combined. The aggregation weight for each map is proportional to its resolution, meaning higher-resolution maps are given more importance. This approach is based on the observation that higher-resolution maps, having smaller receptive fields relative to the original image, provide more detailed information.
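A sketch of this aggregation step under the assumptions above (the helper name and the nearest-neighbor replication of the query dimension are my choices, not necessarily the authors'):

    import torch
    import torch.nn.functional as F

    def aggregate_attention(maps_by_res, out_res=64):
        """maps_by_res: {res: (res, res, res, res) 4D self-attention tensor}."""
        z = sum(maps_by_res)                       # sum of resolutions, normalizes weights
        agg = torch.zeros(out_res * out_res, out_res, out_res)
        for r, a in maps_by_res.items():
            # bilinearly upsample the key (last two) dimensions to out_res
            a = F.interpolate(a.reshape(r * r, 1, r, r), size=(out_res, out_res),
                              mode="bilinear", align_corners=False).squeeze(1)
            a = a / a.sum(dim=(-2, -1), keepdim=True)   # keep each map a distribution
            # replicate the query (first two) dimensions up to out_res
            a = a.reshape(r, r, out_res, out_res)
            a = a.repeat_interleave(out_res // r, 0).repeat_interleave(out_res // r, 1)
            # weight proportional to resolution: high-res maps contribute more
            agg += (r / z) * a.reshape(out_res * out_res, out_res, out_res)
        return agg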
High-resolution maps (e.g., 64×64) produce detailed but fractured segmentation, while lower-resolution maps (e.g., 32×32) offer more coherent segmentation but may over-segment details, particularly along edges. Very low resolutions result in overly simplified segmentation, merging the entire image into one object with the given hyper-parameter settings. The proportional aggregation strategy effectively balances consistency and detail in the segmentation process.
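The final non-maximum suppression step then reduces the merged proposals to a single valid, non-overlapping mask; a minimal sketch (the function name is mine):

    import torch
    import torch.nn.functional as F

    def masks_from_proposals(proposals, image_size=512):
        """proposals: (K, 64, 64) merged attention maps -> integer label mask."""
        labels = proposals.argmax(dim=0)            # winner-take-all per location
        labels = F.interpolate(labels[None, None].float(),
                               size=(image_size, image_size), mode="nearest")
        return labels[0, 0].long()                  # (image_size, image_size) mask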