Paper Walkthrough: Attention Guided Fixation Map Prediction with Explicit Image Priors
Anand Hegde
ML Intern@Arm | Ex - Samsung Research | Patents and Papers | MS in Artificial Intelligence | Getting Machines to Learn Things |
Introduction
Identifying points or image regions that immediately attract human attention is crucial in building computational models that predict where people first look. These models have direct applications in various computer vision tasks.
Visual scene understanding is one such use case, among a few others.
Let's understand the problem better:
Consider the image below (Fig. 1): the first things we notice as humans are the two animals, even though there are other objects in the scene, such as the large rock in the background and the green plants in the foreground.
This is because our minds are trained to focus on the most "salient" (most noticeable or important) objects.
By studying where humans predominantly look in a scene, we can understand how they perceive complex images; vision engines can then use this understanding to prioritize information when devising better vision-based algorithms.
Problem statement:
To develop an automated solution for fixation prediction that can effectively handle wide variability and operate at optimal computational costs, suitable for real-time deployments.
Why deep learning?
Human attention is influenced by numerous factors, including visual features (such as color, contrast, and motion), context, and even cognitive aspects. Machine learning algorithms, especially deep learning models, excel at identifying complex patterns and relationships within data that are not easily understood by humans.
Shortcomings of the existing solutions
Research Contributions
Throughout this article, we will go over designing the model and assembling all the components part by part to create the complete pipeline.
Deciding the Model Input and Output:
Input:
Output:
Dataset
SALICON (API: https://github.com/NUS-VIP/salicon-api) was used to train the model; it was built from a subset of images from the MS COCO 2014 dataset.
SALICON is a dataset based on neurophysiological and psychophysical studies of peripheral vision, designed to simulate natural human viewing behavior. The aggregation of mouse trajectories from different viewers reflects the probability distribution of visual attention.
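The aggregated attention data is typically rendered as a continuous fixation map by blurring the discrete gaze/mouse points with a Gaussian. The SALICON API provides ready-made annotations, so the following is only a minimal NumPy sketch of that idea; the canvas size and sigma here are illustrative, not the dataset's actual parameters:

```python
import numpy as np

def fixation_points_to_map(points, height, width, sigma=8.0):
    """Blur discrete fixation points into a continuous fixation map
    by summing an isotropic Gaussian at each point, then normalizing."""
    ys, xs = np.mgrid[0:height, 0:width]
    fix_map = np.zeros((height, width), dtype=np.float64)
    for (px, py) in points:
        fix_map += np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
    if fix_map.max() > 0:
        fix_map /= fix_map.max()  # scale to [0, 1]
    return fix_map

# Toy example: two fixation points on a 64x64 canvas
fm = fixation_points_to_map([(16, 16), (48, 40)], 64, 64)
```

The resulting map peaks at the fixated locations and decays smoothly, which is what lets it be read as a probability distribution of visual attention.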
Details about the dataset:
Designing the Model Backbone Architecture:
In this paper, the goal is to design a lightweight architecture for fixation prediction.
While any lightweight deep learning backbone model capable of generating feature maps could be used, we have utilized an EfficientNet Lite model for this purpose.
We use EfficientNet as the backbone network to obtain deep feature representations due to its superior learning capabilities and compactness compared to models like ResNet and VGG.
We select feature maps from three stages of the CNN: stages 3, 5, and 7, with 40, 112, and 320 channels, respectively, and aggregate them.
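The three stage outputs live at different spatial resolutions, so aggregating them means resizing to a common size and stacking along the channel axis. The sketch below illustrates this with NumPy and nearest-neighbour upsampling; the paper's actual fusion (and the backbone itself) is a learned deep-learning pipeline, and the 256x256-input stride assumptions here are mine:

```python
import numpy as np

def upsample_nearest(feat, target_hw):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    c, h, w = feat.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return feat[:, rows][:, :, cols]

def aggregate_stages(stage3, stage5, stage7):
    """Resize the three stage outputs to the largest spatial size and
    concatenate along the channel axis (40 + 112 + 320 = 472 channels)."""
    target = stage3.shape[1:]
    feats = [stage3,
             upsample_nearest(stage5, target),
             upsample_nearest(stage7, target)]
    return np.concatenate(feats, axis=0)

# Dummy stage outputs at strides 8, 16, 32 for a hypothetical 256x256 input
s3 = np.random.rand(40, 32, 32)
s5 = np.random.rand(112, 16, 16)
s7 = np.random.rand(320, 8, 8)
agg = aggregate_stages(s3, s5, s7)  # shape (472, 32, 32)
```

Pulling features from shallow, middle, and deep stages gives the head access to both fine spatial detail and high-level semantics.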
With the skeleton model complete, we can now focus on improving its accuracy and learning potential.
Introduction of Priors:
To cover the broad spectrum of image stimuli, we generate four types of fundamental and complementary priors: Center, Boundary, Background, and Segmentation.
To capture the center bias in the input data, a 2D Gaussian map is used as the Center Prior. The inverse of this map serves as the Boundary Prior.
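These two priors are cheap to compute. A minimal NumPy sketch (the sigma scale is an illustrative choice, not the paper's value):

```python
import numpy as np

def center_prior(height, width, sigma_scale=0.25):
    """2D Gaussian centred on the image, modelling the center bias."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = height * sigma_scale, width * sigma_scale
    g = np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2.0)
    return g / g.max()  # normalize to [0, 1]

center = center_prior(64, 64)
boundary = 1.0 - center  # the inverse map serves as the Boundary Prior
```

The Center Prior peaks in the middle of the frame, where photographers tend to place subjects, while its inverse emphasizes the image borders.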
The Background Prior is used to suppress the false detection of salient regions in the output saliency map.
Segmentation separates objects, boundaries, or structures within the image for more meaningful analysis.
We create a segmentation map using the well-known clustering-based segmentation algorithm SLIC, which is effective in learning the continuity and extent of salient regions. By employing multi-level segment maps, we capture wide variations in the input data.
We consider four levels of segmentation with cluster sizes of 5, 25, 45, and 65 to obtain four different segment maps.
These maps can function as a separate parallel stream from the deep feature extraction module.
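The multi-level segmentation step can be sketched as follows. The paper uses SLIC proper (available, e.g., as scikit-image's slic); to keep this example dependency-free I substitute a toy k-means over (row, col, intensity) features, which only approximates SLIC's spatially regularized clustering:

```python
import numpy as np

def kmeans_segments(image, k, iters=10, seed=0):
    """Toy SLIC-like segmentation: k-means over (row, col, intensity)
    features of a grayscale image, returning an (H, W) label map."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([ys.ravel() / h, xs.ravel() / w, image.ravel()], axis=1)
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest cluster center
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers from the assigned pixels
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(axis=0)
    return labels.reshape(h, w)

img = np.random.rand(32, 32)
# Four levels of segmentation, as in the paper
segment_maps = [kmeans_segments(img, k) for k in (5, 25, 45, 65)]
```

Coarse levels (5 clusters) capture the rough layout of the scene, while fine levels (65 clusters) preserve object boundaries, so the stacked maps cover a wide range of region granularities.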
Introduction of an Attention Module:
The Union Attention Module receives a concatenated input of feature maps from the Deep Feature Extraction Module and the set of image priors from the Image Prior Generation Module.
Since these concatenated feature maps originate from different data distributions, we employ both channel and spatial attention. Channel attention emphasizes the significant channels in the input feature representations, while spatial attention highlights the informative regions within each map.
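The exact design of the Union Attention Module is specific to the paper; the sketch below only illustrates generic channel attention (squeeze-and-excitation style) and spatial attention (CBAM style) in NumPy, with the learned layers reduced to a single weight matrix for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w):
    """Channel attention on a (C, H, W) map: global average pool ->
    linear layer -> sigmoid -> per-channel reweighting."""
    squeeze = feat.mean(axis=(1, 2))   # (C,) channel descriptors
    weights = sigmoid(w @ squeeze)     # (C,) attention weights
    return feat * weights[:, None, None]

def spatial_attention(feat):
    """Spatial attention: pool over channels -> sigmoid -> per-location
    reweighting (the learned convolution is omitted for brevity)."""
    pooled = feat.mean(axis=0)         # (H, W) spatial descriptor
    mask = sigmoid(pooled - pooled.mean())
    return feat * mask[None, :, :]

c, h, w_ = 8, 4, 4
x = np.random.rand(c, h, w_)
W = np.random.default_rng(0).standard_normal((c, c)) / np.sqrt(c)
y = spatial_attention(channel_attention(x, W))
```

Applying the two in sequence lets the module first decide which feature channels (deep features vs. individual priors) matter, then where in the image to focus.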
Choosing a Loss Function:
We combine the binary cross-entropy (BCE), Intersection over Union (IoU), and L1 loss functions to form the overall loss. Although this combination was initially designed for Salient Object Detection, we found it useful for Fixation Prediction as well.
By taking pixel intensity into account, this approach effectively highlights the most salient region relative to its surrounding area.
BCE(y, ŷ) = −[y · log(ŷ) + (1 − y) · log(1 − ŷ)], where y and ŷ denote the label and the predicted probability of binary class c.
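The combined objective can be sketched as follows, with the BCE, IoU, and L1 terms summed with equal weights (the paper may weight them differently; the epsilon and the equal weighting here are my assumptions):

```python
import numpy as np

def combined_loss(pred, target, eps=1e-7):
    """BCE + IoU + L1 loss over dense prediction/target maps in [0, 1]."""
    p = np.clip(pred, eps, 1.0 - eps)
    # Pixel-wise binary cross-entropy
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    # Soft IoU loss: penalizes poor overlap of the salient region
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    iou = 1.0 - (inter + eps) / (union + eps)
    # L1 loss: penalizes per-pixel intensity differences
    l1 = np.abs(pred - target).mean()
    return bce + iou + l1

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0
perfect = combined_loss(gt, gt)       # near-zero loss
bad = combined_loss(1.0 - gt, gt)     # large loss
```

The three terms are complementary: BCE supervises each pixel, IoU supervises the overlap of the salient region as a whole, and L1 keeps predicted intensities close to the ground truth.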
Results
Ablation Study:
We start with a model that incorporates all the prior information and gradually remove each of the four image priors (Center, Boundary, Segmentation, and Background) to assess the influence of each individual prior on the fixation prediction task.
The table below shows the resulting configurations and their quantitative results at each stage. We selected a set of 500 challenging images from the SALICON dataset and compared the performances.
When all four priors were included, results showed improvements of
We compare fixation maps with and without prior streams in the figure below. The fixation maps for the sample set of images highlight the significance of each prior in covering the spectrum of image stimuli, as seen in improved hit rates and reduced false positives.
Additionally, we note that the model's predictions with priors are considerably more consistent with the ground truth than those without priors.
Conclusion:
In this paper, we presented a novel and lightweight fixation prediction network that is robust to real-world data variations through the effective exploitation of prior knowledge. The inclusion of prior information helps capture deep semantics. The proposed method's improved prediction accuracy and low model complexity make it highly suitable for deployment on low-power devices.
References: