Paper Walkthrough: Attention Guided Fixation Map Prediction with Explicit Image Priors

Introduction

Identifying points or image regions that immediately attract human attention is crucial in building computational models that predict where people first look. These models have direct applications in various computer vision tasks.

Visual scene understanding use cases:

  • Video surveillance
  • Robot navigation
  • Visual search
  • Object tracking
  • Image composition
  • Image retargeting

among others.


Let's understand the problem better:

Consider the image below (Fig:1): the first things we notice as humans are the two animals, even though there are other objects in the scene, such as the large rock in the background and the green plants in the foreground.

Fig:1


This is because our minds are trained to focus on the most "salient" (most noticeable or important) objects.

By studying where humans predominantly look in a scene, we can understand how they perceive complex images, and vision systems can use this understanding to prioritize information and devise better algorithms.


Problem statement:

To develop an automated solution for fixation prediction that can effectively handle wide variability in input imagery and operate at low computational cost, making it suitable for real-time deployments.


Why deep learning?

Human attention is influenced by numerous factors, including visual features (such as color, contrast, and motion), context, and even cognitive aspects. Machine learning algorithms, especially deep learning models, excel at identifying complex patterns and relationships within data that are not easily understood by humans.


Shortcomings of the existing solutions

  1. Limited generalizability: Existing solutions are trained on specific types of stimuli, such as natural scenes or simple geometric shapes, and their performance may degrade when applied to other types of stimuli.
  2. Limited understanding of high-level factors: Many existing solutions focus primarily on low-level visual features, such as color and texture, and do not consider the impact of higher-level factors, such as scene context and semantic information.
  3. Limited real-world applicability: Some existing fixation prediction models may not perform well in real-world environments, where factors such as lighting, noise, and camera calibration can affect gaze tracking.
  4. High computational cost: Current models are computationally heavy, limiting real-time and on-device use.


Research Contributions

  1. A scalable, lightweight solution for fixation prediction that is well suited for low-power devices.
  2. A novel two-stream fixation prediction network that exploits both deep learning and traditional visual features.
  3. The usage of explicit image priors improves the hit rate and reduces the false positives in the predictions.

Throughout this article, we will go over designing the model and assembling all the components part by part to create the complete pipeline.


Deciding the Model Input and Output:


Input:

  • An RGB image of 256×256 pixels.
  • All three color channels are used (as discussed above, color information is crucial for understanding saliency).


Output:

  • A grayscale saliency map of size 240×240 pixels.
  • The map highlights the "areas of interest" that observers attend to when viewing the image.
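A minimal sketch of this input/output contract (the file path, preprocessing, and mocked output below are placeholders, not from the paper):

```python
import torch
import torch.nn.functional as F
from PIL import Image
import torchvision.transforms as T

# "example.jpg" is a placeholder path, not part of the paper.
image = Image.open("example.jpg").convert("RGB")

# Model input: 3-channel RGB tensor at 256x256.
x = T.Compose([T.Resize((256, 256)), T.ToTensor()])(image).unsqueeze(0)   # (1, 3, 256, 256)

# Model output: single-channel saliency map at 240x240 (mocked here).
saliency = torch.rand(1, 1, 240, 240)

# Resize back to the original resolution for overlay/visualization.
saliency_full = F.interpolate(saliency, size=image.size[::-1], mode="bilinear", align_corners=False)
```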


Fig:2
Fig:3

Dataset

The SALICON dataset (API: https://github.com/NUS-VIP/salicon-api), built from a subset of images from the parent MS COCO 2014 dataset, was used to train the model.

SALICON is a dataset based on neurophysiological and psychophysical studies of peripheral vision, designed to simulate natural human viewing behavior. The aggregation of mouse trajectories from different viewers reflects the probability distribution of visual attention.

Details about the dataset:

  • Number of images: 20,000 (10,000 for training, 5,000 for validation, and 5,000 for testing)
  • Number of classes: 80
  • Image resolution: 640×480
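A minimal PyTorch loader sketch, assuming the SALICON images and their ground-truth saliency maps have been exported as matching files on disk (the directory layout is hypothetical, not part of the paper):

```python
import os
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T

class SaliconPairs(Dataset):
    """Sketch of a SALICON loader: pairs each RGB image with its
    ground-truth saliency map stored under the same file name."""
    def __init__(self, image_dir, map_dir):
        self.image_dir, self.map_dir = image_dir, map_dir
        self.names = sorted(os.listdir(image_dir))
        self.to_input = T.Compose([T.Resize((256, 256)), T.ToTensor()])   # model input size
        self.to_target = T.Compose([T.Resize((240, 240)), T.ToTensor()])  # model output size

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        smap = Image.open(os.path.join(self.map_dir, name)).convert("L")
        return self.to_input(image), self.to_target(smap)
```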


Designing the Model Backbone Architecture:

In this paper, the goal is to design a lightweight architecture for fixation prediction.

While any lightweight deep learning backbone model capable of generating feature maps could be used, we have utilized an EfficientNet Lite model for this purpose.


Fig:4


We use EfficientNet as the backbone network to obtain deep feature representations due to its superior learning capabilities and compactness compared to models like ResNet and VGG.

We select feature maps from three stages of the CNN: stages 3, 5, and 7, with 40, 112, and 320 channels, respectively, and aggregate them.
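A minimal sketch of extracting these three feature stages with the timm library (the specific EfficientNet-Lite variant and the upsample-and-concatenate aggregation are illustrative assumptions):

```python
import timm
import torch
import torch.nn.functional as F

# features_only=True with out_indices=(2, 3, 4) returns maps with
# 40, 112, and 320 channels for efficientnet_lite0, matching the stages used here.
backbone = timm.create_model("efficientnet_lite0", pretrained=True,
                             features_only=True, out_indices=(2, 3, 4))

x = torch.rand(1, 3, 256, 256)
f3, f5, f7 = backbone(x)
print([f.shape for f in (f3, f5, f7)])
# e.g. [1, 40, 32, 32], [1, 112, 16, 16], [1, 320, 8, 8]

# One simple way to aggregate: upsample to a common resolution and concatenate.
common = [F.interpolate(f, size=f3.shape[-2:], mode="bilinear", align_corners=False)
          for f in (f3, f5, f7)]
fused = torch.cat(common, dim=1)   # 40 + 112 + 320 = 472 channels
```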


Fig:5


With the skeleton model complete, we can now focus on improving its accuracy and learning potential.


Introduction of Priors:

To cover the broad spectrum of image stimuli, we generate four types of fundamental and complementary priors: Center, Boundary, Background, and Segmentation.

  • Center and Boundary Prior:

To capture the center bias in the input data, a 2D Gaussian map centered on the image is used as the Center Prior. The inverse of this map serves as the Boundary Prior (a code sketch for the center, boundary, and segmentation priors follows this list).

  • Background Prior:

The Background Prior is used to suppress the false detection of salient regions in the output saliency map.

  • Segmentation Priors:

Segmentation separates objects, boundaries, or structures within the image for more meaningful analysis.

We create a segmentation map using the well-known clustering-based segmentation algorithm SLIC, which is effective in learning the continuity and extent of salient regions. By employing multi-level segment maps, we capture wide variations in the input data.

We consider four levels of segmentation with cluster sizes of 5, 25, 45, and 65 to obtain four different segment maps.

These maps can function as a separate parallel stream from the deep feature extraction module.
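As a rough illustration, the center, boundary, and multi-level segmentation priors could be generated as follows (the map size, Gaussian spread, and SLIC parameters other than the cluster counts are assumptions):

```python
import numpy as np
from skimage.segmentation import slic

H = W = 240  # prior maps sized to match the output saliency map (an assumption)

# Center prior: 2D Gaussian centered on the image; boundary prior is its inverse.
ys, xs = np.mgrid[0:H, 0:W]
sigma = 0.25 * min(H, W)   # spread chosen for illustration
center_prior = np.exp(-(((xs - W / 2) ** 2) + ((ys - H / 2) ** 2)) / (2 * sigma ** 2))
boundary_prior = 1.0 - center_prior

# Multi-level segmentation priors: SLIC at the four cluster sizes from the paper.
def segmentation_priors(rgb_image):
    """rgb_image: HxWx3 array; returns four superpixel label maps."""
    return [slic(rgb_image, n_segments=k, start_label=0) for k in (5, 25, 45, 65)]
```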


Fig:6



Fig:7

Introduction of an Attention Module:

The Union Attention Module receives a concatenated input of feature maps from the Deep Feature Extraction Module and the set of image priors from the Image Prior Generation Module.

Since these concatenated feature maps originate from different data distributions, we employ both channel and spatial attention. Channel attention emphasizes the significant channels in the input feature representations, while spatial attention highlights the most informative spatial locations.
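The sketch below shows a generic channel-plus-spatial attention block in the spirit of the Union Attention Module (a CBAM-style layout; the paper's exact module design and channel counts may differ):

```python
import torch
import torch.nn as nn

class UnionAttention(nn.Module):
    """Channel + spatial attention over the concatenated deep features and priors."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze spatially, then weight each channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: weight each location from pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_mlp(x)                      # emphasize informative channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial_conv(pooled)             # emphasize informative locations

# Example: 472 backbone channels + 6 prior maps concatenated (numbers illustrative).
attn = UnionAttention(channels=478)
out = attn(torch.rand(1, 478, 32, 32))
```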


Fig:8


Fig:9

Choosing a Loss Function:

We combine the binary cross-entropy (BCE), Intersection over Union (IoU), and L1 loss functions to form the training objective. Although this combination was originally designed for Salient Object Detection, we found it useful for Fixation Prediction as well.

By taking pixel intensity into account, this approach effectively highlights the most salient region relative to its surrounding area.

The BCE term takes the familiar form L_bce = −Σ_c [ y_c · log(ŷ_c) + (1 − y_c) · log(1 − ŷ_c) ], where y and ŷ denote the label and predicted probability of binary class c.
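A minimal sketch of the combined objective (equal weighting of the three terms is an assumption; pred and target are saliency maps in [0, 1]):

```python
import torch
import torch.nn.functional as F

def fixation_loss(pred, target, eps=1e-7):
    """BCE + soft IoU + L1 over predicted and ground-truth saliency maps."""
    bce = F.binary_cross_entropy(pred, target)

    # Soft IoU computed over pixel intensities.
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    iou = 1.0 - (inter + eps) / (union + eps)

    l1 = torch.abs(pred - target).mean()
    return bce + iou.mean() + l1

loss = fixation_loss(torch.rand(2, 1, 240, 240), torch.rand(2, 1, 240, 240))
```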


Results


Table 1 - Quantitative Analysis


Fig:10 - Qualitative Analysis

Ablation Study:

We start with a model that incorporates all the prior information and gradually remove each of the four image priors (Center, Boundary, Segmentation, and Background) to assess the influence of each individual prior on the fixation prediction task.

The table below shows the resulting configurations and their quantitative results at each stage. We selected a set of 500 challenging images from the SALICON dataset and compared the performances.

When all four priors were included, results showed improvements of

  • 5.97% in Similarity
  • 3.4% in CC
  • 27.5% in KL Divergence

Table 2
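For reference, the Similarity (SIM), CC, and KL Divergence scores reported above can be computed from a predicted and a ground-truth saliency map roughly as follows (standard definitions from the saliency evaluation literature; exact implementation details vary across benchmarks):

```python
import numpy as np

def _normalize_dist(x, eps=1e-12):
    """Normalize a saliency map so that it sums to 1 (a probability distribution)."""
    x = x.astype(np.float64)
    return x / (x.sum() + eps)

def similarity(pred, gt):
    """SIM: sum of element-wise minima of the two normalized maps."""
    return np.minimum(_normalize_dist(pred), _normalize_dist(gt)).sum()

def cc(pred, gt):
    """CC: Pearson correlation coefficient between the two maps."""
    return np.corrcoef(pred.flatten(), gt.flatten())[0, 1]

def kld(pred, gt, eps=1e-12):
    """KL divergence of the predicted distribution from the ground truth."""
    p, g = _normalize_dist(pred), _normalize_dist(gt)
    return (g * np.log(g / (p + eps) + eps)).sum()
```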

We compare fixation maps with and without prior streams in the figure below. The fixation maps for the sample set of images highlight the significance of each prior in covering the spectrum of image stimuli, as seen in improved hit rates and reduced false positives.


Additionally, we note that the model's predictions with priors are considerably more consistent with the ground truth than those without priors.

Fig:11

Conclusion:

In this paper, we presented a novel and lightweight fixation prediction network that is robust to real-world data variations through the effective exploitation of prior knowledge. The inclusion of prior information helps capture deep semantics. The proposed method's improved prediction accuracy and low model complexity make it highly suitable for deployment on low-power devices.


References:

  • Min Seok Lee, WooSeok Shin, and Sung Won Han. TRACER: Extreme attention guided salient object tracing network. CoRR, abs/2112.07380, 2021.
  • Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. SALICON: Saliency in context. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. CoRR, abs/1905.11946, 2019.
  • Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Sen Jia. EML-NET: An expandable multi-layer network for saliency prediction. arXiv, 2018.
  • Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. A deep multi-level network for saliency prediction. 2016.



Find the paper here: https://computer-vision-in-the-wild.github.io/cvpr-2023/static/cvpr2023/accepted_papers/13/CameraReady/CVinW__23.pdf
