Our New Paper: End-to-End Multitask Learning for Driver Gaze and Head Pose Estimation


80% of crashes involve driver distraction


Paper: "End-to-End Multitask Learning for Driver Gaze and Head Pose Estimation"; Electronic Imaging (EI 2020), Society for Imaging Science and Technology.

Abstract

Most modern automobile accidents are caused by inattentive driving, which is why driver gaze estimation is becoming a critical component in the automotive industry. Gaze estimation poses many challenges arising from the surrounding environment, such as changes in illumination, the driver's head motion, partial face occlusion, or eye decorations such as glasses.

Most previous work in this field either explicitly extracts hand-crafted features, such as eye corners and the pupil center, to estimate gaze, or uses appearance-based methods such as Convolutional Neural Networks, which implicitly extract features from an image and map them directly to the corresponding gaze angles.

In this work, a multitask Convolutional Neural Network architecture is proposed to predict the subject's gaze yaw and pitch angles, with head pose estimation as an auxiliary task. This makes the model robust to head pose variations, without requiring any complex preprocessing or hand-crafted feature extraction.

The model achieves 78.2% accuracy in cross-subject testing (on subjects never seen during training, in any pose), demonstrating the model's generalization capability and robustness to head pose variation.




Challenges facing driver gaze estimation

  • Person independence: the ability to generalize to any subject.
  • Variation in head pose: the ability to detect gaze accurately regardless of the orientation of the head.
  • Eye decorations: subjects wearing glasses or other eyewear.


An End-to-End solution to driver gaze estimation using a single Convolutional Neural Network (CNN).

  • An End-to-End multitask learning network for driver gaze and head pose estimation.
  • No need for explicit feature extraction or multiple networks.
  • A single deep network is trained to predict the subject's head pose angle as an auxiliary task.
  • Because the network is regression-based, i.e. it outputs gaze angles as continuous values, it can learn the spatial relation between gaze points, which is something a classification approach would fail to do.
  • We cluster the predicted gaze values into classes, which is the relevant output in the driving scenario. (A minimal code sketch of the architecture follows this list.)
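To make the idea concrete, here is a minimal PyTorch sketch of the multitask setup. It is illustrative only: the layer sizes, the single head-pose output, and the 0.5 auxiliary loss weight are assumptions, not the paper's exact architecture.

```python
# Minimal multitask sketch: a shared convolutional backbone feeds two
# regression heads, one for gaze (yaw, pitch) and one for head pose.
import torch
import torch.nn as nn

class MultitaskGazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared backbone; the paper initializes early layers from the
        # VGG face descriptor, here an untrained stand-in is used.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.gaze_head = nn.Linear(128, 2)   # gaze yaw & pitch (continuous)
        self.pose_head = nn.Linear(128, 1)   # head pose, auxiliary task

    def forward(self, x):
        feats = self.backbone(x)
        return self.gaze_head(feats), self.pose_head(feats)

model = MultitaskGazeNet()
mse = nn.MSELoss()
images = torch.randn(8, 3, 224, 224)             # dummy batch
gaze_gt, pose_gt = torch.randn(8, 2), torch.randn(8, 1)

gaze_pred, pose_pred = model(images)
# Combined objective: the auxiliary head-pose loss regularizes the shared
# features; the 0.5 weight is an assumed value, not taken from the paper.
loss = mse(gaze_pred, gaze_gt) + 0.5 * mse(pose_pred, pose_gt)
loss.backward()
```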


Dataset

  • Columbia Gaze: consists of 5,880 high-resolution images of 56 subjects.
  • 5 horizontal head poses (0°, ±15°, ±30°), 7 horizontal gaze directions (0°, ±5°, ±10°, ±15°) and 3 vertical gaze directions (0°, ±10°).
  • This yields 7 × 3 = 21 gaze targets and 5 head poses, i.e. 21 × 5 = 105 images per subject (the counts are made explicit in the snippet below).
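For clarity, the label grid can be enumerated explicitly; the following snippet just reproduces the counts above.

```python
# The dataset's label grid, enumerated explicitly: 7 horizontal x 3
# vertical gaze directions = 21 gaze targets, times 5 head poses = 105
# images per subject, times 56 subjects = 5880 images in total.
from itertools import product

head_poses = [-30, -15, 0, 15, 30]                  # horizontal, degrees
gaze_yaw = [-15, -10, -5, 0, 5, 10, 15]             # horizontal gaze
gaze_pitch = [-10, 0, 10]                           # vertical gaze

gaze_targets = list(product(gaze_yaw, gaze_pitch))  # 21 (yaw, pitch) pairs
images_per_subject = len(gaze_targets) * len(head_poses)

print(len(gaze_targets))          # 21
print(images_per_subject)         # 105
print(images_per_subject * 56)    # 5880
```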


Setup and Pre-Processing

  • Excluded 6 subjects from the dataset, reserving them for cross-subject testing.
  • Initialized the weights of the first 4 layers of our architecture with the pre-trained weights of the VGG face descriptor model.
  • Data augmentation: random contrast, brightness, Gaussian noise, etc.; no affine transformations (a sketch of such a pipeline follows this list).
  • Clustered the 21 gaze targets into 9 classes for practical reasons and to simplify the task.
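A minimal sketch of such a photometric-only augmentation pipeline, assuming torchvision; the parameter values are illustrative, not the paper's.

```python
# Photometric-only augmentation sketch. Affine transforms are deliberately
# absent: rotating or flipping the face would change the true gaze and
# head-pose angles and invalidate the labels.
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.02):
    """Additive Gaussian pixel noise on a [0, 1] tensor image."""
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3),  # random brightness/contrast
    transforms.ToTensor(),                                 # PIL -> float tensor in [0, 1]
    transforms.Lambda(add_gaussian_noise),
])
# usage: augmented = augment(pil_image)
```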


Experiments

Exp.1: Simple Classification

  • Classified gaze into one of 9 regions/classes.
  • No head pose aux. task

Exp.2: Simple Regression

  • Predicted gaze Yaw & Pitch angles.
  • Clustered the predicted values into 9 classes (see the sketch after this experiment).
  • No head pose aux. task
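A minimal sketch of this post-hoc clustering step, using nearest-centroid assignment. The 3 × 3 grid of centers below is a hypothetical placeholder; the actual 9 classes come from grouping the dataset's 21 gaze targets.

```python
# Turn continuous (yaw, pitch) predictions into one of 9 gaze regions by
# assigning each prediction to the closest region center.
import numpy as np

centers = np.array([(y, p) for y in (-15, 0, 15) for p in (-10, 0, 10)],
                   dtype=float)          # 9 assumed (yaw, pitch) centers

def to_region(yaw, pitch):
    """Return the index (0-8) of the closest gaze-region center."""
    d = np.linalg.norm(centers - np.array([yaw, pitch]), axis=1)
    return int(np.argmin(d))

print(to_region(12.3, -8.7))   # e.g. maps to the (15, -10) region
```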

Exp.3: Feature Fusion (Regression)

  • Pretrained a separate network to predict head pose.
  • Used it as a feature extractor, concatenating its feature vectors with those of the gaze network during training (sketched below).
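A minimal sketch of this fusion baseline, with illustrative shapes (not the paper's exact networks).

```python
# Feature-fusion sketch (Exp. 3): a separately pretrained head-pose network
# is frozen and used as a feature extractor; its features are concatenated
# with the gaze network's features before the final regression layer.
import torch
import torch.nn as nn

pose_extractor = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in pose_extractor.parameters():
    p.requires_grad = False          # frozen: trained beforehand on head pose

gaze_extractor = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
gaze_out = nn.Linear(64 + 32, 2)     # regress yaw & pitch from fused features

x = torch.randn(8, 3, 224, 224)
fused = torch.cat([gaze_extractor(x), pose_extractor(x)], dim=1)
yaw_pitch = gaze_out(fused)
```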

Exp.4: Multitask Learning (Regression)

  • Predicted gaze Yaw & Pitch angles + head pose as an auxiliary task.


Results



It is clear that the multitask learning network achieved the best results, even on subjects it had never seen before.


Saliency Maps

Visualizing what the network has learned using saliency maps makes it clear that the gaze head focuses on the eye pupils, while the head pose head focuses on the face contour and the eyes. (A sketch of how such maps can be computed follows.)
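A minimal vanilla-gradient saliency sketch, assuming the multitask model sketched earlier; the paper does not specify which saliency method it used, so this is one common choice.

```python
# Vanilla-gradient saliency: the absolute gradient of one output (e.g.
# gaze yaw) with respect to the input pixels highlights the image regions
# the network relies on for that output.
import torch

def saliency(model, image, head="gaze"):
    """image: (3, H, W) tensor; returns an (H, W) saliency map."""
    x = image.unsqueeze(0).requires_grad_(True)
    gaze_pred, pose_pred = model(x)
    target = gaze_pred[0, 0] if head == "gaze" else pose_pred[0, 0]
    target.backward()
    return x.grad[0].abs().max(dim=0).values   # max over color channels
```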



Conclusions

  • We propose an End-to-End multitask learning solution to gaze estimation using a single CNN.

Two techniques are used to enhance the accuracy of our method:

  • First, we use a regression rather than a classification approach. This stems from the fact that there is an underlying correlation between gaze regions that a pure classification approach would fail to capture.
  • Second, we use multitask learning, where the network is trained to predict the subject's head pose angle as an auxiliary task alongside its main task of predicting gaze.

Since the appearance of the eye varies with the head pose, training one network on both gaze and head pose estimation tasks simultaneously has proven to enhance the results.


Best Regards
