Our New Paper: End-to-End Multitask Learning for Driver Gaze and Head Pose Estimation
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group
80% of crashes involve driver distraction
Paper: "End-to-End Multitask Learning for Driver Gaze and Head Pose Estimation"; Electronic Imaging (EI 2020), Society for Imaging Science and Technology.
Abstract
Most modern automobile accidents occur due to drivers' inattentive behavior, which is why driver gaze estimation is becoming a critical component in the automotive industry. Gaze estimation introduces many challenges arising from the surrounding environment, such as changes in illumination, driver head motion, partial face occlusion, or eye decorations.
Most previous work in this field either explicitly extracts hand-crafted features such as eye corners and the pupil center to estimate gaze, or uses appearance-based methods such as Convolutional Neural Networks, which implicitly extract features from an image and directly map them to the corresponding gaze angle.
In this work, a multitask Convolutional Neural Network architecture is proposed to predict the subject's gaze yaw and pitch angles, along with the head pose as an auxiliary task, making the model robust to head pose variations without needing any complex preprocessing or hand-crafted feature extraction.
The model achieves 78.2% accuracy in cross-subject testing (subjects never seen during training, in any head pose), demonstrating the model's generalization capability and robustness to head pose variation.
Challenges facing driver gaze estimation
- Person independence: The ability to generalize on any subject.
- Variation in head pose: The ability to accurately detect gaze regardless of the orientation of the head.
- Eye decorations: Subjects wearing glasses or similar accessories.
An End-to-End solution to driver’s gaze estimation using a single Convolutional Neural Network (CNN).
- End-to-End Multitask learning Network for driver Gaze and Head pose detection.
- No need for explicit feature extraction or multiple networks.
- Train a deep network to predict the subject’s head pose angle as an auxiliary task.
- The network is regression-based, i.e. it outputs gaze angles as continuous values, which enables it to learn the spatial relation between gaze points, something a classification approach would fail to do.
- We cluster the predicted gaze values into classes, which is relevant in the driving scenario.
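The single-network design described above (a shared feature extractor feeding one head that regresses gaze yaw/pitch and a second head that regresses head pose) can be illustrated with a minimal numpy sketch. The layer sizes, weights, and `backbone` placeholder below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the shared CNN backbone: any function mapping an
# image to a feature vector would do here.
def backbone(image, feat_dim=128):
    flat = image.reshape(-1)
    W = rng.standard_normal((feat_dim, flat.size)) * 0.01
    return np.tanh(W @ flat)

# Two task-specific regression heads on top of the SAME shared features.
W_gaze = rng.standard_normal((2, 128)) * 0.01   # -> (yaw, pitch)
W_pose = rng.standard_normal((1, 128)) * 0.01   # -> head pose angle (auxiliary)

def forward(image):
    feats = backbone(image)
    gaze = W_gaze @ feats        # main task: continuous gaze angles
    pose = W_pose @ feats        # auxiliary task: head pose angle
    return gaze, pose

gaze, pose = forward(rng.standard_normal((64, 64)))
print(gaze.shape, pose.shape)   # (2,) (1,)
```

Because both heads share the backbone, gradients from the auxiliary pose task shape the same features the gaze head uses, which is the mechanism behind the robustness to head pose variation claimed above.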
Dataset
- Columbia Gaze: Consists of 5880 high-resolution images (total 56 subjects)
- 5 horizontal head poses (0°, ±15°, ±30°), 7 horizontal gaze directions (0°, ±5°, ±10°, ±15°) and 3 vertical gaze directions (0°, ±10°)
- The dataset thus covers 21 gaze classes (7 horizontal × 3 vertical) and 5 head poses, yielding 21 × 5 = 105 images per subject.
Setup and Pre-Processing
- Excluded 6 subjects from the dataset to be used for cross-subject testing.
- Initialized the weights of the first 4 layers in our architecture using the pre-trained weights of VGG face descriptor model.
- Data augmentation: random contrast, brightness, Gaussian noise, etc.; no affine transformations.
- Clustered the 21 gaze points into 9 classes for practical reasons and to simplify the task.
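One plausible way to cluster the 7 × 3 gaze grid into 9 classes is a 3 × 3 binning of yaw and pitch. The paper summary does not specify the exact cluster boundaries, so the thresholds below are illustrative assumptions:

```python
def gaze_class(yaw_deg, pitch_deg):
    """Map continuous (yaw, pitch) gaze angles to one of 9 regions.

    Columns: left / centre / right; rows: down / centre / up.
    Threshold values are assumptions, not taken from the paper.
    """
    col = 0 if yaw_deg < -2.5 else (1 if yaw_deg <= 2.5 else 2)
    row = 0 if pitch_deg < -5.0 else (1 if pitch_deg <= 5.0 else 2)
    return row * 3 + col   # class id in 0..8

# The dataset's 7 horizontal x 3 vertical gaze grid collapses to 9 classes:
classes = {gaze_class(y, p) for y in (0, 5, -5, 10, -10, 15, -15)
                            for p in (0, 10, -10)}
print(sorted(classes))     # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The same binning can be applied to the network's continuous regression outputs at test time, which is how a regression model can still be evaluated as a 9-class classifier.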
Experiments
Exp.1: Simple Classification
- Classified gaze into one of 9 regions/classes.
- No head pose aux. task
Exp.2: Simple Regression
- Predicted gaze Yaw & Pitch angles.
- Clustered the predicted values into 9 classes.
- No head pose aux. task
Exp.3: Feature Fusion (Regression)
- Pretrained a separate network to predict head pose.
- Used as a feature extractor and concatenated its resulting feature vectors with the feature vectors of the gaze network during training.
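The fusion step in Exp. 3 amounts to concatenating the pretrained head-pose network's feature vector with the gaze network's features before the final regression layer. A minimal sketch with assumed feature dimensions (128 and 64 are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

gaze_feats = rng.standard_normal(128)   # from the gaze network (trainable)
pose_feats = rng.standard_normal(64)    # from the pretrained pose network (frozen)

# Concatenate the two feature vectors, then regress yaw and pitch
# from the fused representation.
fused = np.concatenate([gaze_feats, pose_feats])   # shape (192,)
W_out = rng.standard_normal((2, fused.size)) * 0.01
yaw_pitch = W_out @ fused                          # shape (2,)
print(fused.shape, yaw_pitch.shape)
```

Unlike the multitask setup of Exp. 4, this variant needs two networks and only shares information in one direction, from the pose extractor into the gaze regressor.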
Exp.4: Multitask Learning (Regression)
- Predicted gaze Yaw & Pitch angles + head pose as an auxiliary task.
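A common way to train such a multitask regressor is a weighted sum of per-task mean squared errors. The weighting factor `lam` below is a hypothetical hyperparameter, since the exact loss weighting is not given in this summary:

```python
import numpy as np

def multitask_loss(gaze_pred, gaze_true, pose_pred, pose_true, lam=0.5):
    """Weighted sum of MSE losses: main gaze task + auxiliary pose task.

    lam is a hypothetical task-weighting hyperparameter (assumption).
    """
    mse = lambda a, b: float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    return mse(gaze_pred, gaze_true) + lam * mse(pose_pred, pose_true)

# Perfect predictions give zero loss; pose errors contribute at weight lam.
print(multitask_loss([0.0, 0.0], [0.0, 0.0], [0.0], [0.0]))   # 0.0
print(multitask_loss([1.0, 1.0], [0.0, 0.0], [2.0], [0.0]))   # 1.0 + 0.5 * 4.0 = 3.0
```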
Results
The multitask learning network achieved the best results, even on subjects it had never seen before.
Saliency Maps
Visualizing what the network has learned using saliency maps shows that the gaze head focuses on the eye pupils, while the head pose head focuses on the face contour and the eyes.
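A saliency map of this kind is the magnitude of the output's gradient with respect to each input pixel. For any differentiable model it can be approximated by central finite differences; the sketch below is a generic illustration under that definition, not the paper's visualization code, and the toy `f` is a made-up stand-in for a network head:

```python
import numpy as np

def saliency_map(f, image, eps=1e-4):
    """Approximate |d f / d pixel| for every pixel via central differences."""
    image = np.asarray(image, dtype=float)
    grad = np.zeros_like(image)
    for i in range(image.size):
        hi, lo = image.copy(), image.copy()
        hi.flat[i] += eps
        lo.flat[i] -= eps
        grad.flat[i] = (f(hi) - f(lo)) / (2 * eps)
    return np.abs(grad)

# Toy "gaze head": only the first pixel influences the output, so the
# saliency map should light up there and stay ~0 elsewhere.
f = lambda img: 3.0 * img.flat[0]
sal = saliency_map(f, np.zeros((2, 2)))
# sal is ~3.0 at pixel (0, 0) and ~0.0 everywhere else
```

In practice such maps are computed with backpropagation on the trained network; the finite-difference version above merely makes the definition concrete.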
Conclusions
- We propose an End-to-End Multitask learning solution to gaze estimation using a single CNN.
Two techniques are used to enhance the accuracy of our method:
- First, we use a regression rather than a classification approach. This comes from the fact that there is an underlying spatial correlation between gaze regions that a pure classification approach would fail to capture.
- Second, we use Multitask Learning where the network is trained to predict the subject's head pose angle as an auxiliary task along with its main task of predicting gaze.
Since the appearance of the eye varies with the head pose, training one network on both gaze and head pose estimation tasks simultaneously has proven to enhance the results.
Best Regards