A Face Mask Detector


TL;DR

I developed a computer vision architecture that detects whether individuals in close proximity to a camera are wearing their face masks appropriately (i.e. with the nose covered). The model serves as a proof of concept that could help businesses monitor whether their premises are in line with the national effort to slow the spread of COVID-19.


Dashboard and Videos

A web app built using Streamlit (an open-source Python library) showcases how the architecture works on two sample videos. Here's the link to the dashboard.
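To give a flavor of how little code such a dashboard requires, here is a minimal Streamlit sketch that plays pre-processed sample clips. The file names are assumptions for illustration, not the actual assets used in the project:

```python
# Minimal Streamlit dashboard sketch (file names below are hypothetical).
import streamlit as st

st.title("Face Mask Detector")
st.write("Demo of the mask-compliance architecture on sample videos.")

# Each clip is assumed to have been processed offline and saved to disk.
sample = st.selectbox("Choose a sample video", ["sample_1.mp4", "sample_2.mp4"])
st.video(sample)
```

Running `streamlit run app.py` would serve a page like this locally.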


Motivation

Almost two years into the COVID-19 pandemic, we still find ourselves needing to wear face masks in most indoor settings here in Melbourne (and in many other cities as well), partially due to recent concerns over the spread of the new Omicron variant.

While I am cognizant that the debate surrounding face masks is often polarizing, it is my personal opinion that masks can help prevent the spread of droplets (and therefore the virus) from a talking, sneezing, or coughing individual in close proximity. Experiments like this one by researchers at UNSW seem to suggest so.

Unstructured data such as CCTV footage represents an under-utilized resource for gathering insights while reducing the manpower required for data collection. Businesses could potentially run this computer vision model on CCTV footage captured at their entrances and within their premises, generating data for subsequent statistical analysis of how effective their COVID-safe policies are.


Figure 1: Example of an individual with face mask worn appropriately. A green bounding box encapsulates the detected face mask. The green circle is tagged to the predicted nose location in terms of x,y coordinates. A bounding box with a continuous blue line is drawn over the individual to indicate that he is wearing his mask in a way that his nose is covered. He is assigned an ID by the object tracker, which tracks his location and mask status during inference.




Figure 2: Examples of individuals covering their mouths with their hands/clothing without masks. No masks were detected within the bounding box surrounding each individual. Similarly, the green circle is tagged to the predicted nose location in terms of x,y coordinates. A bounding box with a dashed red line is drawn over the individual since in both instances, no masks were detected even though their faces were partially covered by their hands/clothing. An ID is assigned by the object tracker, which enables the recording of the fact that these individuals were unmasked at this particular frame (assuming a video input).




Figure 3: Examples of individuals wearing masks inappropriately, as the masks did not cover their noses. Although a mask was detected on the respective individuals, it was not worn in such a way that the nose was covered. This was determined by the fact that the predicted location of the nose (green circle) was not within the boundaries of the bounding box surrounding the detected mask. Therefore, the individuals were correctly flagged as unmasked even though a mask was detected.



A High-Level Overview

The architecture consists of two separate deep learning components:

  1. A mask detector utilizing a custom-trained detectron2 Faster R-CNN model. This detector was developed via transfer learning: a pretrained Faster R-CNN model from the detectron2 library was fine-tuned on hundreds of labeled mask instances.
  2. A human detector utilizing a pretrained Keypoint R-CNN model from the detectron2 library. This model is pretrained on the COCO dataset and can detect not only people in images/video frames, but also a person's keypoints, such as the eyes, nose, shoulders, and ankles. A sketch of how both models could be loaded follows this list.
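As an illustration, both components could be instantiated with detectron2's `DefaultPredictor` roughly as follows. The backbone choice, score thresholds, and weights path are assumptions for the sketch, not the exact configuration used in the project:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# 1. Custom mask detector: Faster R-CNN fine-tuned on labeled mask instances.
mask_cfg = get_cfg()
mask_cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
mask_cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1              # single "mask" class (assumption)
mask_cfg.MODEL.WEIGHTS = "mask_detector_weights.pth"  # hypothetical fine-tuned weights
mask_cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
mask_predictor = DefaultPredictor(mask_cfg)

# 2. Human detector: Keypoint R-CNN pretrained on COCO, straight from the model zoo.
kp_cfg = get_cfg()
kp_cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
kp_cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
kp_cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
person_predictor = DefaultPredictor(kp_cfg)
```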


Detecting Masks and People

During inference, the video input is fed frame by frame, first to the face mask detector and then to the human detector. The face mask detector pinpoints the x,y coordinates of detected masks in the image, and the detections can then be visualized by drawing a bounding box around each detected mask (for example, the green bounding box drawn around the detected face mask in Figure 1).
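Continuing from the sketch above, the per-frame loop could look like this with OpenCV (the input file name is hypothetical):

```python
import cv2

cap = cv2.VideoCapture("cctv_footage.mp4")  # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # DefaultPredictor expects a BGR image, which is OpenCV's native format.
    mask_instances = mask_predictor(frame)["instances"].to("cpu")
    person_instances = person_predictor(frame)["instances"].to("cpu")
    # Each Instances object carries pred_boxes (x1, y1, x2, y2) and scores,
    # which can be drawn onto the frame for visualization.
cap.release()
```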

However, all that is known at this point is that masks were detected somewhere in the frame. We still need to know whether there are people in the image and whether the detected masks belong to them. This is where the second component, the pretrained Keypoint R-CNN model, comes into play. It not only detects each person in the image, but also the location of each individual's nose. For visualization purposes, a bounding box with a continuous blue line is drawn around each detected individual who is wearing a face mask appropriately, and a green circle is drawn at the predicted location of each individual's nose (Figure 1).
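In detectron2's COCO keypoint ordering, the nose is keypoint index 0, so each person's nose coordinates can be pulled directly from the Keypoint R-CNN output of the loop above:

```python
NOSE_IDX = 0  # index of the nose in the COCO 17-keypoint ordering

# pred_keypoints has shape (num_people, 17, 3): (x, y, score) per keypoint.
keypoints = person_instances.pred_keypoints.numpy()
nose_points = keypoints[:, NOSE_IDX, :2]                   # (x, y) of each nose
person_boxes = person_instances.pred_boxes.tensor.numpy()  # (x1, y1, x2, y2) per person
```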


Determining if an individual is appropriately masked

Why not utilize a faster object detector such as YOLO for detecting people? Because the Keypoint R-CNN model is already pretrained to detect noses among many other human keypoints, that ability can be leveraged to determine whether a mask is worn in a way that covers the nose. This is done by checking whether the predicted nose x,y coordinates lie within the boundaries of a detected mask's bounding box. Without knowing the x,y coordinates of the nose, individuals whose masks do not cover their noses would never be flagged as non-compliant. The individuals in Figure 3 are selected examples that illustrate this issue of poorly worn masks: the model correctly flags them as 'unmasked' even though a mask was detected within each individual's bounding box. By including a pretrained human keypoint detector, the model is therefore able to detect and record non-compliant individuals.
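The compliance check itself reduces to a simple point-in-rectangle test, sketched below using the variables from the previous snippets:

```python
def is_properly_masked(nose_xy, mask_boxes):
    """Return True if the nose point falls inside any detected mask box."""
    x, y = nose_xy
    for x1, y1, x2, y2 in mask_boxes:
        if x1 <= x <= x2 and y1 <= y <= y2:
            return True
    return False

mask_boxes = mask_instances.pred_boxes.tensor.numpy()
# One compliance flag per detected person in the current frame.
statuses = [is_properly_masked(nose, mask_boxes) for nose in nose_points]
```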


Tracking Individuals and Their Status

If only still images were being analyzed, what has been covered so far would be sufficient for subsequent analysis. However, since the architecture is envisioned to run on video footage captured from CCTVs, there is an added dimension of time. A good tracker is needed to assign each individual captured in the video a specific ID and to track their movements, together with their mask compliance status, from frame to frame throughout the footage. This enables the collection of useful data such as the number of individuals who were unmasked, the locations within the premises that they ventured to, the time of day, and the number of other individuals in close proximity, all of which would feed into subsequent statistical analysis. By utilizing a tracker, each individual's mask status in each captured frame can be recorded and later analyzed.

ByteTrack, the tracker utilized in this project, is a 2021 state-of-the-art (SOTA) multi-object tracker that performs relatively well at minimizing ID switches, especially during occlusion. An ID switch occurs when the same individual is assigned multiple IDs at different points in the video. This normally happens when a tracked individual disappears from view because he/she is occluded by another object or person: the tracker loses the individual for a number of frames and assigns a new ID the moment he/she becomes visible again. With this capability, the model is able to keep track of the mask status of multiple individuals from frame to frame.
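A rough sketch of wiring the person detections into the tracker follows, assuming the `BYTETracker` interface from the official ByteTrack repository; the hyperparameter values are illustrative, and the exact import path and `update()` signature should be checked against the version of the repo in use:

```python
import numpy as np
from types import SimpleNamespace
from yolox.tracker.byte_tracker import BYTETracker  # from the official ByteTrack repo

# Tracker hyperparameters (illustrative values, not tuned).
args = SimpleNamespace(track_thresh=0.5, track_buffer=30, match_thresh=0.8, mot20=False)
tracker = BYTETracker(args, frame_rate=30)

# Per frame: person detections as an (N, 5) array of [x1, y1, x2, y2, score].
scores = person_instances.scores.numpy()[:, None]
detections = np.hstack([person_boxes, scores])
h, w = frame.shape[:2]
tracks = tracker.update(detections, (h, w), (h, w))

for track in tracks:
    # track.track_id persists across frames; pairing it with the mask status
    # computed for the matching detection lets compliance be logged over time.
    print(track.track_id, track.tlbr)
```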


Limitations

Mask detection accuracy

The current iteration of the custom-trained mask detector still has difficulty detecting masks in a number of instances, especially when individuals are further away from the camera. This could be due to the inherent difficulty of detecting smaller objects, compounded by the fact that most of the training data consisted of individuals close to the camera rather than crowds of people wearing masks. Further training on a more diverse dataset with many instances of multiple individuals wearing masks at a distance would be instrumental in improving the detector's performance.


Lack of context

For some businesses such as restaurants, there are situations where it is perfectly fine for some individuals to be unmasked while others have to remain masked. With the current implementation, the system would still flag unmasked diners having a meal as non-compliant, even though it would be obvious to any human that it is completely reasonable to eat without a mask on. This would produce a great deal of false positives in the data: the model only detects the locations of people and masks, without classifying each individual's activity, so it cannot differentiate the contexts in which being unmasked is acceptable.



Future Work

Utilizing Synthetic Data

Obtaining training data that was legally free to use, and then labeling it manually, was the most difficult and tedious part of the project. The lack of a large and diverse training dataset hampered the performance of the custom mask detector. One way to increase the number of training instances relatively quickly would be to generate synthetic data, using computer graphics software and/or game engines such as Blender and Unreal Engine to produce photorealistic training images. This approach can theoretically generate an unlimited number of training instances without manual labeling. Furthermore, to help the model generalize well, features such as lighting, camera settings, colors, and geometry can be varied in an automated manner using a scripting language such as Python.
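As a sketch of what such automation could look like inside Blender, the snippet below randomizes a scene's lighting and camera before each render. The object names ("Sun", "Camera") and the output path are assumptions about a hypothetical scene:

```python
# Run inside Blender's Python environment (bpy).
import random
import bpy

scene = bpy.context.scene
light = bpy.data.objects["Sun"]      # hypothetical light object in the scene
camera = bpy.data.objects["Camera"]  # hypothetical camera object

for i in range(1000):  # each iteration renders one synthetic training image
    light.data.energy = random.uniform(0.5, 5.0)        # vary brightness
    light.rotation_euler[0] = random.uniform(0.0, 1.2)  # vary light angle (radians)
    camera.location.x += random.uniform(-0.1, 0.1)      # jitter the viewpoint

    scene.render.filepath = f"/tmp/synthetic/mask_{i:04d}.png"
    bpy.ops.render.render(write_still=True)
```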


Human Action Recognition

By incorporating human action recognition in addition to object classification and localization, the model would be able to assign each detected individual a predicted action/activity over a particular window of frames. However, this would increase the required inference time, as an additional dimension of time is taken into account when multiple frames are needed to classify an action. Nevertheless, this approach would be particularly useful for analyzing CCTV footage from F&B premises.
