A Face Mask Detector
TL;DR
Developed a computer vision architecture that detects whether individuals in close proximity to the camera are wearing their face masks appropriately (i.e. with the nose covered). The model serves as a proof of concept that could help businesses monitor whether their premises are in line with the national effort to slow the spread of COVID-19.
Dashboard and Videos
A web app built using Streamlit (an open-source Python library) showcases how the architecture works on 2 sample videos. Here's the link to the dashboard.
Motivation
Almost 2 years into the COVID-19 pandemic, we still find ourselves needing to wear face masks in most indoor settings here in Melbourne (and in many other cities as well), partially due to recent concerns over the spread of the new Omicron variant.
While I am cognizant that the debate surrounding the wearing of face masks is often polarizing, it is my personal opinion that masks can help prevent the spread of droplets (and therefore the virus) from a talking/sneezing/coughing individual in close proximity. Experiments like this one by researchers at UNSW seem to suggest so.
Unstructured data such as CCTV footage represents an under-utilized resource for garnering greater insights while reducing the manpower required during data collection. Businesses could potentially run this computer vision model on CCTV footage captured at their entrances and within the premises themselves, to gather data for subsequent statistical analysis of how effective their COVID-safe policies are.
Figure 1: Example of an individual with face mask worn appropriately. A green bounding box encapsulates the detected face mask. The green circle is tagged to the predicted nose location in terms of x,y coordinates. A bounding box with a continuous blue line is drawn over the individual to indicate that he is wearing his mask in a way that his nose is covered. He is assigned an ID by the object tracker, which tracks his location and mask status during inference.
Figure 2: Examples of individuals covering their mouths with their hands/clothing without masks. No masks were detected within the bounding box surrounding each individual. Similarly, the green circle is tagged to the predicted nose location in terms of x,y coordinates. A bounding box with a dashed red line is drawn over the individual since in both instances, no masks were detected even though their faces were partially covered by their hands/clothing. An ID is assigned by the object tracker, which enables the recording of the fact that these individuals were unmasked at this particular frame (assuming a video input).
Figure 3: Examples of individuals wearing masks inappropriately, as the masks did not cover their noses. Although a mask was detected on the respective individuals, it was not worn in such a way that the nose was covered. This was determined by the fact that the predicted location of the nose (green circle) was not within the boundaries of the bounding box surrounding the detected mask. Therefore, the individuals were correctly flagged as unmasked even though a mask was detected.
A High Level Overview
The architecture consists of 2 separate components, each built on a neural network (deep learning) model:

1. A custom-trained face mask detector that localizes masks in each frame.
2. A pretrained Keypoint R-CNN model that detects people and their body keypoints, including the nose.
Detecting Masks and People
During inference, the video input is fed frame by frame, first to the face mask detector and then to the human detector. The face mask detector pinpoints the x,y coordinates of detected masks in the image, and the detections can then be visualized by drawing a bounding box around each detected mask (for example, the green bounding box drawn around the detected face mask in Figure 1).
However, all that is known at this point is that mask(s) were detected in the video frame/image. We still need to know whether there are people in the image and whether these previously detected masks belong to them. This is where the second component, the pretrained Keypoint R-CNN model, comes into play. This model not only detects each person in the image, but also the location of each individual's nose. For visualization purposes, a bounding box with a continuous blue line is drawn around each detected individual who is wearing a face mask appropriately, and a green circle is drawn at the predicted location of each individual's nose (Figure 1).
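To make the data flow concrete, here is a minimal sketch (not the project's exact code) of extracting each person's bounding box and nose location from a Keypoint R-CNN prediction. torchvision's pretrained Keypoint R-CNN returns, per image, a dict containing 'boxes' (N, 4), 'scores' (N,) and 'keypoints' (N, 17, 3); in the COCO keypoint ordering, index 0 is the nose, and each keypoint is (x, y, visibility). The helper name and threshold below are illustrative.

```python
NOSE_IDX = 0  # COCO keypoint ordering: the nose is the first of the 17 keypoints

def extract_people(prediction, score_threshold=0.8):
    """Return a list of (person_box, nose_xy) for confident detections."""
    people = []
    for box, keypoints, score in zip(
        prediction["boxes"], prediction["keypoints"], prediction["scores"]
    ):
        if score < score_threshold:
            continue  # discard low-confidence person detections
        x, y, visibility = keypoints[NOSE_IDX]
        nose_xy = (x, y) if visibility > 0 else None  # nose may be occluded
        people.append((tuple(box), nose_xy))
    return people

# Example with a mocked-up model output (real outputs are tensors):
fake_prediction = {
    "boxes": [(50, 40, 180, 320)],
    "scores": [0.97],
    "keypoints": [[(110, 70, 1)] + [(0, 0, 0)] * 16],  # nose visible at (110, 70)
}
print(extract_people(fake_prediction))  # [((50, 40, 180, 320), (110, 70))]
```

In the real pipeline the same parsing is applied to every frame's prediction before the nose/mask comparison described below.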
Determining if an individual is appropriately masked
Why not use a faster object detector such as YOLO to detect people? Since the Keypoint R-CNN model is already pretrained to detect noses among many other human keypoints, that ability can be leveraged to determine whether the mask is worn in such a way that the nose is covered as well. This is done by checking whether the predicted nose x,y coordinates lie within the boundaries of a detected mask's bounding box. Without knowing the x,y coordinates of the nose, individuals whose masks do not cover their noses would not be flagged as non-compliant. The individuals in Figure 3 are just some selected examples that illustrate this issue of poorly worn masks, whereby the model correctly flags them as 'unmasked' even though a mask was detected within each individual's bounding box. Therefore, by including a pretrained human keypoint detector, the model is able to detect and record individuals who are non-compliant.
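The compliance check itself reduces to a point-in-box test. The sketch below captures the logic described above; an individual counts as properly masked only if the predicted nose coordinates fall inside a detected mask's bounding box. Boxes are (x1, y1, x2, y2), and the function names are illustrative rather than the project's actual code.

```python
def point_in_box(point, box):
    """True if (x, y) lies within the box's boundaries (inclusive)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def is_properly_masked(nose_xy, mask_boxes):
    """True if the nose lies within any detected mask's bounding box."""
    if nose_xy is None:
        return False  # nose keypoint not visible: cannot confirm coverage
    return any(point_in_box(nose_xy, box) for box in mask_boxes)

# Mask detected over the lower face, nose inside the mask box -> compliant
print(is_properly_masked((108, 95), [(90, 85, 130, 120)]))   # True
# Mask worn under the nose: nose sits above the mask box -> flagged unmasked
print(is_properly_masked((108, 80), [(90, 85, 130, 120)]))   # False
```

The second call is exactly the Figure 3 situation: a mask is detected, but because the nose lies outside the mask's bounding box, the individual is still flagged as unmasked.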
Tracking Individuals and their status
If only still images were being analyzed, what has been covered so far in this article would be sufficient for subsequent analysis. However, since the envisioned input is video footage captured from CCTVs, there is an added dimension of time involved. A good tracker is needed to assign each individual captured in the video a specific ID number and to track their movements, together with their mask compliance status, from frame to frame throughout the entire footage. This enables the collection of important data such as: the number of individuals who were not masked; the areas of the premises they ventured to; the time of day; and the number of other individuals in close proximity. All of this data would be useful for any subsequent statistical analysis. By utilizing a tracker, each individual's 'mask' status from each captured frame can be recorded and subsequently analyzed.
ByteTrack, the tracker utilized in this project, is a 2021 state-of-the-art (SOTA) multi-object tracker that has shown relatively good performance in minimizing ID switches, especially during occlusion. An ID switch occurs when the same individual is assigned multiple IDs at different points in the video. This normally happens during occlusion, whereby the tracked individual disappears from view because he/she is occluded by another object or person; the tracker loses track of the individual for a number of frames and then assigns the same individual a new ID the moment he/she becomes visible again. With this capability, the model is able to keep track of the 'mask' status of multiple individuals from frame to frame.
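Once a tracker supplies stable IDs, accumulating the per-frame compliance status is straightforward bookkeeping. The sketch below assumes the tracker yields, for each frame, a list of (track_id, masked) pairs; ByteTrack's actual API differs, so treat these names as placeholders for the idea.

```python
from collections import defaultdict

def record_mask_status(frames):
    """frames: iterable of per-frame [(track_id, masked_bool), ...] lists."""
    history = defaultdict(list)  # track_id -> [(frame_idx, masked), ...]
    for frame_idx, detections in enumerate(frames):
        for track_id, masked in detections:
            history[track_id].append((frame_idx, masked))
    return history

def unmasked_frame_count(history, track_id):
    """Number of observed frames in which this individual was unmasked."""
    return sum(1 for _, masked in history[track_id] if not masked)

# Three frames: ID 1 stays masked; ID 2 is unmasked and briefly occluded
frames = [
    [(1, True), (2, False)],
    [(1, True)],              # ID 2 occluded in this frame
    [(1, True), (2, False)],  # same ID 2 reappears (no ID switch)
]
log = record_mask_status(frames)
print(unmasked_frame_count(log, 2))  # 2
```

A log of this shape is what downstream statistical analysis would consume: per-ID timelines from which counts, durations, and locations of non-compliance can be derived.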
Limitations
Mask detection accuracy
The current iteration of the custom-trained mask detector still has difficulty detecting masks in a number of instances, especially when individuals are further away from the camera. This could be due to the inherent difficulty of detecting smaller objects, and the fact that most of the training data consisted of individuals in close proximity to the camera rather than crowds of people wearing masks. Further training on a more diverse dataset, with many instances of multiple individuals wearing masks at a distance, would be instrumental in improving the detector's performance.
Lack of context
For some businesses such as restaurants, there are situations where it is perfectly fine for some individuals to be unmasked while others must remain masked. With the current implementation, the system would still flag unmasked diners having a meal as non-compliant, even though it would appear to any human that it is completely reasonable to eat without a mask on. This results in a great deal of false positives, because all the model does is detect the locations of people and masks; it does not classify each individual's activity, and therefore cannot differentiate the contexts or activities in which it is acceptable to not have a mask on.
Future Work
Utilizing Synthetic Data
Obtaining training data that was legally free to use and then labeling it manually was the most difficult and tedious part of the project. The limited size and diversity of the training dataset hampered the performance of the custom mask detector. One possible method of increasing the number of training instances relatively easily and quickly would be to generate synthetic data. This can be done by utilizing computer graphics software and/or game engines such as Blender and Unreal Engine to generate photorealistic images for training. This approach can theoretically generate an unlimited number of training instances quickly, without the need for manual labeling. Furthermore, to help the model generalize well, features such as lighting effects, camera settings, colors, geometry, etc. can be easily modified in an automated manner using a scripting language such as Python.
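The randomization step can be sketched as plain parameter sampling. In practice the sampled values would drive a renderer such as Blender through its Python API; the parameter names and ranges below are purely illustrative assumptions, chosen to mimic CCTV-style scenes.

```python
import random

MASK_COLORS = ["white", "blue", "black", "patterned"]

def sample_scene_params(rng):
    """Sample one synthetic scene's randomized rendering parameters."""
    return {
        "light_energy": rng.uniform(200.0, 1500.0),   # lamp strength
        "camera_distance_m": rng.uniform(1.0, 15.0),  # close-ups through far crowds
        "camera_height_m": rng.uniform(2.0, 4.0),     # typical CCTV mounting height
        "mask_color": rng.choice(MASK_COLORS),
        "num_people": rng.randint(1, 12),
    }

rng = random.Random(42)  # fixed seed so the dataset is reproducible
params = [sample_scene_params(rng) for _ in range(1000)]
# Each rendered image is labeled automatically: the renderer already knows
# the ground-truth mask and nose positions it placed in the scene.
```

Because the renderer knows exactly where it placed every mask and every nose, bounding boxes and keypoints come for free, which is precisely what removes the manual labeling bottleneck.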
Human Action Recognition
By incorporating human action recognition in addition to object classification and localization, the model would be able to assign each detected individual a predicted action/activity for a particular window of frames. However, this would increase the required inference time, as an additional dimension of time is taken into account when multiple frames are needed to classify an action. Nevertheless, this approach would be useful for analyzing CCTV footage from F&B premises.