What is the accuracy and what does it mean to the business?
Daniel Falk
CTO/Founder · Custom EDGE apps for network cameras · AI/ML/Computer Vision · MLOps · Edge Analytics · Entrepreneur · Writing code, motivating developers and uncovering valuable insights
While in New York, I was strolling around an automated no-checkout Amazon Go store, surrounded by cameras, sensors and AI. Since I am allergic to some foods, I lifted an energy bar, read the ingredient declaration and put it back. After doing this repeatedly, I started to wonder: what accuracy does a system like this need? What measurements do you even use to define it?
I will focus on the system and the customer interaction of an automated no-checkout store, but the important point is that a complex system consisting of multiple stages has no simple definition of accuracy or performance. Which metrics are important depends on how the system will be used and on what you are trying to understand or improve.
As an AI engineer, you might want detailed metrics for each specific step in the process to know what to improve. As a product owner, you might need a single metric describing the whole system. As a UX engineer, you should be interested in how the numbers are distributed, and what drives them, across users and situations, in order to understand the impact on the users and avoid problematic bias or discrimination.
Many of the major players in the tech industry, such as Google, Microsoft and LinkedIn, have had embarrassing experiences with unfair or discriminating algorithms or products7. Facial recognition used by the FBI for public safety has been shown in tests to be ten times more likely to falsely match two black women than two white women8. Identifying risks like this, and understanding the user’s experience, often requires the UX engineer’s thorough knowledge of the market and the customers, something that is too often overlooked today.
The system of a no-checkout store is complex. Each time I lift a piece of merchandise, an action classifier has to decide whether I picked it up or not, whether it was a single unit or several, whether I put it back on the shelf, or whether I put something else back on the wrong shelf. Furthermore, it has to decide who performed the action: is there someone next to me reaching in front of me to take the candy bar?
Let’s start with a deep dive into arguably the simplest part of the system to evaluate: deciding whether an item was picked up or not. For classification, we normally group each prediction into four categories: true positives (the product was picked up and classified as picked up), true negatives (the product was not picked up and was not classified as picked up), false positives (the product was not picked up but was classified as picked up) and false negatives (the product was picked up but was not classified as picked up). This is the case of a binary classifier, and accuracy is defined as the number of correct predictions over all predictions:
accuracy = (TP + TN) / (TP + TN + FP + FN)
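As an illustration, here is a minimal sketch in plain Python of how the four counts and the accuracy could be computed; the label lists are made up purely for illustration.

```python
# Minimal sketch: confusion-matrix counts and accuracy for a binary
# "picked up" classifier. The labels below are hypothetical examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = product was actually picked up
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # 1 = classifier said "picked up"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy:.2f}")
```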
From this, it can be seen that specifying only a lower limit on the accuracy leaves information out: the tendency to miss that an item was picked up versus the tendency to falsely predict that an item was picked up when it was not. This balance is usually easy to configure or change during the training of a classifier, and the effect of changing it can be visualized using a ROC curve. This balance might be a business-critical decision with a high impact on customers, but it is too often left for an AI engineer or programmer to decide ad hoc. The question is: is it better that some customers are not charged for a product, or that some customers get falsely charged for products they didn't take?
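To make this trade-off explicit rather than implicit, one can sweep the classifier's decision threshold and inspect the resulting error rates, which is exactly what a ROC curve summarizes. Below is a minimal NumPy sketch with made-up ground truth and confidence scores; it is an illustration of the idea only, not a description of how any particular store system works.

```python
import numpy as np

# Hypothetical ground truth (1 = picked up) and classifier confidence scores.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.6, 0.75, 0.2, 0.55, 0.65, 0.1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = scores >= threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fpr = fp / (fp + tn)  # share of "not picked up" events that would be charged
    tpr = tp / (tp + fn)  # share of real pick-ups we catch (sensitivity)
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold catches more pick-ups but falsely charges more customers; choosing where to sit on that curve is exactly the business decision argued for above.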
In the case of an Amazon Go-like store, where it is important to make customers comfortable with being observed by hundreds of cameras while doing their everyday shopping, the effect of a high false-positive ratio (overcharging) could probably be devastating. Eliminating the risk of false positives completely is not an option. This is a problem Amazon Go has reduced by making refunds extremely easy: by simply swiping left on an item in your receipt, you get refunded with no considerable effort1. This can also be used as an input to pinpoint situations where the algorithms have failed, resulting in more valuable training data. It must, however, be pointed out that these indicators are most probably biased towards reporting false positives over false negatives, since customers benefit from not reporting a product missing from their receipt. This makes it a possibly dangerous, although valuable, metric to track.
With the action classification we have discussed so far, one major issue remains: what should be counted as a true negative? The classification is done in continuous time. A product can be picked up at any point in time, so how do we count the number of instances where the product was not picked up and the detector didn't trigger? One tempting way to solve this when using video analytics could be to count the number of video frames that fulfill the requirement. This would, however, create a biased and unfair evaluation, since there could be more than one product picked up in each video frame, and certainly more than one false positive from the detector in each video frame. A possibility would be to manually split the videos into segments, group them based on the true number of picked-up products in the sequence and then calculate an accuracy measure for each group, demanding that the number of predicted detections is correct. Although metrics like this could be very valuable for finding the key drivers of accuracy, there are simpler alternatives. A common solution is to use sensitivity (a.k.a. recall) and precision as measures instead of accuracy.
sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
The sensitivity is thus an estimate of the probability that a picked-up product triggers a detection. In contrast, precision estimates the probability that a detection was actually triggered by a real product being picked up. Having two numbers describing these probabilities makes it easier to prioritize them against each other depending on the detector’s use case. In some cases, there is a need for a single metric describing how good the predictions are; in that case, we normally use the f1-score, which is a combination (more specifically the harmonic mean) of sensitivity and precision.
f1-score = 2 * (sensitivity * precision) / (sensitivity + precision)
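Continuing the sketch from before, the three metrics follow directly from the counts; the numbers below are hypothetical.

```python
# Minimal sketch: sensitivity, precision and f1-score from made-up counts.
tp, fp, fn = 90, 10, 25  # hypothetical detection counts for a validation set

sensitivity = tp / (tp + fn)  # aka recall: share of real pick-ups detected
precision = tp / (tp + fp)    # share of detections that were real pick-ups
f1 = 2 * sensitivity * precision / (sensitivity + precision)

print(f"sensitivity={sensitivity:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Note that a true-negative count never appears, which is exactly why these metrics sidestep the "what is a true negative in continuous time" problem.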
By reducing from two dimensions (sensitivity and precision) to one dimension (f1-score), we lose valuable information but gain a metric that is easier to understand and more straightforward to compare. In order to have the cake and eat it too, we should strive for a simple and graspable visualization where it is easy to drill down to details. We could default to the f1-score and allow everyone to easily drill down to sensitivity vs. precision, or to the f1-score aggregated by the number of products picked up in the same video segment. Deciding which drill-down features to offer again ties back to the use case of the product and to the question of in which situations it is acceptable to have worse performance, provided the overall performance is compensated for by the easier situations.
Stepping back, it is not enough to classify what a person picked up. We also need to know who the person is to know which account to charge. When entering the store, the customer scans their phone to connect their physical person with their payment account. Each time the person picks up a product, we need to match them to the person who scanned their phone. This can be done either by reidentification (re-id) between two or more points in time, or by tracking the person from the entrance all the way to the exit. To gain the best performance, a combination of the two is probably preferred, given that the computation cost is acceptable.
In the reidentification step, we create a gallery of all the people walking into the store. As soon as someone picks up a product, we match the image of them (called the probe) to the correct image in the gallery. When they exit the store, we can remove them from the gallery. Reidentification is normally assessed using the cumulative match characteristic (CMC) curve, which describes the ratio of matches where the correct person is found within a set of k guessed persons. The curve is created by plotting the hit ratio against the variable k. In biometric systems for security surveillance and similar applications, it is often enough to narrow down the selection to a few persons and let an operator select the correct match. In the case of a cashier-less store, however, it is a must to find the one and only person in order to place the cost on the right account. This is called the rank-one performance.
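To make the CMC curve and rank-one performance concrete, here is a minimal sketch assuming a precomputed distance matrix between probe images and gallery identities; the matrix and the ground-truth assignments are hypothetical.

```python
import numpy as np

# dist[i, j]: distance between probe i and gallery identity j (hypothetical).
dist = np.array([
    [0.20, 0.90, 0.70, 0.80],
    [0.60, 0.30, 0.50, 0.90],
    [0.30, 0.40, 0.35, 0.70],
])
true_gallery_idx = np.array([0, 1, 2])  # correct gallery entry for each probe

# Rank of the correct identity for each probe (0 = best match).
order = np.argsort(dist, axis=1)  # gallery entries sorted by distance per probe
ranks = np.array([np.where(order[i] == true_gallery_idx[i])[0][0]
                  for i in range(len(true_gallery_idx))])

# CMC: share of probes whose correct identity is within the top k guesses.
for k in range(1, dist.shape[1] + 1):
    print(f"rank-{k} accuracy: {np.mean(ranks < k):.2f}")

# rank-1 (k = 1) is the number that matters for charging the right account.
```

In this toy example the third probe is only found at rank two, so the rank-one performance is 0.67 while the rank-two performance is 1.00.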
The problem is made easier if we have a closed-set identification problem2, in this case meaning that we know that a person seen in the store has also walked through the entrance. This assumption can be questioned due to tailgating or occlusion. When the gallery is small, there are not as many possible matches and the rank-one performance tends to be higher2. When more people enter the store, the problem becomes harder and the accuracy tends to decline. This also means that when sampling images for your validation set, you can significantly affect the performance metric of the re-id by the selection you make, whether consciously or accidentally, e.g., by sampling at a time with low footfall.
Another factor that can greatly affect the performance is the diversity of the views: is the view of the person in the probe similar to the view of the same person in the gallery? Having different camera models mounted at different view angles and in different lighting conditions makes the problem harder. Evaluation of the system can also be affected by the selection of probes and galleries in the validation set. Should we only use probes taken from another camera than the gallery, or should we select randomly and allow the same view (but at different times) for both the probe and gallery images? The most important thing is to be systematic and to remember this choice when comparing different algorithms.
Object tracking can also be used to know the identity of a person walking through the store. Relying only on tracking builds on the assumption that we can watch and track the person at all times from entering the store to exiting the store, which arguably is false. There exist multiple types of measures for tracking accuracy, such as true positives (a correctly detected track), false positives (a “ghost track”), false negatives (a missed track), merged tracks and split tracks3. Some of the measures might be biased, e.g., towards rating tracking of large objects with higher accuracy than small objects3. Many measures also rely on arbitrary thresholds for, e.g., ratio of overlap for bounding boxes or maximum distance between centroids. These thresholds might be hard to relate to the use case and might make it hard to compare different published algorithms.
One very common metric that reduces false negatives, false positives and identity switches to a single number is the MOTA metric4. Inspecting the different metrics, we can see that if we bill customers in the store using a combination of tracking and reidentification, an identity swap would be very bad, since we could charge the wrong person for the product. A split of the track or a false negative would probably be something we could deal with using the information from the reidentification. In security surveillance, on the other hand, a missed track is probably much worse than a false positive, which could be filtered out by a human operator.
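Concretely, MOTA sums the three error types over all frames and normalizes by the total number of ground-truth objects. A minimal sketch with made-up counts:

```python
# Minimal sketch of the MOTA computation (CLEAR MOT), with hypothetical
# counts summed over all frames of an imagined validation sequence.
false_negatives = 120    # ground-truth persons we missed
false_positives = 45     # "ghost" detections
id_switches = 8          # a track jumps from one person to another
num_ground_truth = 5000  # total ground-truth person instances over all frames

mota = 1.0 - (false_negatives + false_positives + id_switches) / num_ground_truth
print(f"MOTA = {mota:.3f}")

# Note how all three error types are weighted equally: for billing, an
# identity switch is far more costly than a missed detection, which a
# single MOTA number does not reflect.
```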
It has been shown that human observers give more importance to whether a person is detected than to the absence of identity swaps4, indicating that such perception-aligned weighting would be of little help when evaluating a no-checkout store system. Since tracking is an incremental process that builds on top of the previous step, it becomes harder and harder the longer the track is. Since the normally used metrics for multi-object tracking do not consider the length of the track, it becomes obvious that the selection of the validation set can affect the accuracy metric; e.g., different times of the day may have different characteristics regarding how long a person spends in the store. It might also be that we want to value different groups of customers differently. If someone enters the store, looks around and exits after a short while, buying nothing, it is not important whether we tracked them correctly or not. The persons who spend a lot of time in the store buying lots of products might be more important. Here we don't want the short tracks to lift the average accuracy so that it looks better than it actually is in the truly important situations.
Once you exit the store, there is an approximately 10-minute delay before the receipt shows up in the app. Although the analytics are very heavy and might struggle to run in real time, I believe the large delay is there to allow a human operator to occasionally verify or correct the analytics result. Using a method known as human-in-the-loop, the operator gets fed a subset of the cases, which can be selected, e.g., based on the confidence of the analytics. Some situations are easy for the detector and classifier, and these might be sent directly to the customer’s app, while others are more complicated and are corrected by the operator. The operator serves not only to correct the faulty classifications but also to generate high-value annotated data from the specific situations where the classifier struggles.
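A minimal sketch of such confidence-based routing is shown below; the threshold, the event structure and the field names are all hypothetical, not a description of Amazon's actual pipeline.

```python
# Minimal sketch of confidence-based routing in a human-in-the-loop setup.
# Events below a (hypothetical) threshold are queued for an operator;
# the rest are billed automatically. Operator decisions double as new,
# high-value annotated training data.
REVIEW_THRESHOLD = 0.85  # hypothetical confidence cut-off

events = [
    {"customer": "A", "item": "energy bar", "confidence": 0.97},
    {"customer": "B", "item": "candy bar", "confidence": 0.62},
]

auto_billed, review_queue = [], []
for event in events:
    if event["confidence"] >= REVIEW_THRESHOLD:
        auto_billed.append(event)   # sent straight to the customer's receipt
    else:
        review_queue.append(event)  # verified or corrected by an operator

print(f"{len(auto_billed)} billed automatically, {len(review_queue)} sent for review")
```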
Probabilistic classifiers yield a confidence together with the classification. Some algorithms, such as neural networks trained with a well-selected loss function, output an estimate of the class distribution, but it might be accurate only for the kind of data seen during training. These classifiers can be proficient at predicting confidence in common situations but might show very high confidence when classifying a totally unknown situation. In these cases, it is important to measure the accuracy of the confidence estimation, to see if the algorithm "knows when it does not know". By grouping the validation data by the predicted confidence and plotting the accuracy for each group, we get a reliability diagram6. Different reliability diagrams can be created for different situations to gain insight into when the predicted confidence is accurate and when it is not. Confidence histograms can be used to analyze how the predictions are distributed over confidence levels.
An example of a reliability diagram and a confidence histogram visualizing the algorithm's ability to "know when it does not know". Image credits to hollance/reliability-diagrams on GitHub. License: MIT
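The binning behind such a diagram can be sketched as follows: group predictions by their confidence and compare each bin's average confidence to its actual accuracy; the per-bin gap is what the reliability diagram visualizes, and its weighted average is the expected calibration error (ECE). The data below is made up for illustration.

```python
import numpy as np

# Hypothetical predicted confidences and whether each prediction was correct.
confidences = np.array([0.55, 0.62, 0.71, 0.78, 0.83, 0.88, 0.91, 0.95, 0.97, 0.99])
correct = np.array([0, 1, 0, 1, 1, 1, 1, 1, 0, 1])

bin_edges = np.linspace(0.5, 1.0, 6)  # five equally wide confidence bins
ece = 0.0
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    # Inclusive edges for simplicity; no value here lands exactly on an edge.
    in_bin = (confidences >= lo) & (confidences <= hi)
    if not in_bin.any():
        continue
    avg_confidence = confidences[in_bin].mean()  # what the model believes
    bin_accuracy = correct[in_bin].mean()        # what actually happened
    ece += in_bin.mean() * abs(avg_confidence - bin_accuracy)
    print(f"bin {lo:.1f}-{hi:.1f}: confidence {avg_confidence:.2f}, accuracy {bin_accuracy:.2f}")

print(f"expected calibration error (ECE) = {ece:.3f}")
```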
If a lack of training data in specific situations has led to overfitting and too high confidence in the classifications, then sampling new training data based on the confidence will add data from situations other than the affected ones, potentially making the problem even more severe. Data collection should therefore be done using several different strategies, and the importance of visualizing the drivers behind different metrics should not be underestimated.
In conclusion, I argue that one cannot measure the accuracy of an algorithm or a technology. The only way is to measure the accuracy in a known use case. You can always find a metric defined to achieve the "accuracy" you are looking for, but this metric means nothing unless it reflects the business value the algorithm brings you. The metrics you set and the drivers you understand will define the whole product, and thus it is very important that stakeholders at all levels, from developer to end customer, are involved in the metrics specification process.
1. What it's like inside Amazon's futuristic, automated store
2. A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets
3. Performance Evaluation of Object Tracking Algorithms
4. Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking
5. Inside Amazon’s surveillance-powered, no-checkout convenience store
6. On Calibration of Modern Neural Networks
7. Discriminating algorithms: 5 times AI showed prejudice
8. The Best Algorithms Struggle to Recognize Black Faces Equally