Two Issues About Object Detection Accuracy
Daniel Morton PhD
Senior Data Scientist | Builder | MS Analytics | PhD Mathematics | Machine Learning | Data Science | Deep Learning | Ad Tech
Object detection answers two questions simultaneously. What is it? And where is it? For the most part, computer vision models answer these questions independently. A typical CNN detection model predicts the likelihood that there is an object, and the dimensions of its bounding box, independently of its prediction of what the object is. Mostly independently, anyway. Both predictions use the same image and, for the most part, the outputs of the same convolutions. But they make up separate terms of the loss function, and neither the bounding box coordinates nor the object classification is an input for predicting the other. Classification and localization are correlated only to the extent that they are derived from the same features.
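To make that structure concrete, here is a rough schematic of a one-stage detector's loss. This is not YOLO11's actual formulation (which uses IoU-based and distribution-focal terms); the only point is that the localization and classification terms are computed separately and simply summed, and the weights below are arbitrary.

```python
# Schematic only: not YOLO11's real loss, just the typical structure of a
# one-stage detector loss. Box, objectness, and class terms are computed
# separately and summed; neither head's output feeds into the other's.
import torch.nn.functional as F

def detection_loss(pred_box, pred_obj, pred_cls, true_box, true_obj, true_cls,
                   w_box=7.5, w_obj=1.0, w_cls=0.5):   # arbitrary weights
    box_loss = F.l1_loss(pred_box, true_box)                            # where is it?
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj, true_obj)   # is it anything?
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, true_cls)   # what is it?
    # The only coupling: all three heads share the same backbone features.
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss
```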
In practice this means that most of the difficulty in object detection is on the classification side. Determining bounding box dimensions is the easy part. To demonstrate, we can run the same object detection model twice: once treating each class separately, and once treating all objects as a single class.
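One way to set up the single-class run is to collapse the labels before training. The sketch below rewrites YOLO-format label files so every object gets class id 0; the directory paths are placeholders for however the dataset happens to be laid out. (Ultralytics also has a single_cls training flag that, as I understand it, does the same thing without touching the files.)

```python
# Collapse all classes to a single class by rewriting YOLO-format label files.
# Each line of a label file is: class_id x_center y_center width height
from pathlib import Path

LABEL_DIR = Path("larch_casebearer/labels")        # placeholder path
OUT_DIR = Path("larch_casebearer/labels_1class")   # placeholder path
OUT_DIR.mkdir(parents=True, exist_ok=True)

for label_file in LABEL_DIR.glob("*.txt"):
    merged = []
    for line in label_file.read_text().splitlines():
        parts = line.split()
        if parts:
            parts[0] = "0"                         # every object becomes class 0
            merged.append(" ".join(parts))
    (OUT_DIR / label_file.name).write_text("\n".join(merged) + "\n")
```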
I'll use the Larch Case-bearer dataset I've been working with for a while. As a reminder, this is a collection of drone images of Swedish larch forests, many of which are unwilling hosts to a type of case-bearer moth larva. There are four classes: healthy larch trees, lightly damaged larch trees, heavily damaged larch trees, and other tree species. Most of the trees are lightly damaged larch trees.
To emphasize how good object detection is when object class is irrelevant, I'll use the smallest of the YOLO11 models, YOLO11n. I keep the default image size of 1500x1500, which, even with this small model, requires a batch size of 8. Augmentation consists of horizontal and vertical flips plus the few default Albumentations that I can't turn off (none of which, I think, help accuracy, but the model still does well enough). Mixup, in which two images are blended together, is applied with probability 0.3. Running the relevant notebook in Kaggle took about an hour and a half. The train/val/test split is 70/15/15, stratified across the different locations in Sweden.
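For reference, the training call looks roughly like this. The image size, batch size, and mixup probability are the settings described above; the dataset yaml name and epoch count are placeholders, and everything else is left at the Ultralytics defaults.

```python
# A minimal sketch of the single-class training run with Ultralytics YOLO11.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")          # smallest YOLO11 variant
model.train(
    data="larch_1class.yaml",       # placeholder dataset config (nc: 1)
    imgsz=1500,                     # keep the 1500x1500 images
    batch=8,
    epochs=100,                     # placeholder; not specified above
    fliplr=0.5,                     # horizontal flips
    flipud=0.5,                     # vertical flips
    mixup=0.3,                      # merge pairs of images with probability 0.3
)
```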
The result on the holdout test set is an mAP@50 of 96.3 and an mAP@50-95 of 64.6, which is probably about as close to perfect as I could reasonably expect, especially with a dataset averaging 68 objects per image.
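Those numbers come from validating the trained model on the test split, along the lines of the sketch below. The weights path and yaml name are placeholders, and split="test" assumes the dataset yaml defines a test set; in recent Ultralytics versions metrics.box.map50 and metrics.box.map correspond to mAP@50 and mAP@50-95.

```python
# Evaluate the trained single-class model on the held-out test split (sketch).
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")   # placeholder weights path
metrics = model.val(data="larch_1class.yaml", split="test", imgsz=1500)
print(f"mAP@50:    {metrics.box.map50:.3f}")
print(f"mAP@50-95: {metrics.box.map:.3f}")
```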
A typical scene with ground truth boxes is below.
And these are the model predictions.
The detections look very similar. The model output may even be an improvement. The annotator(s) (about whom I know nothing) regularly missed trees on the edge of the image, along with the occasional small tree in the interior. Of course, all the trees missed by the annotator and caught by the model count against mAP. A reminder, if you need it, that model accuracy metrics are guidelines, not gospel.
A couple more images illustrate the same point. Note the tree caught by the model on the other side of the road, as well as several trees missed by the annotator at the bottom of the scene.
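For anyone following along, the prediction overlays in these images are produced more or less like this; the image and weights paths are placeholders, and save=True writes annotated copies to a runs/ directory.

```python
# Generate annotated prediction images for a scene (sketch).
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")    # placeholder weights path
results = model.predict(source="scene_example.jpg",  # placeholder image path
                        imgsz=1500, conf=0.25, save=True)
print(f"{len(results[0].boxes)} trees detected")
```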
Both of the above images show a mix of healthy and lightly damaged trees. If we include heavily damaged trees and other species, the result is the same. Notice that, once again, the model picks out some trees (not larch, something else) that the annotator missed.
If anything, the mAP@50-95 of 64.6 is probably an understatement.
Now, what happens if we train the same model, YOLO11n, on the same dataset but keep the class labels?
The dominant Low Damage class has mAP numbers only slightly lower than the one-class model's. Precision drops for the other three classes, although it mostly remains in respectable territory. The only real weak spot is the Healthy category, many of which the model mislabels as Low Damage. Since this is by far the smallest category, that is to be expected.
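The per-class numbers behind that summary can be read straight off the validation results, roughly as below. The weights path and yaml name are placeholders; metrics.box.maps holds per-class mAP@50-95 in recent Ultralytics versions, and model.names maps class ids to labels.

```python
# Per-class mAP@50-95 for the four-class run (sketch).
from ultralytics import YOLO

model = YOLO("runs/detect/train2/weights/best.pt")  # placeholder weights path
metrics = model.val(data="larch_4class.yaml", split="test", imgsz=1500)
for class_id, class_map in enumerate(metrics.box.maps):
    print(f"{model.names[class_id]:>12}: mAP@50-95 = {class_map:.3f}")
```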
As with the single-class case, it's possible that the metrics aren't telling the whole story. Compare the "ground truth" to the predicted output here. Blue is healthy and cyan is low damage. (Not my choice; those are the YOLO defaults.)
I'm no expert on larch trees or tree diseases, but it is obvious that, as the larva numbers increase, more and more needles go brown. Some of the trees labeled low damage, especially those at the top of the image, look perfectly healthy to me. They look healthy to the model as well. This could be another case of the model improving on the "ground truth". Even in Sweden I expect this sort of annotation work is underfunded; the ground truth could be an overworked grad student's best guess. It seems possible that the mAP score undersells the model's performance.