How to Lie About Model Accuracy
Daniel Morton PhD
I'm still looking at the Larch Casebearer data. At this point I've produced four models that are at least close to best for their model types: one each for YOLOv10 nano, YOLOv10 balanced, YOLO11 nano, and YOLO11 large. Despite the size differences (10b and 11l have about 10 times the weights of their respective nano models), their performance is pretty consistent. COCO mAP is slightly below 50 for the nano models and a bit above for the larger ones. I think, given the homogeneity of the data and the similarity of the detection classes, the small models can capture most of the information.
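If you want to check that size gap yourself, here's a minimal sketch using the ultralytics package. The checkpoint names below are the standard pretrained files, not my fine-tuned weights, which load the same way.

```python
from ultralytics import YOLO

# Standard pretrained checkpoint names; fine-tuned weights load the same way.
for name in ["yolov10n.pt", "yolov10b.pt", "yolo11n.pt", "yolo11l.pt"]:
    model = YOLO(name)
    n_params = sum(p.numel() for p in model.model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```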
But that's not what I want to talk about.
When I switched from evaluating the validation set to evaluating the test set, I had a rude shock. I'll focus on the YOLO11n results, but the numbers were comparable across the board.
(Excuse the selectors in the Recall column. I can only be bothered to do this screenshot-and-paste stuff so many times.)
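Mechanically, the switch is just a different split argument. Here's a minimal sketch with the ultralytics API, assuming the model was trained through ultralytics; the weights and data yaml paths are placeholders for my actual setup.

```python
from ultralytics import YOLO

# Placeholder paths: swap in the actual fine-tuned weights and dataset yaml.
model = YOLO("runs/detect/train/weights/best.pt")

# Same model, same data yaml; only the split changes.
val_metrics = model.val(data="larch.yaml", split="val")
test_metrics = model.val(data="larch.yaml", split="test")

print("val mAP50-95: ", val_metrics.box.map)
print("test mAP50-95:", test_metrics.box.map)
```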
mAP for Healthy trees dropped about 10 points from the validation set to the test set. It is the smallest class, but for exactly that reason a major drop was concerning. The global mAP is an unweighted average over classes, so it took a slight tumble as well.
What could have happened? I had stratified the training, validation, and test data by location, but otherwise the split was random. The ratio of instances for each class was reasonably consistent across splits, so that wasn't likely to be the problem. The validation set is not used directly in training, but validation accuracy is used to pick the best model. Could that cause some sort of overfitting? At this point I had a brainwave. The validation and test sets were the same size. Why not just switch them?
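For context, the split itself was nothing fancy; something along these lines. This is a sketch: the `images_by_location` dict is a hypothetical stand-in for my actual file lists, and the 80/10/10 fractions are placeholders (though val and test were the same size).

```python
import random

# Hypothetical input: {location_name: [image paths]} for the Larch Casebearer imagery.
images_by_location = {
    "site_a": ["site_a/img_001.jpg", "site_a/img_002.jpg"],  # ... and so on
    "site_b": ["site_b/img_001.jpg", "site_b/img_002.jpg"],
}

splits = {"train": [], "val": [], "test": []}
random.seed(0)

# Stratify by location: carve each location's images into 80/10/10 independently,
# so every site is represented in all three splits.
for location, images in images_by_location.items():
    images = images[:]
    random.shuffle(images)
    n = len(images)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    splits["train"] += images[:n_train]
    splits["val"] += images[n_train:n_train + n_val]
    splits["test"] += images[n_train + n_val:]
```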
The results were something of a relief. The model hadn't overfit on the validation data; by dumb luck, detection on the validation data was simply easier than it was on the test data. Accuracy on both the validation and test sets was about the same as it had been before the switch.
This got me thinking about the way I had split the data. Should I have paid more attention to how the data had been partitioned? It's hard to keep classes evenly distributed in an object detection problem, and I'm not sure I could have done better intentionally.
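One thing that is easy to do is check how the class balance actually came out. A quick sketch, assuming YOLO-format label files (one .txt per image, each line starting with a class index); the directory paths are placeholders.

```python
from collections import Counter
from pathlib import Path

# Placeholder paths to YOLO-format label directories for each split.
label_dirs = {"train": Path("labels/train"),
              "val": Path("labels/val"),
              "test": Path("labels/test")}

for split, label_dir in label_dirs.items():
    counts = Counter()
    for label_file in label_dir.glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    total = sum(counts.values())
    # Fraction of instances per class within this split.
    fractions = {cls: round(n / total, 3) for cls, n in sorted(counts.items())}
    print(split, fractions)
```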
Then I had another thought. Suppose I were some unscrupulous model developer. If I wanted to make my model accuracy look as good as possible, I could easily reshuffle the train/val/test partitions until I had the best-looking test accuracy. Or I could even just swap val and test until I got answers I liked. In this case that would be to the tune of +2.5 mAP, which is about the size of the improvement often reported for a new model framework.
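To make the cherry-picking concrete, here's a toy simulation (made-up per-image scores, not the actual Larch Casebearer results): split the same held-out pool into "val" and "test" under different seeds, then report only the seed where the test half looks best.

```python
import random
import statistics

random.seed(42)

# Toy stand-in for per-image quality scores on a held-out pool;
# the real numbers would come from evaluating a trained detector.
heldout_scores = [random.gauss(0.48, 0.15) for _ in range(400)]

best_seed, best_test = None, -1.0
for seed in range(20):
    rng = random.Random(seed)
    shuffled = heldout_scores[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    val_half, test_half = shuffled[:half], shuffled[half:]
    test_mean = statistics.mean(test_half)
    # Keep whichever reshuffle makes the "test" half look best.
    if test_mean > best_test:
        best_seed, best_test = seed, test_mean

print(f"Honest estimate (whole pool): {statistics.mean(heldout_scores):.3f}")
print(f"'Reported' test score (seed {best_seed}): {best_test:.3f}")
```

With enough reshuffles you can buy yourself a visible bump in the reported metric without touching the model at all.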
It's a good thing we don't have any unscrupulous model developers. Isn't it?