How to Lie About Model Accuracy

I'm still looking at the Larch Casebearer data. At this point I've produced four models that are at least close to best for their model types: one each for YOLOv10 nano, YOLOv10 balanced, YOLO11 nano, and YOLO11 large. Despite the size differences (10b and 11l have about 10 times the weights of their respective nano models), their performance is pretty consistent. COCO mAP is slightly below 50 for the nano models and a bit above for the larger ones. I think, given the homogeneity of the data and the similarity of the detection classes, the small models can capture most of the information.

But that's not what I want to talk about.

When I switched from evaluating the validation set to evaluating the test set, I had a rude shock. I'll focus on the YOLO11n results, but the numbers were comparable across the board.


YOLO11n Val Accuracy

YOLO11n Test Accuracy

(Excuse the selectors in the Recall column. I can only be bothered to do this screenshot-and-paste stuff so many times.)

mAP for Healthy trees dropped about 10 points from the validation to the test set. It is the smallest class, but for exactly that reason a major drop was concerning. The global mAP is an unweighted average, so it took a slight tumble as well.
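
Since the global figure is just the unweighted mean of the per-class APs, the arithmetic is easy to sketch. The class names and numbers below are placeholders for illustration (only Healthy comes from the discussion above), not the real results:

```python
# Hypothetical per-class APs -- placeholder names and values for illustration only.
val_ap  = {"Healthy": 0.52, "LightDamage": 0.55, "HighDamage": 0.58, "Other": 0.45}
test_ap = {**val_ap, "Healthy": 0.42}   # only the smallest class drops ~10 points

def macro_map(ap_by_class):
    """Unweighted (macro) mean: every class counts equally, regardless of size."""
    return sum(ap_by_class.values()) / len(ap_by_class)

print(f"val  mAP: {macro_map(val_ap):.3f}")   # 0.525
print(f"test mAP: {macro_map(test_ap):.3f}")  # 0.500 -- a 10-point drop in one of
                                              # four classes costs 2.5 points overall
```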

What could have happened? I had stratified the training, validation, and test data by location, but otherwise the split was random. The ratio of instances for each class was reasonably consistent across the splits, so that wasn't likely to be the problem. The validation set is not used directly in training, but validation accuracy is used to pick the best model. Could that cause some sort of overfitting? At this point I had a brainwave. The validation and test sets were the same size. Why not just switch them?
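
For concreteness, the split was along these lines. This is a minimal sketch with made-up file names and site labels, not my actual preprocessing code:

```python
from sklearn.model_selection import train_test_split

# Made-up file names and site labels standing in for the real annotations.
image_paths = [f"img_{i:04d}.jpg" for i in range(1000)]
locations   = [f"site_{i % 5}" for i in range(1000)]   # five dummy survey sites

# 70/15/15 split, stratified on location so every site appears in every
# partition in roughly the same proportion; which images land where is random.
train_imgs, rest_imgs, _, rest_loc = train_test_split(
    image_paths, locations, test_size=0.30, stratify=locations, random_state=42
)
# Splitting the remainder 50/50 also leaves val and test the same size,
# which is what makes swapping them later so easy.
val_imgs, test_imgs = train_test_split(
    rest_imgs, test_size=0.50, stratify=rest_loc, random_state=42
)
```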

The results were something of a relief. The model hadn't overfit on the validation data; by dumb luck, detection on the validation data was simply easier than it was on the test data. Accuracy on each dataset was about the same as it had been before the switch.


YOLO11n on the old Test/new Val set

YOLO11n on the old Val/new Test set
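
For what it's worth, the swap itself doesn't require moving any files: a second data YAML whose val and test entries point at the opposite directories does the job. The sketch below leans on Ultralytics' val() with its split argument; the paths and weight file are placeholders, not my actual setup:

```python
from ultralytics import YOLO

# data_swapped.yaml points the two roles at the opposite directories
# (all paths here are placeholders):
#   path:  larch_casebearer
#   train: images/train
#   val:   images/test    # the old test set now plays the role of val
#   test:  images/val     # the old val set now plays the role of test
model = YOLO("runs/detect/train/weights/best.pt")

old_test = model.val(data="data_swapped.yaml", split="val")    # old test images
old_val  = model.val(data="data_swapped.yaml", split="test")   # old val images
print(old_test.box.map, old_val.box.map)                       # mAP50-95 for each
```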

This got me thinking about the way I had split the data. Should I have paid more attention to how it had been partitioned? It's hard to keep classes evenly distributed across splits in an object detection problem, and I'm not sure I could have done better intentionally.

Then I had another thought. Suppose I were some unscrupulous model developer. If I wanted to make my model's accuracy look as good as possible, I could easily reshuffle the train/val/test partitions until I had the best-looking test accuracy. Or I could just swap val and test until I got answers I liked. In this case that would be worth about +2.5 mAP, which is often the entire improvement reported for a new model framework.
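
To put a rough number on how far persistence plus dumb luck can get you, here's a toy simulation of that seed-shopping. Every reshuffle draws a slightly different test mAP purely by chance, and reporting the best of twenty looks a lot like a genuine improvement. The numbers are simulated, not taken from the YOLO runs above:

```python
import random

def test_map_for_split(seed: int) -> float:
    """Pretend to re-split and evaluate with this seed; in reality just draw
    from a distribution centred on a 'true' mAP of 48 with ~1.5 points of
    split-to-split noise. Purely illustrative numbers."""
    return random.Random(seed).gauss(48.0, 1.5)

honest  = test_map_for_split(0)                          # report the first split
shopped = max(test_map_for_split(s) for s in range(20))  # reshuffle, report the best

print(f"first split:       {honest:.1f} mAP")
print(f"best of 20 splits: {shopped:.1f} mAP")   # typically ~2-3 mAP higher
```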

It's a good thing we don't have any unscrupulous model developers. Isn't it?
