How to Lie About Model Accuracy
Daniel Morton PhD
I'm still looking at the Larch Casebearer data. At this point I've produced four models that are at least close to best for their model types: one each for YOLOv10 nano, YOLOv10 balanced, YOLO11 nano, and YOLO11 large. Despite the size differences (10b and 11l have about 10 times the weights of their respective nano models), their performance is pretty consistent. COCO mAP is slightly below 50 for the nano models and a bit above for the larger ones. I think, given the homogeneity of the data and the similarity of the detection classes, the small models can capture most of the information.
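If you want to check that size gap yourself, here's a minimal sketch using the ultralytics package. The checkpoint names below are the standard pretrained files, not my fine-tuned weights, which load the same way.

```python
from ultralytics import YOLO

# Standard pretrained checkpoint names; fine-tuned weights load the same way.
for name in ["yolov10n.pt", "yolov10b.pt", "yolo11n.pt", "yolo11l.pt"]:
    model = YOLO(name)
    n_params = sum(p.numel() for p in model.model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```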
But that's not what I want to talk about.
When I switched from evaluating the validation set to evaluating the test set, I had a rude shock. I'll focus on the YOLO11n results, but the numbers were comparable across the board.
(Excuse the selectors in the Recall column. I can only be bothered to do this screenshot-and-paste stuff so many times.)
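Mechanically, the switch is just a different split argument. Here's a minimal sketch with the ultralytics API, assuming the model was trained through ultralytics; the weights and data yaml paths are placeholders for my actual setup.

```python
from ultralytics import YOLO

# Placeholder paths: swap in the actual fine-tuned weights and dataset yaml.
model = YOLO("runs/detect/train/weights/best.pt")

# Same model, same data yaml; only the split changes.
val_metrics = model.val(data="larch.yaml", split="val")
test_metrics = model.val(data="larch.yaml", split="test")

print("val mAP50-95: ", val_metrics.box.map)
print("test mAP50-95:", test_metrics.box.map)
```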
mAP for Healthy trees dropped about 10 points from the validation set to the test set. It is the smallest class, but for exactly that reason a major drop was concerning. The global mAP is an unweighted average over classes, so it took a slight tumble as well.
What could have happened? I had stratified the training, validation, and test data by location, but otherwise the split was random. The ratio of instances for each class was reasonably consistent across splits, so that wasn't likely to be the problem. The validation set is not used directly in training, but validation accuracy is used to pick the best model. Could that cause some sort of overfitting? At this point I had a brainwave. The validation and test sets were the same size. Why not just switch them?
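For context, the split itself was nothing fancy; something along these lines. This is a sketch: the `images_by_location` dict is a hypothetical stand-in for my actual file lists, and the 80/10/10 fractions are placeholders (though val and test were the same size).

```python
import random

# Hypothetical input: {location_name: [image paths]} for the Larch Casebearer imagery.
images_by_location = {
    "site_a": ["site_a/img_001.jpg", "site_a/img_002.jpg"],  # ... and so on
    "site_b": ["site_b/img_001.jpg", "site_b/img_002.jpg"],
}

splits = {"train": [], "val": [], "test": []}
random.seed(0)

# Stratify by location: carve each location's images into 80/10/10 independently,
# so every site is represented in all three splits.
for location, images in images_by_location.items():
    images = images[:]
    random.shuffle(images)
    n = len(images)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    splits["train"] += images[:n_train]
    splits["val"] += images[n_train:n_train + n_val]
    splits["test"] += images[n_train + n_val:]
```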
The results were something of a relief. The model hadn't overfit on the validation data; by dumb luck, detection on the validation data was simply easier than it was on the test data. Accuracy on both the validation and test sets was about the same as it had been before the switch.
This got me thinking about the way I had split the data. Should I have paid more attention to how the data had been partitioned? It's hard to keep classes evenly distributed in an object detection problem, and I'm not sure I could have done better intentionally.
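One thing that is easy to do is check how the class balance actually came out. A quick sketch, assuming YOLO-format label files (one .txt per image, each line starting with a class index); the directory paths are placeholders.

```python
from collections import Counter
from pathlib import Path

# Placeholder paths to YOLO-format label directories for each split.
label_dirs = {"train": Path("labels/train"),
              "val": Path("labels/val"),
              "test": Path("labels/test")}

for split, label_dir in label_dirs.items():
    counts = Counter()
    for label_file in label_dir.glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    total = sum(counts.values())
    # Fraction of instances per class within this split.
    fractions = {cls: round(n / total, 3) for cls, n in sorted(counts.items())}
    print(split, fractions)
```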
Then I had another thought. Suppose I were some unscrupulous model developer. If I wanted to make my model accuracy look as good as possible, I could easily reshuffle the train/val/test partitions until I had the best-looking test accuracy. Or I could even just swap val and test until I got answers I liked. In this case that would be to the tune of +2.5 mAP, which is about the size of the improvement often reported for a new model framework.
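To make the cherry-picking concrete, here's a toy simulation (made-up per-image scores, not the actual Larch Casebearer results): split the same held-out pool into "val" and "test" under different seeds, then report only the seed where the test half looks best.

```python
import random
import statistics

random.seed(42)

# Toy stand-in for per-image quality scores on a held-out pool;
# the real numbers would come from evaluating a trained detector.
heldout_scores = [random.gauss(0.48, 0.15) for _ in range(400)]

best_seed, best_test = None, -1.0
for seed in range(20):
    rng = random.Random(seed)
    shuffled = heldout_scores[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    val_half, test_half = shuffled[:half], shuffled[half:]
    test_mean = statistics.mean(test_half)
    # Keep whichever reshuffle makes the "test" half look best.
    if test_mean > best_test:
        best_seed, best_test = seed, test_mean

print(f"Honest estimate (whole pool): {statistics.mean(heldout_scores):.3f}")
print(f"'Reported' test score (seed {best_seed}): {best_test:.3f}")
```

With enough reshuffles you can buy yourself a visible bump in the reported metric without touching the model at all.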
It's a good thing we don't have any unscrupulous model developers. Isn't it?