ML: Examining the Test Set
I recently saw a post where someone said “Never touch your test set.” The theory was that you (as the algorithm designer) are part of the training algorithm, so by looking at your test set rather than just its final performance numbers, you are contaminating it. That may work academically, but it doesn’t work for shipping a Machine Learning (ML) customer experience: it doesn’t allow you to do proper failure analysis, it ignores real-world feedback, and it doesn’t allow you to clean your test set.
Failure Analysis
When I started on Face ID, the policy was that only QA could look at the test set. They were following this idea of separating training from testing, but it started to break down as we algorithm engineers needed to look at the data ourselves.
First, it broke down when QA explained where the algorithm had failed on the test set. Someone then had to fix the issue, which generally required either looking at the test set or collecting more data. Data collection can be a laborious process, and tons of data was already being collected to cover all the bases, so it was easier and faster to examine the failure cases on hand.
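To make that concrete, here is a minimal sketch of the kind of failure review this enables. The file layout, sample schema, and `model.predict` interface are assumptions for illustration, not the actual Face ID tooling:

```python
import json
import shutil
from pathlib import Path

def export_failures(model, test_samples, out_dir="failure_review"):
    """Copy misclassified test samples into a folder so an engineer can
    eyeball them directly instead of launching a new data collection."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    failures = []
    for sample in test_samples:  # assumed schema: {"id", "image_path", "label"}
        pred = model.predict(sample["image_path"])  # assumed model interface
        if pred != sample["label"]:
            shutil.copy(sample["image_path"], out / f"{sample['id']}.png")
            failures.append({"id": sample["id"],
                             "label": sample["label"],
                             "pred": pred})
    # Keep a machine-readable summary alongside the images for triage.
    (out / "failures.json").write_text(json.dumps(failures, indent=2))
    return failures
```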
Soon enough, the policy changed because it didn’t make sense anymore. We were out of the academic world and into the real world of trying to ship products. This sped up the development cycle and allowed us to ship the biggest ML feature at the time to a handheld device.
Real World Feedback
If you don’t want to touch your test data, then don’t collect any customer feedback or logs. Don’t read any reviews of your feature. That could potentially impact how you train your algorithm.
The reality is that all ML features have training, validation, and test data sets collected in more controlled environments, but then you have actual data from the field. That field information comes at several levels: full images, metadata only, statistics aggregated over large customer bases, and/or internal and external surveys.
Each piece of this feedback informs how to better train the algorithm, through policy changes or, more likely, data collection and retraining. If a survey surfaces an issue, you can try to reproduce it. If metadata indicates an issue, you can dig back into older user studies or commission new ones. Whatever the case, all of this data is part of testing how the algorithm generalizes to the whole world.
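As a rough sketch of what the metadata-only level can look like, assuming anonymized field records arrive with a device model and a pass/fail flag (the field names here are made up):

```python
from collections import Counter

def failure_rate_by_segment(field_records, segment_key="device_model"):
    """Aggregate anonymized field metadata (hypothetical schema) to see
    which customer segments the algorithm struggles with."""
    totals = Counter()
    failures = Counter()
    for rec in field_records:  # assumed: {"device_model": str, "unlock_success": bool}
        seg = rec[segment_key]
        totals[seg] += 1
        if not rec["unlock_success"]:
            failures[seg] += 1
    return {seg: failures[seg] / totals[seg] for seg in totals}
```

A segment with an outsized failure rate is a hint about where to reproduce the issue internally or where to target the next data collection.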
Cleaning Data
If you don’t look at your test set, how do you find the cases that are mislabeled? I don’t think the intent of the post was to say don’t clean your dataset, but I want to be absolutely clear that labeling review and data cleaning have to happen for all of your data, test set included. That is how you gain confidence in the performance numbers.
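One common way to surface label problems is to flag test samples where the model confidently disagrees with the label and send them for human review. This is a sketch under the same assumed sample schema as above; `predict_with_confidence` is a hypothetical model interface:

```python
def flag_possible_mislabels(model, test_samples, confidence_threshold=0.95):
    """Flag test samples where the model confidently disagrees with the label.
    A human reviews these -- the model is not automatically assumed correct."""
    suspects = []
    for sample in test_samples:  # assumed schema: {"id", "image_path", "label"}
        pred_label, confidence = model.predict_with_confidence(sample["image_path"])
        if pred_label != sample["label"] and confidence >= confidence_threshold:
            suspects.append((sample["id"], sample["label"], pred_label, confidence))
    return suspects
```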
Don’t train on your test set
I wouldn’t say don’t touch your test data; you should look at it unless you want your algorithm to suck in the real world. Just don’t train on it.
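One lightweight way to honor “look, but don’t train” is to make the split deterministic, so a test example can never silently drift into training when data gets recollected or reshuffled. A minimal sketch (the ID scheme is invented for illustration):

```python
import hashlib

def split_bucket(example_id, test_fraction=0.1):
    """Deterministically assign an example to 'train' or 'test' by hashing its ID,
    so the same example always lands in the same split across retraining runs."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000
    return "test" if bucket < test_fraction * 1000 else "train"

# Usage: only examples that hash to "train" ever reach the training loop.
all_example_ids = ["subj_0001_frame_03", "subj_0002_frame_11"]  # hypothetical IDs
train_ids = [i for i in all_example_ids if split_bucket(i) == "train"]
```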
---------------------
If you like, follow me on Twitter and YouTube, where I post videos of espresso shots on different machines and other espresso-related stuff. You can also find me on LinkedIn.