ML: Examining the Test Set

I recently saw a post where someone said “Never touch your test set.” The theory was that you (as the algorithm designer) are part of the training algorithm, so by looking at the contents of your test set rather than only the final performance number, you are contaminating your test set. While that may work academically, it doesn’t work for shipping a Machine Learning (ML) customer experience because it prevents proper failure analysis, ignores real-world feedback, and keeps you from cleaning your test set.

Failure Analysis 

When I started on Face ID, the policy was that only QA could look at the test set. They were following this idea of separation of training and testing, but this idea started to break down as we algorithm engineers needed to look at the data ourselves. 

First, it broke down when QA explained where the algorithm had failed on the test set. Someone then had to fix the issue, which generally required looking at the test set or collecting more data. Data collection can be laborious, and a large amount of data was already being collected at the time to cover all the bases, so it was easier and faster to examine the failure cases on hand.
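As a rough illustration of what examining the failure cases on hand can look like, here is a minimal sketch that pulls the misclassified test samples and counts them by capture condition. It is not the process we used on Face ID; the `Result` record, the `condition` tag, and the sample values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Result:
    sample_id: str
    label: str       # ground-truth label
    prediction: str  # model output
    condition: str   # hypothetical metadata tag, e.g. lighting


def failures(results):
    """Return the test results where the prediction disagrees with the label."""
    return [r for r in results if r.prediction != r.label]


def failure_counts_by_condition(results):
    """Count failures per condition so the worst slices surface first."""
    return Counter(r.condition for r in failures(results))


# Hypothetical test-set results, for illustration only.
results = [
    Result("a1", "match", "match", "indoor"),
    Result("a2", "match", "no_match", "low_light"),
    Result("a3", "no_match", "no_match", "indoor"),
    Result("a4", "match", "no_match", "low_light"),
    Result("a5", "no_match", "match", "sunglasses"),
]

counts = failure_counts_by_condition(results)
for condition, n in counts.most_common():
    print(condition, n)
```

A table like this tells you where to look first, and whether the fix is likely to be more data for one condition or a change to the algorithm.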

Soon enough, the policy changed because it didn’t make sense anymore. We were out of the academic world and into the real world of trying to ship products. The change sped up the development cycle and allowed us to ship the biggest ML feature at the time to a handheld device.

Real World Feedback

If you don’t want to touch your test data, then don’t collect any customer feedback or logs. Don’t read any reviews of your feature. That could potentially impact how you train your algorithm.

The reality is that all ML features have training, validation, and testing data sets that are collected in more controlled environments, but then you also have actual data from the field. This field information can come at a few levels: all the images, metadata only, statistical information over large customer bases, and/or internal/external surveys.

Each piece of this feedback informs how to better train the algorithm, through policy or, more likely, data collection and retraining. If a survey reports an issue, you can reproduce it. If metadata indicates an issue, you can dig back into older user studies or commission new ones. Whatever the case, all this data is part of testing whether the algorithm generalizes to the whole world.

Cleaning Data

If you don’t look at your test set, how do you find the cases that are mislabeled? I don’t think the intent of the post was to say don’t clean your dataset, but I want to be absolutely clear that labeling and data cleaning have to occur for all your data, the test set included. That is how one has confidence in the performance numbers.
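One common way to find candidates for relabeling is to flag test samples where the model disagrees with the label at high confidence, then have a person review them. The sketch below assumes each record is a hypothetical `(sample_id, label, predicted_label, confidence)` tuple; the 0.9 threshold is an arbitrary choice for illustration. It only builds a review queue and never changes a label on its own.

```python
def suspect_labels(records, threshold=0.9):
    """Return records where the model confidently disagrees with the label.

    records: iterable of (sample_id, label, predicted_label, confidence).
    The result is sorted with the most confident disagreements first.
    """
    flagged = [r for r in records if r[2] != r[1] and r[3] >= threshold]
    return sorted(flagged, key=lambda r: r[3], reverse=True)


# Hypothetical test-set records, for illustration only.
records = [
    ("t1", "cat", "cat", 0.98),
    ("t2", "dog", "cat", 0.97),  # confident disagreement: worth a look
    ("t3", "cat", "dog", 0.55),  # low confidence: likely a hard case
    ("t4", "dog", "cat", 0.93),  # confident disagreement: worth a look
]

review_queue = suspect_labels(records)
for sample_id, label, predicted, confidence in review_queue:
    print(sample_id, label, predicted, confidence)
```

Some flagged samples will turn out to be real model failures rather than labeling errors, which makes the same queue useful for failure analysis.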

Don’t train on your test set

I wouldn’t say don’t touch your test data; you should look at your test data unless you want your algorithm to suck in the real world. Just don’t train on it. 
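One cheap guard against training on test data by accident is to check that no test sample also appears in the training set. This is a minimal sketch assuming each set is a hypothetical dict from sample id to raw bytes. It only catches exact byte-for-byte duplicates; near-duplicates (re-encoded or cropped images, for example) would need a different comparison.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to compare samples regardless of their ids."""
    return hashlib.sha256(data).hexdigest()


def leaked_ids(train, test):
    """Return the test sample ids whose content also appears in train."""
    train_hashes = {fingerprint(data) for data in train.values()}
    return sorted(
        test_id for test_id, data in test.items()
        if fingerprint(data) in train_hashes
    )


# Hypothetical samples, for illustration only.
train = {"train_1": b"image-bytes-A", "train_2": b"image-bytes-B"}
test = {"test_1": b"image-bytes-C", "test_2": b"image-bytes-B"}

leaked = leaked_ids(train, test)
print(leaked)
```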

---------------------

If you like, follow me on Twitter and YouTube, where I post videos of espresso shots on different machines and other espresso-related stuff. You can also find me on LinkedIn.

Further readings of mine:

Data Science: Essentials

Abandon Ship: How a startup went under

Dissertation Regret

Part of the Team

How to Interview a Company

Thoughts on Leaving

A Day in the Life of a Data Scientist

Design of Experiment: Data Collection


Judy Dayhoff

Ph.D., Mathematical Biophysics / Deep Learning, Neural Networks, Artificial Intelligence, Machine Learning, Data Science

5 years ago

But if you made decisions about your model based on your first test set's performance, then you would, ideally, need a new test set for the next changes in the model.

Blaine Bateman, EAF

Chief Data Scientist at EAF LLC

5 years ago

The admonition is regarding using test set performance as a measure of performance on unseen data. Otherwise there’s no reason not to consider it. But keep in mind that you will need a NEW test set to evaluate any updated method on unseen data.
