Design of Experiment: Data Collection
Anyone can collect data; some people can collect good data. The key theme of any good data collection is compliance. Good compliance leads to good data, and it shows up at every level of the process, so let's examine each level in turn.
Stages of a Data Collection:
Each stage requires compliance. During the pre-study tasks, people check that the different components of the pipeline comply with the protocol. At Ok2Study, everyone signs off that what has been done, and what people have committed to doing, complies with what they originally wanted.
Compliance List:
Collection Software Compliance
This is simply making sure the software will not have issues during collection. It seems simple, but bugs can cause data loss, and data loss is money. Running all study software through QA is key to catching those bugs before they cost you data.
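Even a lightweight automated smoke test helps here. Below is a minimal sketch; record_session and the output file layout are hypothetical stand-ins for whatever your collection app actually does. The point is to assert on the output artifacts, not just on a crash-free run.

    # Smoke-test sketch for a collection app. `record_session` is a
    # hypothetical stand-in for the app's real capture entry point.
    import os
    import tempfile

    def record_session(out_dir, duration_s):
        # Placeholder: the real app would capture sensor data here.
        with open(os.path.join(out_dir, "frames.bin"), "wb") as f:
            f.write(b"\x00" * 1024)

    def smoke_test():
        with tempfile.TemporaryDirectory() as out_dir:
            record_session(out_dir, duration_s=5.0)
            path = os.path.join(out_dir, "frames.bin")
            # A crash-free run that silently drops frames still loses
            # money, so check the artifact, not just the exit code.
            assert os.path.exists(path), "no output file written"
            assert os.path.getsize(path) > 0, "output file is empty"
            print("smoke test passed")

    smoke_test()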
Infrastructure Compliance
As part of the dry-run, you make sure the data goes all the way through the infrastructure so that you know it is saving properly. This is also a good time to remove unforeseen bottlenecks. When collecting gigabytes or terabytes of data per day, being able to ingest that data becomes just as important as collecting it.
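A quick back-of-the-envelope calculation makes the bottleneck concrete. The numbers in this sketch are illustrative, not from any particular study:

    # Back-of-the-envelope ingest check with illustrative numbers.
    TB = 1e12  # bytes

    daily_volume_bytes = 2 * TB    # assume 2 TB collected per day
    collection_hours = 8           # rigs only run during the workday
    ingest_window_hours = 24       # but ingest can also run overnight

    sustained = daily_volume_bytes / (ingest_window_hours * 3600) / 1e6
    burst = daily_volume_bytes / (collection_hours * 3600) / 1e6

    print(f"sustained ingest needed: {sustained:.1f} MB/s")  # ~23.1 MB/s
    print(f"burst if ingested live:  {burst:.1f} MB/s")      # ~69.4 MB/s

If your pipeline can only sustain the first number, data must queue somewhere during the day, and that buffer is another thing to verify in the dry-run.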
Hardware Compliance
Does the hardware collect what you want? Is it close? What are the caveats? Will there be any heating issues, charging issues, or issues in countries with different voltages/phases for their electricity?
Hardware Calibration Compliance
At Notre Dame, I didn't calibrate my setup every time because calibration took a long time. Some data paid the price later, especially after someone accidentally kicked my setup and I had to fix everything. Calibration had two steps: taking checkerboard images of all the image planes and ensuring the two laser planes were aligned. I always did the first step, but I didn't do the second step as often, and that was the step that had issues. There was a quick, iterative check I could do with a ball to see how well the two light screens were calibrated to one another, but I didn't figure that trick out until the end of my collection.
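The checkerboard step itself follows the standard OpenCV calibration flow, which conveniently gives you one number, the reprojection error, to track between sessions. This is only a sketch of that flow; the board size and image directory are assumptions, and the laser-plane alignment check was a separate step not shown here.

    # Checkerboard calibration sketch using OpenCV; board size and
    # image paths are assumptions.
    import glob
    import cv2
    import numpy as np

    BOARD = (9, 6)  # inner corners per row/column of the checkerboard
    objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2)

    obj_points, img_points = [], []
    for path in glob.glob("calib_images/*.png"):  # hypothetical directory
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, BOARD)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    # Reprojection error is the number to watch: if it jumps between
    # sessions (say, after someone kicks the rig), recalibrate before
    # collecting more data.
    rms, K, dist, _, _ = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None)
    print(f"RMS reprojection error: {rms:.3f} px")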
Safety Compliance
If your rig could cause health issues, you have a problem. In grad school, my setup used Class 3R lasers spread out into a line using 5mW of power. The effective class was probably lower because the beam was spread with a beam spreader, but I never had it tested. At the time, the safety office wasn't concerned unless a laser was over 5mW. That changed right after I graduated, but laser safety is a constant issue with prototype hardware because the mechanisms that keep the lasers safe are usually still in development alongside the hardware itself.
Your protocol also shouldn’t put anyone’s health at risk. Even if you’re willing to take a chance, there is a large liability if someone gets hurt.
Legal Compliance
Some countries allow data like face images to be collected, but the laws vary. The aim is to not be in any gray zone about data. Face images are considered Personally Identifiable Information (PII), and in the past few years, especially since GDPR, governments have paid particular attention to privacy.
China, for example, doesn't allow PII data to be exported. In the US, you can generally collect PII data in public, but in Europe, you cannot. Unlike in the US, in France and Germany an employee is not considered able to consent to a user study run by their employer that collects PII data, because the employee/employer relationship is itself a form of coercion.
Usually, there is some compensation for a user study, but keep in mind, too much compensation could also be seen as financial coercion.
Recruitment Compliance
Recruitment is quite an interesting bit of the process. You can’t discriminate, nor do you want to for a good dataset, but you also have to pay attention to special groups like pregnant women, the elderly, and young children.
On your end, though, you should also check that the demographics you are looking for are actually being collected. If they aren't, your dataset will be deficient. Don't assume your algorithm is age-, ethnicity-, or gender-invariant; always check.
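A running tally against your recruitment targets makes deficiencies visible while you can still fix them. A minimal sketch, with made-up age bands and quotas:

    # Rolling demographics check; bands and quotas below are made up,
    # so pull the real ones from your recruitment plan.
    from collections import Counter

    TARGETS = {"18-29": 25, "30-49": 35, "50-64": 25, "65+": 15}  # percent

    def coverage_report(participants):
        counts = Counter(p["age_band"] for p in participants)
        total = sum(counts.values())
        for band, target in TARGETS.items():
            pct = 100.0 * counts.get(band, 0) / total if total else 0.0
            flag = "  <-- under target" if pct < target else ""
            print(f"{band:>6}: {pct:5.1f}% (target {target}%){flag}")

    coverage_report([
        {"age_band": "18-29"}, {"age_band": "18-29"},
        {"age_band": "30-49"}, {"age_band": "65+"},
    ])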
Dry-Run Compliance
The dry-run is key because it gives you a chance not only to check for operator compliance but also to make corrections to the protocol where it didn't quite account for the subject or the actual data being collected. This is usually when the protocol gets tweaked quite a bit.
In my first month at Apple, I designed my first user study for wrist detection. I felt pressure and signed off on the dry run before I had fully looked at all the data. As soon as I did, I realized one of the settings in the app was off, and the app had to go through another revision before the study could continue. I learned not to rush; otherwise, I sacrifice quality.
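A mechanical settings diff against the protocol spec is cheap insurance before sign-off. A sketch, with hypothetical keys and values; the real check should read the app's actual config:

    # Diff app settings against the protocol's expected values before
    # signing off a dry run. Keys and values here are hypothetical.
    EXPECTED = {"frame_rate": 30, "exposure_mode": "manual",
                "depth_enabled": True}

    def diff_settings(actual):
        return [f"{k}: expected {v!r}, got {actual.get(k)!r}"
                for k, v in EXPECTED.items() if actual.get(k) != v]

    for m in diff_settings({"frame_rate": 30, "exposure_mode": "auto",
                            "depth_enabled": True}):
        print(m)  # exposure_mode: expected 'manual', got 'auto'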
Moderator Compliance
The other key is moderator compliance with the protocol, because if the moderator can't comply, neither can the subject. I've seen this happen both ways. In one study at Apple, I found out that half the data had the wrong label because one of the two moderators swapped the order of collection. No data was lost, but I spent a fair bit of time validating data and fixing labels until I had a complete dataset I could trust.
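Swapped-order mistakes are hard to catch from metadata alone, since the labels come from the very order that was wrong, so periodic human spot checks are a practical guard. One approach, sketched below with made-up session IDs, is to sample sessions per moderator for review, since protocol mistakes tend to be systematic to one person:

    # Spot-check sampler: pull a few sessions per moderator for human
    # review. Session IDs and the sample size are illustrative.
    import random

    def sample_for_review(sessions, per_moderator=3, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the audit repeatable
        by_mod = {}
        for s in sessions:
            by_mod.setdefault(s["moderator"], []).append(s["id"])
        return {mod: rng.sample(ids, min(per_moderator, len(ids)))
                for mod, ids in by_mod.items()}

    sessions = [{"id": f"S{i:03d}", "moderator": "A" if i % 2 else "B"}
                for i in range(40)]
    print(sample_for_review(sessions))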
At Notre Dame, I didn't think through or pay attention to the dry-run data for my user studies with people walking through my 3D scanner. During the collection, participants started off standing and then began walking. What I didn't consider is that people open their mouths slightly when they start to walk. It is unintentional, but I definitely didn't consider that people might want to breathe while walking. People also walked through too quickly. In both cases, I tried things like taking multiple scans or giving people directions, but they were not as effective as I had hoped.
Participant Compliance
When doing data collection at Digital Signal Corp, we wanted to capture different head poses. In the dry run, it appeared people were only moving their eyes to the appropriate marker and not their heads, as that is the more natural human motion. The fix was to put down foot markers and ask people to move their feet to that position. This greatly reduced the errors and made the collection more intuitive for the participants.
At Notre Dame, I had issues with my data because people walked through too quickly. I needed them to go through at less than 1/4th of full walking pace (3 mph), but most went through too fast. I didn't think far enough outside the box to figure out how to get people to slow down naturally, but a sign, perhaps a blinking one, might have helped.
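Pace is also something you can flag automatically during collection. A hedged sketch that turns the protocol's limit into a speed check; the scanner length and timestamps are assumptions:

    # Flag too-fast walkthroughs from entry/exit timestamps. The 3 mph
    # baseline is from the protocol; the capture length is an assumption.
    MPH_TO_MS = 0.44704
    MAX_SPEED = 3.0 * MPH_TO_MS / 4   # quarter walking pace, ~0.34 m/s
    SCAN_LENGTH_M = 2.0               # assumed length of capture volume

    def check_pass(t_enter, t_exit):
        speed = SCAN_LENGTH_M / (t_exit - t_enter)
        if speed > MAX_SPEED:
            print(f"too fast: {speed:.2f} m/s "
                  f"(limit {MAX_SPEED:.2f} m/s), ask for a redo")
            return False
        return True

    check_pass(t_enter=0.0, t_exit=2.5)  # 0.80 m/s -> flagged
    check_pass(t_enter=0.0, t_exit=7.0)  # ~0.29 m/s -> ok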
Random Compliance Checks
During the lead-up to the launch of Face ID, my team went out and collected a large set of potential aggressors to see if we were missing anything in our larger data collections, things that would be normal to a regular user.
We used a script that cycled through a bunch of settings, and a month into our data exploration, the firmware was updated. We noticed some strange issues, and I'm not sure why we didn't correlate them to the firmware update, but a few weeks later, I filed a bug because it was clear some of the data was way outside of expectation. It was a condition one would get into only by cycling through the settings as we were doing.
The bug was caught by a few others at about the same time, and it was fixed a week later. For us, it meant the loss of a full week's worth of data, but half of the past month's data was salvageable. The bug would have been caught much earlier if we had been doing regular compliance checks during our data collections.
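Regular compliance checking can be as simple as comparing each day's data against a dry-run baseline. A sketch with a made-up metric and thresholds; a firmware-sized shift should trip a check like this within a day, not a month:

    # Daily drift check against a dry-run baseline. The baseline numbers,
    # metric, and sigma threshold here are all hypothetical.
    import statistics

    BASELINE_MEAN, BASELINE_STD = 0.52, 0.06

    def daily_check(values, n_sigma=4.0):
        day_mean = statistics.fmean(values)
        drift = abs(day_mean - BASELINE_MEAN) / BASELINE_STD
        if drift > n_sigma:
            print(f"ALERT: daily mean {day_mean:.3f} is {drift:.1f} sigma "
                  f"from baseline; check firmware/settings first")
            return False
        return True

    daily_check([0.51, 0.55, 0.49, 0.53])  # within baseline -> ok
    daily_check([0.90, 0.88, 0.93])        # fires the alert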
Finally, the Data I’ve always wanted!
Usually, you don't know the exact data you need on the first data collection. Data collection is part of an iterative loop of algorithm requirements, data collection, and failure analysis.