Data projects: Are we doing science or engineering?
What do we mean by the ‘science’ part of data science?
Science is about generating and testing explanations.
The Safe & Sound Data Showdown is a challenge in which hundreds of data scientists are currently generating and testing explanations about what puts people at risk on mine sites.
Participants have access to lots of data from different parts of the business — there is data on past incidents from the incident management system, data about rosters and tenure from the HR system, and data about production volumes from the operations system.
What causes safety incidents?
A savvy data scientist will think about the different real world situations that generate these data points, and conjecture why they are related.
For example, incident data is fairly straightforward — an incident record represents one of the bad things we are interested in explaining (and preventing!). What causes them?
Production volumes might tell us about how busy the operation is. We might wonder if busier periods lead to more incidents.
HR data tells us how long people are at work. It’s reasonable to think that workers might be more tired, and consequently less attentive at the end of a shift. Are they more likely to be injured at the end of long shifts?
One participant hypothesized that workers who had been at the company longer might experience fewer incidents because of their greater experience.
He compared the data from the HR system on length of employment to the occurrences of incidents.
Bingo!
The data show that those who had been around longer were involved in fewer incidents.
So this proves that experienced workers are less likely to be injured, and the company could consider longer training periods, preferentially hiring experienced workers, and so on, right??… Right?
Not so fast.
In a great example of scientific rigor, the data scientist considered some alternative explanations for this correlation. For what other reasons might experienced workers be involved in fewer incidents?
Well, it could be that experienced workers get promoted into supervisory roles in which they spend less time on the tools — they would have less exposure to risky situations.
Now we can go back to the data to see if we can control for workers who have changed roles.
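To make that check concrete, here is a minimal sketch of what "controlling for role" could look like. The records, field layout, and five-year cutoff are all invented for illustration and do not come from the challenge data. The idea is to compare incidents per worker (the raw view) with incidents per 1,000 hours on the tools (the exposure-adjusted view), overall and within each role.

```python
from collections import defaultdict

# Made-up records for illustration only; not from any real HR or incident system.
# Each tuple: (tenure_years, role, hours_on_tools, incidents)
WORKERS = [
    (1, "operator", 2000, 2),
    (2, "operator", 2000, 2),
    (3, "operator", 2000, 2),
    (4, "supervisor", 1000, 1),
    (8, "operator", 2000, 2),
    (10, "supervisor", 1000, 1),
    (12, "supervisor", 1000, 1),
    (15, "supervisor", 1000, 1),
]


def tenure_group(years, cutoff=5):
    """Label a worker as 'newer' or 'experienced' (the cutoff is an arbitrary assumption)."""
    return "experienced" if years >= cutoff else "newer"


def summarize(workers, by_role=False):
    """Aggregate incidents by tenure group, optionally split further by role."""
    totals = defaultdict(lambda: {"workers": 0, "hours": 0, "incidents": 0})
    for years, role, hours, incidents in workers:
        key = (tenure_group(years), role) if by_role else tenure_group(years)
        totals[key]["workers"] += 1
        totals[key]["hours"] += hours
        totals[key]["incidents"] += incidents
    return {
        key: {
            # Raw view: ignores how much time each worker spends exposed to risk.
            "per_worker": t["incidents"] / t["workers"],
            # Exposure-adjusted view: incidents per 1,000 hours on the tools.
            "per_1000_hours": 1000 * t["incidents"] / t["hours"],
        }
        for key, t in totals.items()
    }


raw = summarize(WORKERS)
stratified = summarize(WORKERS, by_role=True)

for group in sorted(raw):
    print(group, raw[group])
for group_and_role in sorted(stratified):
    print(group_and_role, stratified[group_and_role])
```

In this toy data, newer workers average 1.75 incidents each against 1.25 for experienced workers, yet both groups show 1.0 incident per 1,000 hours on the tools, and so does every tenure and role combination. The apparent benefit of experience comes entirely from experienced workers spending fewer hours exposed to risk. Real data would rarely be this tidy, and a proper analysis would need to handle changing roles over time, but the comparison shows the shape of the question.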
Domain Expertise vs Data Expertise
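This process, and the alternative explanations it generated, requires two very different kinds of expertise: domain expertise, to know how work on a mine site actually happens (for example, that experienced workers move into supervisory roles), and data expertise, to test those ideas against the records.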
This makes challenges like the Safe & Sound Data Showdown far more like science than those in which participants compete to build the best machine learning models given a prior explanation.
Science vs Engineering
In “Looks Like Grain” data scientists competed to build the best computer vision models for identifying and classifying defects in grain.
The objective wasn’t to generate new knowledge. We already know that defects in grain mean lower quality grain, and these defects are visible in photographs. Rather, the objective was to use machine learning to automate the process of rating grain quality based on models which identify and classify any visible defects.
Projects like “Looks Like Grain” are engineering. They start with an accepted explanation and seek to build the best tools to turn that knowledge into automated systems that eliminate dangerous or repetitive manual work.
Projects like the Safe & Sound Data Showdown, on the other hand, are data science. They create and test new explanations. They generate new knowledge. The outcomes are less certain at the outset, and they require a combination of domain expertise and data skills.
I can’t wait to see all of the interesting explanations that the Humyn.ai community come up with during the Safe & Sound Data Showdown.
Perhaps we’ll even generate some new knowledge that helps us keep people safe at work!