Data Science, As Practiced
Thirty years ago, Deming poked fun at journalists who described as a catastrophe the fact that half the teachers performed below the median. Medians, means, and modes are now taught in middle school, and data science is a hot field. Still, judging from current publications, it is not obvious that data literacy has improved.
I saw this today in a report on manufacturing trends: "Only 9% of CIOs report having an accurate inventory of personal data for employees and customers." Such a statement raises a number of questions:
- How many companies have a CIO?
- Are companies with a CIO better or worse than others in data accuracy?
- Was this calculated on a random sample of CIOs, or on the self-selected sample of those who filled out a questionnaire?
- The expression "CIOs report" suggests checkmarks on a questionnaire. What was the methodology?
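To see why the sampling question matters, here is a minimal sketch of how self-selection alone can move a reported percentage. Every number in it is hypothetical: the function name `observed_share`, the 30% true share, and the response rates are assumptions made up for illustration, not figures from the report.

```python
def observed_share(true_share, response_rate_yes, response_rate_no):
    """Share of survey respondents answering "yes" when the two groups
    respond to the questionnaire at different rates."""
    yes_respondents = true_share * response_rate_yes
    no_respondents = (1 - true_share) * response_rate_no
    return yes_respondents / (yes_respondents + no_respondents)


# Hypothetical population: 30% of CIOs really have an accurate inventory.
true_share = 0.30

# Random sample: both groups are equally likely to answer.
random_sample = observed_share(true_share, 0.10, 0.10)

# Self-selected sample: assume CIOs without an accurate inventory are
# four times as likely to fill out the questionnaire (made-up rates).
self_selected = observed_share(true_share, 0.05, 0.20)

print(f"True share:              {true_share:.1%}")
print(f"Random-sample estimate:  {random_sample:.1%}")
print(f"Self-selected estimate:  {self_selected:.1%}")
```

With equal response rates, the survey recovers the true share. With unequal ones, the same population produces a very different headline number, which is why the methodology needs to be stated.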
In another part of the report: "Between 1998 and 2012, the cost of complying with manufacturing-related rules grew far more rapidly (7.6%) than manufacturing output (0.4%)." This, too, raises several questions:
- Are the percentages per year or for the whole 15-year period? (They are per year.)
- What is the source of the data?
- How are the "costs of complying with manufacturing-related rules" estimated?
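The difference between a yearly rate and a whole-period rate is not a detail. As a rough sketch, the yearly rates quoted above can be compounded over the period. The count of 14 annual steps from 1998 to 2012 is my assumption, since the report does not say how it counts years, and `cumulative_factor` is a helper name made up for this illustration.

```python
def cumulative_factor(annual_rate, years):
    """Total growth factor after compounding annual_rate over years."""
    return (1 + annual_rate) ** years


# Assumption: 14 year-over-year steps between 1998 and 2012.
years = 14

cost_factor = cumulative_factor(0.076, years)    # compliance cost, 7.6% per year
output_factor = cumulative_factor(0.004, years)  # manufacturing output, 0.4% per year

print(f"Compliance cost multiplied by {cost_factor:.2f} over the period")
print(f"Manufacturing output multiplied by {output_factor:.2f} over the period")
```

Read as yearly rates, the figures mean that compliance costs nearly tripled while output grew by only a few percent. Read as whole-period rates, they would describe a far milder change.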
This sloppiness is not limited to manufacturing. A California real estate analysis website recently offered the following insight: "San Francisco is less family-centric than the surrounding county with 28.87% of the households containing married families with children. The county average for households married with children is 28.87%."
Given that San Francisco is both a city and a county, it is not surprising that the numbers should be the same. What is more surprising is that the same page contains the following table showing different values for city and county:
and a chart of home appreciation with no explanation whatsoever as to how it is calculated:
There are many ways to calculate home appreciation, and they yield vastly different results. And, again, the city and county cannot have different curves, given that they share the same boundaries.
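As an illustration of how much the calculation method matters, the sketch below applies two common approaches to the same made-up sales: the change in median sale price, and the average appreciation of homes that sold in both years. The sale records are entirely hypothetical, and nothing here is meant to reproduce how the website computed its chart.

```python
from statistics import median

# Hypothetical sale records: (home_id, price_in_year_1, price_in_year_2).
# None means the home did not sell that year. All figures are made up.
sales = [
    ("A", 500_000, 550_000),
    ("B", 800_000, 880_000),
    ("C", 600_000, None),
    ("D", None, 1_500_000),
    ("E", None, 1_400_000),
]

# Method 1: change in the median price of whatever sold each year.
year1_prices = [p1 for _, p1, _ in sales if p1 is not None]
year2_prices = [p2 for _, _, p2 in sales if p2 is not None]
median_change = median(year2_prices) / median(year1_prices) - 1

# Method 2: average appreciation of homes that sold in both years.
repeat_sales = [(p1, p2) for _, p1, p2 in sales
                if p1 is not None and p2 is not None]
repeat_change = sum(p2 / p1 - 1 for p1, p2 in repeat_sales) / len(repeat_sales)

print(f"Median-price method: {median_change:.0%}")
print(f"Repeat-sales method: {repeat_change:.0%}")
```

In this toy data set, the first method is dominated by a shift in which homes happened to sell, while the second tracks the same homes over time. A chart that does not say which method it uses cannot be interpreted.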
The above excerpts are from reports that are all available free of charge on the web, from organizations that sell more in-depth analyses. But with free samples like these, I wonder what the incentive is to pay for more.
As both readers and creators of charts, infographics, and analytics in other forms, we need to up our game. They are tools for communication, not decoration, and we must treat them as such. As readers, we must demand that they be generated by analysts, not graphic artists, and that they have substance, rigor, and clarity. Everything we see must have a point and be backed up with documented sources. As authors, we must address our work to readers who care and won't be snowed by flashy graphics.
Thank you for reading. I blog at michelbaudin.com, on ideas from manufacturing operations and on more general topics here on LinkedIn.