100k Ways to Mislead with Data
In his book Understanding Variation, Dr. Donald Wheeler advises us to "trust no-one who cannot or will not provide the context for their figures." On the same page, he quotes Shewhart's Rule Two for the Presentation of Data:
"Whenever an average, range, or histogram is used to summarize data, the summary should not mislead the user into taking any action that the user would not take if the data were presented as a time series."
With this in mind, I present below Figure 1, a histogram that I spied online last night courtesy of a colleague. The original poster lamented Ontario being dead last in the analysis, although there is much more to lament with respect to what passes for numeracy these days:
Questions: Of what utility is this analysis? What does the analyst want us to conclude and, by consequence, act upon? What is the fulcrum on which their analysis rests?
Allow me to draw your attention to the choice of the unit of comparison: "Per 100,000 People". All very professional and statistical-sounding, right? This is where a lot of misleading conclusions in the media and online begin when grappling with disparate figures: by manipulating the denominator, i.e. the bottom number of a fraction. It leads us to presume that we now have an equal basis for comparing different populations or numerators. Trouble is, what are the populations we're comparing?
I looked up the 2020 Q4 estimates of each province's population and did some quick math to work out the actual cumulative number of vaccines delivered (take the population, divide by 100k, then multiply by the "Vaccines/100k" figure). I then sorted by that result and obtained a very different-looking chart:
Ontario is now #2 in rollout, with Nova Scotia in the "lanterne rouge" position, to borrow a term from the Tour de France. This aligns with what's been reported, and to see Quebec at the top, given their situation, is no surprise. PEI, the former leaderboard champion, is now just one position ahead of dead last - an analysis upended, it seems, at least as far as the aforementioned lamenter is concerned.
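For anyone who wants to reproduce the conversion, here is a minimal sketch in Python of the formula described above (population ÷ 100,000 × the per-100k figure). The example values are placeholders for illustration, not the actual 2020 Q4 estimates:

```python
def per_100k_to_absolute(per_100k: float, population: int) -> float:
    """Recover an absolute count from a 'per 100,000 people' rate."""
    return population / 100_000 * per_100k

# Illustrative only: a hypothetical province of 14.7M people reporting
# 500 vaccinations per 100k has actually delivered ~73,500 doses.
print(per_100k_to_absolute(per_100k=500, population=14_700_000))  # 73500.0
```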
Questions: In what ways is this analysis of any greater utility than the first one? What conclusions might we infer when the data is presented without manipulating the denominator? What other questions might we ask?
A second example for your consideration is presented in Figure 2, below, drawn from Toronto Public Health's Status of COVID-19 Cases site for a mid-town neighbourhood:
This seems to be an alarming figure!
Question: Given our inquiry above, what do we need to know to give this analysis more context?
Of course, we need to know the population, which turns out to be around 21,000. You might already see the problem, given the population is smaller than the denominator used as the basis for comparison. Doing some quick math (21,000 ÷ 100,000 × 283), we learn that the actual number of cases is probably around 60, spread out over almost the entire month of December. Quite a departure from 283, which might cause us to draw false conclusions about the reality.
Thankfully, we have an option to strip out the 100k denominator and see the actual number of cases Toronto Public Health has tracked over the course of three weeks in December:
From this, we can now infer the population total that was used as the basis for the "per 100k" comparison: 30k. Somewhere in between lies the actual population, but then, as Deming observed, there is no such thing as a fact concerning an empirical observation: we'll get as many answers as we have inquisitors.
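Here is a minimal sketch of that reverse calculation. The 283 per-100k figure comes from the dashboard; the actual case count is read off the chart, which is not reproduced here, so the 85 below is an illustrative stand-in chosen only because it lands on the ~30k implied population:

```python
RATE_PER_100K = 283  # rate reported for the neighbourhood
actual_cases = 85    # illustrative stand-in for the count shown on the chart

# Back out the population the dashboard must have used as its denominator.
implied_population = actual_cases / RATE_PER_100K * 100_000
print(round(implied_population))  # ~30,000
```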
Summary: Present Data Without Fancy Distortions
Averages, percentages, denominators - they all provide a convenient means of reducing large or complex numbers down into more manageable chunks that we believe can help us "see" into the data. Playing with the denominator in a fraction is a commonly practiced means of doing this, e.g. "per 100k" or "per million" units of comparison. However, this comes at the cost of stripping away context and meaning, and can mislead our audiences into drawing premature or erroneous conclusions.
In the same book I quote above, Dr. Wheeler provides us with better guidance from Shewhart's Rule One for the Presentation of Data:
"Data should always be presented in such a way that preserves the evidence in the data for all the predictions that might be made from these data."
Translation? Always provide links to the source data along with references on how the data was collected, by whom, and what they represent, in addition to how they were transformed and by what method.
Hopefully this brief post will forearm you with questions to ask whenever you see analyses that entreat you to compare populations using an arbitrary figure of convenience, like "per 100k". At the very least, you should be able to ask for the original source data so you can draw your own conclusions instead of relying on those of an armchair analyst with an agenda.
Further Reading:
Wheeler, Donald J. Understanding Variation: The Key to Managing Chaos. SPC Press, 1993.