US Census “disappears” small communities: A case study in outliers, unbalanced data, and extreme values.
Neil Hamlett, D.Sc., MBA
AI-implementation consultant: creates value from data through strategic alignment and system-of-systems methodologies for navigating uncertainty.
A recent New York Times article (G. Wezerek and D. Van Riper, February 6, 2020) suggested data de-identification methods by the U.S. Census Bureau might cause some small communities to be omitted from the decennial count. This assertion arises from a misunderstanding. But it nonetheless provides an interesting teaching point for statistical analysis and its limitations.
First, to the point of Wezerek’s and Van Riper’s article. The Census Bureau seeks to make its data more-easily accessible to researchers. Citizens’ privacy concerns make that more and more difficult. The Census Bureau’s solution? Keep two sets of books: One public, and one private.
The “official” set contains all of the information about every village and borough. This is the set on which policy decisions are based. The official set provides the basis for distribution of federal funds, such as school-district funding under the Elementary and Secondary Education Act. But this data is tightly protected. Access is granted on a need-to-know basis. Identity bandits salivate over the idea of getting complete, official Census data.
There is an alternative set for which access is less-stringent. The bureau's Data Ingest and Linkage framework employs probabilistic linkage to make it very difficult to associate individual citizens' responses with their personally identifying information (PII). The National Institute of Standards and Technology describes standards and methods with which federal agencies must comply.
What does this mean for researchers performing studies with Census data? Census' methods provide them with data sets that are approximately statistically equivalent to the actual data but differ in the details. Researchers' conclusions may therefore be imprecise to the degrees typical of statistical analysis in general.
What are the sources of these imprecisions here? We briefly consider three: outliers, unbalanced data, and extreme values. Census' deidentification process offers a case study for considering some of these effects.
Outliers.
Statistics in general is about examining samples to make generalizations about populations. This very fundamental concept is taught to sixth-grade math students in New York [NY Math Learning Standard 6.SP.A.1]. Methods from classical statistics moreover assume that phenomena can be described by "well-behaved" probability distributions.
The famous "bell curve" — aka, the "Normal" or "Gaussian" distribution — is one of the most-famous. There is even a mathematical principle — the Central Limit Theorem — that suggests that when you mix enough stuff up, it should all look "normal". And, in normal distributions, things should mostly stay pretty close to the average.
In reality, the world does not turn out to be a very "normal" place, in terms of statistical distributions. We get lots of outliers. Wired magazine thought leader Chris Anderson described this phenomenon in his classic article "The Long Tail". The "tail" here is the part of the distribution distant from its peak. Anderson observed that there is a lot going on out there in the tail.
The US retail industry, for example, is dominated by behemoths that play the opposite sides of this. Walmart seems to market to the middle. Its entire supply chain seems designed to maximize the revenue per cubic foot of floor space. It carries the things most people want. Amazon, in contrast, caters to eccentric tastes — the tail.
The chart here depicts data from the Ames, IA housing market, a Kaggle data set beloved of data scientists. We see two outliers here: two properties that sit particularly far from the overall trend. Outliers like these can be very hard to explain with statistical analysis alone.
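Readers who want to poke at the same data could do something along the lines of the hypothetical sketch below. The file name and cutoff values are assumptions about the Kaggle "House Prices" training file (which carries GrLivArea and SalePrice columns), not the code behind the chart.

```python
import pandas as pd

# Hypothetical sketch, not the chart's actual analysis: flag very large
# properties that sold for unusually little in the Kaggle Ames housing data.
# The file name and both cutoffs are assumptions; adjust them to your copy.
ames = pd.read_csv("train.csv")

suspects = ames[(ames["GrLivArea"] > 4000) & (ames["SalePrice"] < 300_000)]
print(suspects[["Id", "GrLivArea", "SalePrice"]])
```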
The figure below is from Wezerek and Van Riper. It shows what happens to census data as a result of de-identification. The statistical "shuffling" tries to maintain consistency in the data overall, presumably at some administrative level such as states and major metropolitan areas. Smaller jurisdictions can therefore change a lot proportionally, within statistical variation the bureau apparently considers acceptable.
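A toy numerical illustration, which is emphatically not the bureau's actual disclosure-avoidance algorithm, shows why this matters: noise of a fixed size barely moves a large count but can swing a small one by a large percentage.

```python
import numpy as np

# Toy sketch only; NOT the Census Bureau's disclosure-avoidance algorithm.
# It simply shows that noise of the same magnitude barely moves a large
# population count but can shift a small one by a large percentage.
rng = np.random.default_rng(seed=2)

true_counts = np.array([250_000, 12_000, 800, 35])            # hypothetical jurisdictions
noise = rng.integers(low=-25, high=26, size=true_counts.size)
noisy_counts = np.clip(true_counts + noise, 0, None)

for true, noisy in zip(true_counts.tolist(), noisy_counts.tolist()):
    change = 100 * (noisy - true) / true
    print(f"true {true:>7,d}   published {noisy:>7,d}   change {change:+.1f}%")
```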
Unbalanced data.
In statistics, classification is about trying to tell whether something fits into one group or another. The picture below is an abstract illustration from the documentation of a popular open-source tool for statistical calculation. We see three groups, each distinguished by color. We want to be able to assign any member to its proper color group given just its coordinates.
Assigning samples to categories here is difficult for two reasons. First, not all of them fall in their corresponding regions in terms of the x- and y-axes. Simon Haykin, one of the leading researchers in a family of methods called Artificial Neural Networks, calls the relevant property "linear separability", and these data lack it. We cannot draw any set of straight lines that cleanly separates the samples into their classes. Some of the green samples inevitably fall into the yellow-colored region.
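The sketch below, using synthetic clusters rather than the figure's actual data, makes the point: even the best straight-line boundaries a linear classifier can find will mislabel some points when the groups overlap.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic illustration of imperfect linear separability: three overlapping
# clusters that no set of straight lines separates cleanly, so a well-fit
# linear classifier still mislabels a fraction of the training points.
X, y = make_blobs(n_samples=600, centers=3, cluster_std=2.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"training accuracy of a linear classifier: {clf.score(X, y):.2f}")  # below 1.00
```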
The other problem is that most of the samples above fall into the yellow-colored category. The data are unbalanced. The table here to the right is from a classic statistics textbook [Agresti, 2013]. It reports a study associating alcohol consumption with birth defects (malformation). We see that these occur 48 ÷ (48 + 17,066) ≈ 0.28% of the time. A naïve diagnostic test would simply say that malformations never occur. It would be right 99.72% of the time.
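A quick back-of-the-envelope check, using only the counts quoted above, shows how misleading raw accuracy is on unbalanced data:

```python
# Counts quoted from the Agresti table above: 48 malformations, 17,066 without.
present, absent = 48, 17_066
total = present + absent

prevalence = present / total          # how often malformations actually occur
naive_accuracy = absent / total       # a "test" that always says "no malformation"

print(f"prevalence:     {prevalence:.2%}")      # ~0.28%
print(f"naive accuracy: {naive_accuracy:.2%}")  # ~99.72%, yet it finds 0 of 48 cases
```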
Extreme values.
We observed above that the "real" world is not "normal", at least from the perspective of a bell-curve probability distribution. Technology luminary Chris Anderson observed as much with his "Long Tail" phenomenon. Some things, like casino games and certain physical phenomena, do behave "nicely". But the randomness and variation in much of our experience cannot be well explained by the Central Limit Theorem.
Nassim N. Taleb, another prominent thinker, explored this in greater detail. His book The Black Swan examines the concept in depth. Taleb describes a Black Swan as a special type of outlier with three key characteristics. First, Black Swans lie outside the realm of regular expectations. Second, they have extreme impacts. Third, they are predictable only retrospectively.
Taleb goes on to elaborate that socioeconomic phenomena are decidedly non-Gaussian. He argues that they tend to conform to power-law distributions, which have long tails. Market-price movements, moreover, are modeled as log-normal in a popular asset-pricing model.
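A small simulation, with an arbitrary Pareto shape parameter chosen purely for illustration, shows how differently the two worlds behave out in the tail:

```python
import numpy as np

# Illustrative comparison of a "nice" Gaussian world with a heavy-tailed,
# power-law (Pareto) world. The shape parameter a=1.5 is an arbitrary choice.
rng = np.random.default_rng(seed=1)
n = 1_000_000

gaussian = rng.normal(loc=0.0, scale=1.0, size=n)
pareto = rng.pareto(a=1.5, size=n) + 1.0        # classical Pareto, minimum value 1

print("Gaussian draws beyond 6 standard deviations:", int(np.sum(np.abs(gaussian) > 6)))
print("Pareto draws more than 100x their minimum:  ", int(np.sum(pareto > 100)))
# The first count is essentially zero; the second runs to roughly a thousand.
```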
What does this mean for practical applications of statistical calculation? In the case of the Census Bureau's deidentification methodology, the types of phenomena Wezerek and Van Riper describe are exactly the ones likely to be affected. Their critique focuses on small communities. These communities are outliers in the big scheme of things.
More generally, caution should be applied to statistical analysis of socioeconomic phenomena. Many now-mainstream statistical approaches relax assumptions about underlying statistical distributions. These models, commonly referred to as "machine learning", are largely non-parametric. Black Swans are nonetheless going to be problematic for statistical analysis.
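A synthetic sketch of that limitation: a flexible, non-parametric model fits the range of data it has seen, but asked about an input far outside that range, it answers as though nothing unusual were happening.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic illustration: a non-parametric model (random forest) is trained on
# inputs between 0 and 10, where y is roughly 3x. Asked about x = 100, a value
# it has never seen, it predicts as if x were still near 10.
rng = np.random.default_rng(seed=4)
x_train = rng.uniform(0, 10, size=(500, 1))
y_train = 3.0 * x_train.ravel() + rng.normal(scale=1.0, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(x_train, y_train)
print("prediction at x = 100:", round(float(model.predict([[100.0]])[0]), 1))  # ~30, not ~300
```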
What's a Data Scientist to do?
Quants in general owe transparency and openness to the users of their work products. Statistical practice is complex. Numerous opportunities exist to get it wrong. In fact, in 2016 the American Statistical Association broadcast a warning about the misuse of a benchmark statistical metric, the p-value.
Modesty is called for in general. George Box, son-in-law of one of the "founders" of modern statistics, famously said, "All models are wrong. Some models are useful" [Box, 1976]. This was in no way intended to suggest statistics is a punt. Box's oft-misunderstood article describes his father-in-law's tenacious pursuit of the most-complete-possible explanation of uncertainty about the phenomena he studied.
More recently, Thomas Davenport, one of the most-prolific observers of contemporary technology, gave advice to consumers of quant work products [Davenport, 2013]. He advises consumers to continuously challenge their quants' work products. A Harvard Business Review piece published just days before this writing raises warning flags about misunderstandings between data scientists and the organizational leaders they support [J. Shapiro, 2020]. A McKinsey Global Survey in 2019 identified "explainability", the ability to explain how AI models come to their decisions, as a key risk in AI adoption [McKinsey, 2019].
The metrology community (specialists in standards for engineering, industrial, and scientific measurements) treats uncertainty very seriously. Its practitioners take the position "that a measurement result is informative if and only if it includes also an estimate of the incompleteness of this information" [S. Salicone, M. Prioli, 2018]. The same considerations certainly apply to data-science work products based on statistical estimation.
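In code, the habit is as simple as never printing an estimate without a companion statement of its uncertainty. The sketch below uses made-up measurements and a normal-approximation interval purely for illustration.

```python
import numpy as np

# Illustrative only: report an estimate together with a statement of its own
# uncertainty, here a 95% normal-approximation confidence interval.
rng = np.random.default_rng(seed=3)
measurements = rng.normal(loc=10.2, scale=0.4, size=30)   # made-up measurements

mean = measurements.mean()
stderr = measurements.std(ddof=1) / np.sqrt(measurements.size)
print(f"estimate: {mean:.2f} +/- {1.96 * stderr:.2f} (95% CI)")
```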
What about the Census Bureau's deidentification methodology? It seems that clearer, more-transparent communication might have headed off Wezerek's and Van Riper's misunderstanding. Quite obviously, the bureau faces perhaps-unprecedented pressures in its conduct of the 2020 census. Nonetheless, the public's trust in its data is fragile. No opportunity should be missed to mitigate the risk that nihilism about that data spreads.