What are “Microdata”?
Microdata are individual records that represent a specific person, household, establishment, or another unit of analysis. The photo above shows microdata in their most visceral form: demographic information that represents a person being punched into a piece of cardboard, part of the automated data processing for the 1940 US Census.
If you create a survey with a system like Google Forms or Qualtrics, these services will let you download the survey responses as a single spreadsheet. Here, the microdata are the individual rows of the spreadsheet: each row is an individual survey response. These are the microdata in their rawest and most difficult-to-consume form. That's because these data can be incomplete (not every question gets answered) and incorrect (not every question may be answered correctly).
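To make the "one row, one record" idea concrete, here is a minimal Python sketch that reads a small survey export and flags records that are incomplete or implausible. The column names, the sample values, and the 0 to 120 age range are all assumptions invented for illustration; they do not come from any real survey or from Census Bureau practice.

```python
import csv
import io

# Hypothetical survey export: one row per response (one microdata record).
# Column names and values are invented for illustration.
EXPORT = """respondent_id,age,household_size
1,34,2
2,,4
3,212,1
"""

def load_records(text):
    """Parse a CSV export into a list of dicts, one per survey response."""
    return list(csv.DictReader(io.StringIO(text)))

def find_problems(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Incomplete: any question left blank.
    for field, value in record.items():
        if value.strip() == "":
            problems.append(f"{field} is missing")
    # Incorrect: an answer outside an assumed plausible range.
    age = record["age"].strip()
    if age and not (0 <= int(age) <= 120):
        problems.append("age is implausible")
    return problems

records = load_records(EXPORT)
report = {r["respondent_id"]: find_problems(r) for r in records}
for respondent_id, problems in report.items():
    print(respondent_id, problems)
```

In this toy export, the first response is clean, the second skipped the age question, and the third reports an age that is almost certainly a typo.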
When I joined the US Census Bureau back in 2017, one of my biggest surprises was learning that the Census Bureau edits the survey responses that it receives. I had always thought that administering a survey was like measuring the pH of an acid in a chemistry lab: changing the data that you collected was somehow akin to scientific fraud.
I have since learned that it is common to edit survey responses. Sometimes the editing is as simple as throwing out the microdata that are incomplete or obviously incorrect. But editing can be far more involved. What's important is to make sure that all of the data are edited using rules that are clear and consistently applied. Otherwise, the editing can do significant harm to the scientific validity of any analysis for which you might use the data. (Dan Bouk's book Democracy's Data: The Hidden Stories in the US Census and How to Read Them goes into detail about some of the edits that were made as part of the 1940 Census.)
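One way to keep edits clear and consistent is to write each rule down as code and run every record through the same list of rules, logging what was dropped and why. The sketch below shows only the simplest kind of edit, discarding records that fail a rule. The rule names and the age range are hypothetical, and this is not a description of how the Census Bureau edits its data.

```python
# Each rule is a (name, check) pair. A record is kept only if every
# check returns True. Rule names and thresholds are hypothetical.
RULES = [
    ("age_present", lambda r: r.get("age") is not None),
    ("age_plausible", lambda r: r.get("age") is None or 0 <= r["age"] <= 120),
]

def apply_edits(records, rules):
    """Apply the same rules to every record.

    Returns the records that passed, plus a log of (index, failed_rules)
    so that every dropped record can be explained later.
    """
    kept, log = [], []
    for index, record in enumerate(records):
        failed = [name for name, check in rules if not check(record)]
        if failed:
            log.append((index, failed))
        else:
            kept.append(record)
    return kept, log

raw = [{"age": 34}, {"age": None}, {"age": 212}]
kept, log = apply_edits(raw, RULES)
print(kept)
print(log)
```

Because the rules live in one list and the log records which rule rejected which record, anyone reviewing the analysis can see exactly what was edited and check that the same treatment was applied throughout.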
Students in some data science courses frequently download and analyze publicly available microdata. For example, the Integrated Public Use Microdata Series USA website, better known as IPUMS USA, allows students, researchers, and pretty much anyone with a valid purpose to download a range of microdata from the US decennial census as well as the more detailed American Community Survey (ACS). These datasets contain individual records that represent individual people, bound together with an additional record that represents the residence in which they lived and, in some cases, may still live. That is, these are real data from real people. For this reason, the IPUMS website exhorts researchers to "USE IT FOR GOOD—NEVER FOR EVIL."
(The creator of IPUMS, Steven Ruggles, was recently awarded a MacArthur Foundation fellowship in recognition of his work "setting new standards in quantitative historical research by building the world's largest publicly available database of population statistics.")
The US Census Bureau collects its data under a pledge of confidentiality, so the microdata it releases are processed beforehand to make it difficult for researchers to identify the specific person on whom each microdata record is based. For example, instead of reporting each person's street address and city, each record is located within a Public Use Microdata Area (PUMA), a statistical geographic area that contains no fewer than 100,000 people.
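The geographic part of that processing can be pictured as replacing precise location fields with a coarse area code. The following Python sketch is a toy illustration only: the field names, city names, and area codes are hypothetical (they are not real PUMA codes), and actual Census Bureau disclosure avoidance involves far more than this single step.

```python
# Hypothetical lookup from city to a coarse statistical area.
# These are invented codes, not real PUMA identifiers.
AREA_BY_CITY = {
    "Springfield": "AREA-001",
    "Shelbyville": "AREA-001",
    "Capital City": "AREA-002",
}

PRECISE_FIELDS = ("street_address", "city")

def coarsen(record, area_by_city):
    """Return a copy of the record with precise location fields removed
    and replaced by a single coarse area code."""
    public = {k: v for k, v in record.items() if k not in PRECISE_FIELDS}
    public["area"] = area_by_city[record["city"]]
    return public

internal = {
    "age": 34,
    "street_address": "742 Evergreen Terrace",
    "city": "Springfield",
}
released = coarsen(internal, AREA_BY_CITY)
print(released)
```

The released record still supports analysis by region, but it no longer carries the address that would point to one household.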
Here are some sources to consider if you are looking for microdata:
Statistical microdata are useful for academics conducting research, but they can also be useful for businesses trying to understand potential markets.
_____________
Simson Garfinkel was the Senior Computer Scientist for Confidentiality and Data Access during the 2020 Census and was then the Senior Data Scientist at the US Department of Homeland Security, where he worked on data governance issues. His most recent book, Law and Policy for the Quantum Age, is now available as an audiobook from Audible.