What are “Microdata?”
The punch cards for the 1940 Census are examples of microdata. Photo courtesy US Census Bureau.

Microdata are individual records that represent a specific person, household, establishment, or another unit of analysis. The photo above shows microdata in their most visceral form: demographic information that represents a person being punched into a piece of cardboard, part of the automated data processing for the 1940 US Census.

If you create a survey with a system like Google Forms or Qualtrics, the service will let you download the survey responses as a single spreadsheet. Here, the microdata are the individual rows of the spreadsheet: each row is an individual survey response. These are the microdata in their rawest and most difficult-to-consume form, because the data can be incomplete (not every question gets answered) and incorrect (not every answer given is accurate).
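
As a rough illustration of what "incomplete" and "incorrect" look like in practice, here is a minimal Python sketch that reads a survey export and flags problem rows. The column names, the sample rows, and the 0 to 120 age range are all invented for this example; they are not taken from any real survey.

```python
import csv
import io

# Hypothetical survey export: one row per response (one microdata record).
SURVEY_CSV = """respondent_id,age,household_size,state
1,34,3,MA
2,,2,CA
3,212,1,NY
4,51,4,
"""


def load_records(text):
    """Parse CSV text into a list of dicts, one dict per survey response."""
    return list(csv.DictReader(io.StringIO(text)))


def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Incomplete: a question that was left unanswered.
    for field, value in record.items():
        if value is None or value.strip() == "":
            problems.append(f"missing {field}")
    # Incorrect: an answer that cannot plausibly be true (assumed range).
    age = record.get("age") or ""
    if age.strip().isdigit() and not 0 <= int(age) <= 120:
        problems.append("implausible age")
    return problems


records = load_records(SURVEY_CSV)
report = {r["respondent_id"]: find_problems(r) for r in records}
for respondent_id, problems in report.items():
    print(respondent_id, problems or "ok")
```

Note that the sketch only reports problems. Deciding what to do about them is a separate step.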

When I joined the US Census Bureau back in 2017, one of my biggest surprises was learning that the Census Bureau edits the survey responses that it receives. I had always thought that administering a survey was like measuring the pH of an acid in a chemistry lab: changing the data that you collected was somehow akin to scientific fraud.

I have since learned that it is common to edit survey responses. Sometimes the editing is as simple as throwing out the microdata that are incomplete or obviously incorrect. But editing can be far more involved. What's important is to make sure that all of the data are edited using rules that are both clear and consistently applied. Otherwise, the editing can do significant harm to the scientific validity of any analysis for which you might use the data. (Dan Bouk's book Democracy's Data: The Hidden Stories in the US Census and How to Read Them goes into detail about some of the edits that were made as part of the 1940 Census.)
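
One way to keep edit rules clear and consistently applied is to write them down as code and run every record through the same rules in the same order, keeping a log of what changed. The sketch below is only an illustration of that idea; the two rules, the field names, and the sample records are hypothetical and are not the Census Bureau's actual edit procedures.

```python
def blank_implausible_age(rec):
    """Rule 1 (assumed): treat ages outside 0-120 as not reported."""
    age = rec["age"]
    if age is not None and not 0 <= age <= 120:
        return {**rec, "age": None}  # return an edited copy
    return rec


def drop_if_age_missing(rec):
    """Rule 2 (assumed): discard responses with no usable age."""
    return None if rec["age"] is None else rec


# The rules are applied to every record, in this fixed order.
RULES = [
    ("blank_implausible_age", blank_implausible_age),
    ("drop_if_age_missing", drop_if_age_missing),
]


def apply_rules(records, rules):
    """Apply each rule to each record; return edited records and an audit log."""
    edited, log = [], []
    for original in records:
        rec = dict(original)  # never modify the raw response
        for name, rule in rules:
            result = rule(rec)
            if result is None:
                log.append((original["id"], name, "dropped"))
                rec = None
                break
            if result != rec:
                log.append((original["id"], name, "changed"))
            rec = result
        if rec is not None:
            edited.append(rec)
    return edited, log


raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 212},
]
edited, log = apply_rules(raw, RULES)
for entry in log:
    print(entry)
```

Because the raw records are never modified and every edit is logged, someone else can later check exactly which rule touched which response.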

Students in some data science courses frequently download and analyze publicly available microdata. For example, the Integrated Public Use Microdata Series USA website, better known as IPUMS USA, allows students, researchers, and pretty much anyone with a valid purpose to download a range of microdata from the US decennial census as well as the more detailed American Community Survey (ACS). These datasets contain individual records that represent individual people, bound together with an additional record that represents the residence in which they lived and, in some cases, may still live. That is, these are real data from real people. For this reason, the IPUMS website exhorts researchers to "USE IT FOR GOOD—NEVER FOR EVIL."

(The creator of IPUMS, Steven Ruggles, was recently awarded a MacArthur Foundation fellowship in recognition of his work "setting new standards in quantitative historical research by building the world's largest publicly available database of population statistics.")

The US Census Bureau collects its data under a pledge of confidentiality, so the microdata are processed before release to make it difficult for researchers to identify the specific person on whom each microdata record is based. For example, instead of reporting each person's street address and city, each record is located within a Public Use Microdata Area (PUMA), a statistical geographic area that contains no fewer than 100,000 people.
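
The geographic part of that processing can be pictured with a small Python sketch: direct identifiers are dropped and the precise location is replaced with a much larger area. Everything here is hypothetical, including the field names, the lookup table, and the placeholder area codes, and the Census Bureau's real disclosure-avoidance processing involves far more than this one step.

```python
# Hypothetical lookup from a fine-grained locality to a coarse area code.
# Real PUMA assignments come from Census Bureau geography files.
LOCALITY_TO_AREA = {
    "locality-001": "AREA-A",
    "locality-002": "AREA-A",
    "locality-003": "AREA-B",
}

# Fields that would point too directly at a specific person (assumed names).
DIRECT_IDENTIFIERS = ("name", "street_address", "city", "locality")


def coarsen(record, lookup):
    """Return a copy of the record with identifiers removed and area added."""
    public = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    public["area"] = lookup[record["locality"]]
    return public


internal = [
    {"name": "Person One", "street_address": "1 Example St",
     "city": "Exampleville", "locality": "locality-001", "age": 34},
    {"name": "Person Two", "street_address": "2 Sample Ave",
     "city": "Sampleton", "locality": "locality-003", "age": 51},
]
public_records = [coarsen(r, LOCALITY_TO_AREA) for r in internal]
for rec in public_records:
    print(rec)
```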

Here are some sources to consider if you are looking for microdata:

  • IPUMS, with more than nine broad categories of data, including demographics, global health, geography, time use, and higher education.
  • The World Bank Microdata Library, with over 4000 datasets.
  • The US Census Bureau's Restricted-Use Microdata, which can only be accessed from the Federal Statistical Research Data Centers by qualified researchers who have undergone a background check and have a pre-approved scientifically valid research project.

Statistical microdata are useful for academics conducting research, but they can also be useful for businesses trying to understand potential markets.

_____________

Simson Garfinkel was the Senior Computer Scientist for Confidentiality and Data Access during the 2020 Census and was then the Senior Data Scientist at the US Department of Homeland Security, where he worked on data governance issues. His most recent book, Law and Policy for the Quantum Age, is now available as an audiobook from Audible.

That's fantastic! Embracing data can lead to insightful discoveries - as one might say, understanding data offers the key to creating impactful strategies. Keep inspiring!

Marjory Blumenthal

Science and technology policy and strategy expert, mentor, advisor, and author

2 years ago

How do you fit “data cleaning” in?

John Kelly IV

Using data to build better work environments

2 years ago

Hi Simson Garfinkel, thanks for helping explain the interesting world of census data. I've used the American Community Survey in some HR practices to develop job pay rates for my employees (combining it with paid market data). Our group Human Resources Science (https://www.dhirubhai.net/groups/12704693/) is following your work to see how else we could use different datasets to improve our HR practices. Thanks for everything you're doing!

M. Alejandra Parra-Orlandoni (mapo)

COO @ Pasteur Labs | Product strategy, experience design | Sociotechnical systems, responsible innovation | McKinsey & Meta alum, Navy Veteran

2 years ago

Looking forward to reading it!

Wendy Wolfson

Scientific writer

2 years ago

Signing up!
