“Where is my Variable?!”: Data Documentation and Answering User Questions
Sometimes you know about certain data points, but you don't recognize them in the analysis. Hm!

*I may be compensated when you click on my links, but you will not be charged.

Have you ever been in a situation where you are aware of some variables in a dataset – but you don’t recognize them when they are shown to you in an analysis? I’ll give you three examples from my experience.

  1. Research Data Collection Scenario: Professionals involved in data collection (including clinicians) know which measurements they took. Yet when biostatisticians present results, these professionals don’t recognize them. It’s as if the variables they know they measured are no longer familiar.
  2. Data Mart Scenario: Professionals who maintain data source systems also analyze data from a data mart (a small data warehouse) that receives the source datasets and integrates them. However, when these professionals go to the data mart to analyze the data, they are confused, and don’t find the variables familiar. Essentially, they don’t recognize the source dataset variables they know so well once those variables land in the data mart.
  3. Name Change Scenario: Analyst Jennifer runs a SQL database, and takes an extract from it to do research. Jennifer does some analyses in SQL using the native variable names (which are long). Analyst Vena is asked to replicate Jennifer’s analysis, so Jennifer gives Vena access to a SQL view containing the extract she used in her research. Vena copies the data into her SAS environment, but renames all the variables, because many exceed SAS’s limits on variable name length. When Vena presents her replicated analysis to Jennifer, Jennifer doesn’t recognize it and can’t interpret it, because it uses Vena’s SAS field names, not Jennifer’s SQL field names.

How Do You Solve This Problem?

On one hand, you have one big-picture problem; on the other, you have three little-picture problems. The solution to each little-picture problem above is to do forensic research and create documentation that allows all stakeholders to understand what is going on.

  • In Scenario #1, that documentation would unite the understanding of the data collectors and their variables with the results presented by the biostatisticians.
  • In Scenario #2, that documentation would unite the understanding of the people running the data mart and the other team maintaining the source systems and analyzing the data from the data mart.
  • In Scenario #3, Jennifer and Vena would just need to agree on a way of documenting a crosswalk between Jennifer’s SQL field names, and Vena’s SAS field names – again, a matter of doing forensics, and creating documentation that will engender a common understanding among stakeholders.
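For Scenario #3, the crosswalk can be as simple as a two-column lookup that both analysts agree on. Here is a minimal sketch in Python; all of the field names below are invented for illustration (the real SQL names and SAS names would come from Jennifer’s and Vena’s actual systems), and the only real constraint assumed is SAS’s 32-character limit on variable names.

```python
# Hypothetical crosswalk between long SQL field names and the shorter
# SAS field names chosen to fit SAS's 32-character name limit.
# Every name here is made up for illustration.
SQL_TO_SAS = {
    "patient_date_of_first_clinic_visit": "first_visit_dt",
    "patient_primary_insurance_provider": "ins_provider",
    "patient_body_mass_index_at_intake": "bmi_intake",
}

def to_sas_names(record: dict) -> dict:
    """Rename a record's keys from SQL names to SAS names."""
    return {SQL_TO_SAS.get(k, k): v for k, v in record.items()}

def to_sql_names(record: dict) -> dict:
    """Reverse the crosswalk so results map back to the SQL names."""
    sas_to_sql = {v: k for k, v in SQL_TO_SAS.items()}
    return {sas_to_sql.get(k, k): v for k, v in record.items()}
```

Because the mapping is reversible, either analyst can translate the other’s output back into the names she recognizes; in practice you would keep this crosswalk in a shared document (even a simple Excel sheet) rather than only in code.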

What Documentation Do You Make?

I always preface the answer to this question with the caveat that it really depends on the specific scenario. But in a generic sense, there are a few items, which I describe in more detail in my blog post on the topic, that can greatly help unite understanding across stakeholder groups.

Data Dictionary: At the very least, you should have a data dictionary focused on documenting the common variables that are the subject of communication (and potential misunderstanding) among your stakeholder groups. Extensive data dictionaries that include well-documented picklists and crosswalks are encouraged, because they only increase understanding. Here is an example of one component of a data dictionary from a vehicle registration database that I found on Wikimedia Commons.

Here is an example of a data dictionary for a vehicle registration database.
Ethacke1, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons, available here: https://commons.wikimedia.org/wiki/File:Data_Dictionary.png
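If you are starting a data dictionary from scratch, you can even bootstrap a skeleton of it from the data itself and then fill in the human-readable descriptions by hand. Here is a minimal sketch in Python using only the standard library; the field names and values are toy stand-ins for a vehicle registration extract, not real data.

```python
import csv
import io

# Toy rows standing in for a vehicle registration extract;
# the field names and values are invented for illustration.
raw = io.StringIO(
    "plate_no,make,year\n"
    "ABC123,Toyota,2018\n"
    "XYZ789,Ford,2021\n"
)

def dictionary_skeleton(reader):
    """Build a starter data dictionary: one row per variable, with an
    example value and a placeholder description for a human to fill in."""
    rows = list(reader)  # reading the rows also populates fieldnames
    return [
        {
            "variable": name,
            "example": rows[0][name] if rows else "",
            "description": "TODO: describe this variable",
        }
        for name in reader.fieldnames
    ]

skeleton = dictionary_skeleton(csv.DictReader(raw))
```

The output is one entry per variable that you can paste into Excel or Word, where the stakeholders who actually know each variable can replace the placeholder descriptions.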

Well-organized Extract-Transform-Load (ETL) Code and Pipeline Documentation: Many times when I talk to people in the health analytics space about ETL, they indicate they do not know what that is. If you want to learn more about ETL, I encourage you to take my online course in Application Basics. But briefly, ETL is what you do to take raw data you receive from a data provider (such as raw data you download from a SurveyMonkey survey) and turn it into the analytic dataset you use for your research analysis. Below are examples I found on Wikimedia Commons of documentation for data transformation pipelines.

Here is a diagram showing the steps in a pipeline where data are backed up.
Sukari at English Wikipedia, Public domain, via Wikimedia Commons, available here: https://commons.wikimedia.org/wiki/File:Backup-DFD.png?uselang=zh-my
This flow chart shows the steps in data processing for the "COB bucket".
Alexji78, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons, available here: https://upload.wikimedia.org/wikipedia/commons/0/0f/COB_BUCKET_FLOW_CHART.jpg
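To make the extract–transform–load idea concrete, here is a minimal sketch in Python using only the standard library. The raw survey export, its field names (`q1_age`, `q2_smoker`), and the recoding rules are all invented for illustration; a real survey export and a real analytic dataset would look different, but the three-step shape is the same.

```python
import csv
import io

# Stand-in for a raw survey export (e.g., a SurveyMonkey download).
# Field names and values are invented for illustration.
raw_export = io.StringIO(
    "respondent_id,q1_age,q2_smoker\n"
    "1,34,Yes\n"
    "2,not stated,No\n"
)

def extract(source):
    """Extract: read the raw export as-is."""
    return list(csv.DictReader(source))

def transform(rows):
    """Transform: rename fields and recode values for analysis."""
    analytic = []
    for row in rows:
        analytic.append({
            "id": int(row["respondent_id"]),
            # Recode free-text age to an integer; None when missing.
            "age": int(row["q1_age"]) if row["q1_age"].isdigit() else None,
            # Recode Yes/No into a 1/0 indicator.
            "smoker": 1 if row["q2_smoker"] == "Yes" else 0,
        })
    return analytic

def load(rows, sink):
    """Load: write the analytic dataset to its destination."""
    writer = csv.DictWriter(sink, fieldnames=["id", "age", "smoker"])
    writer.writeheader()
    writer.writerows(rows)

analytic_rows = transform(extract(raw_export))
analytic_file = io.StringIO()
load(analytic_rows, analytic_file)
```

Documenting each of these three steps (what comes in, what gets recoded and why, and what goes out) is exactly the pipeline documentation that helps stakeholders trace a variable from collection to analysis.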

Final Thoughts

When you make and share documentation that everyone can understand – using basic tools like Microsoft Excel, Word and PowerPoint – then you can increase understanding across stakeholders about the data without actually changing how you share the data. It also “solves the problem” – the next time the issue comes up, you just have to pull up the documentation of the results of your forensics, and remind everyone of the answer.

So the next time you hear the question, “Where is my variable?!” consider making some of the documentation described in this article to try to help you and your stakeholders figure out the answer – once and for all!

Want to kick your health analytics career up a notch? Then click here to sign up for a 30-minute market research Zoom meeting with me, where I will explain a new, exclusive group mentoring program for health data professionals, and get your feedback.

Monika M. Wahi, MPH, CPH is a LinkedIn Learning author of data science courses, the author of a book on how to design and build SAS data warehouses, and the co-author of many peer-reviewed publications. Sign up for her weekly e-newsletter, and follow her blog and YouTube channel for learning resources!
