“Where is my Variable?!”: Data Documentation and Answering User Questions
Sometimes you know about certain data points, but you don't recognize them in the analysis. Hm!

*I may be compensated when you click on my links, but you will not be charged.

Have you ever been in a situation where you are aware of some variables in a dataset – but you don’t recognize them when they are shown to you in an analysis? I’ll give you three examples from my experience.

  1. Research Data Collection Scenario: Professionals involved in data collection (including clinicians) know which measurements they took. Yet when biostatisticians present results, these professionals don’t recognize them. It’s as if the variables they know they measured are no longer familiar.
  2. Data Mart Scenario: Professionals who maintain data source systems also analyze data from a data mart (a small data warehouse) that receives the source datasets and integrates them. However, when these professionals go to the data mart to analyze the data, they are confused, and don’t find the variables familiar. Essentially, they don’t recognize the source dataset variables they know so well once those variables land in the data mart.
  3. Name Change Scenario: Analyst Jennifer runs a SQL database, and takes an extract from it to do research. Jennifer does some analyses in SQL using the native variable names (which are long). Analyst Vena is asked to replicate Jennifer’s analysis, so Jennifer gives Vena access to a SQL view containing the extract she used in her research. Vena copies the data into her SAS environment, but renames all the variables, because many exceed SAS’s limits on variable name length. When Vena presents her replicated analysis to Jennifer, Jennifer doesn’t recognize it and can’t interpret it, because it uses Vena’s SAS field names, not Jennifer’s SQL field names.

How Do You Solve This Problem?

On one hand, you have one big-picture problem; on the other, you have three little-picture problems. The solution to each little-picture problem above is to do forensic research and create documentation that allows all stakeholders to understand what is going on.

  • In Scenario #1, that documentation would unite the understanding of the data collectors and their variables with the results presented by the biostatisticians.
  • In Scenario #2, that documentation would unite the understanding of the people running the data mart and the other team maintaining the source systems and analyzing the data from the data mart.
  • In Scenario #3, Jennifer and Vena would just need to agree on a way of documenting a crosswalk between Jennifer’s SQL field names, and Vena’s SAS field names – again, a matter of doing forensics, and creating documentation that will engender a common understanding among stakeholders.
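For Scenario #3, the crosswalk can be as simple as a two-column lookup that both analysts agree on. Here is a minimal sketch in Python; all of the field names below are invented for illustration (the real SQL names and SAS names would come from Jennifer’s and Vena’s actual systems), and the only real constraint assumed is SAS’s 32-character limit on variable names.

```python
# Hypothetical crosswalk between long SQL field names and the shorter
# SAS field names chosen to fit SAS's 32-character name limit.
# Every name here is made up for illustration.
SQL_TO_SAS = {
    "patient_date_of_first_clinic_visit": "first_visit_dt",
    "patient_primary_insurance_provider": "ins_provider",
    "patient_body_mass_index_at_intake": "bmi_intake",
}

def to_sas_names(record: dict) -> dict:
    """Rename a record's keys from SQL names to SAS names."""
    return {SQL_TO_SAS.get(k, k): v for k, v in record.items()}

def to_sql_names(record: dict) -> dict:
    """Reverse the crosswalk so results map back to the SQL names."""
    sas_to_sql = {v: k for k, v in SQL_TO_SAS.items()}
    return {sas_to_sql.get(k, k): v for k, v in record.items()}
```

Because the mapping is reversible, either analyst can translate the other’s output back into the names she recognizes; in practice you would keep this crosswalk in a shared document (even a simple Excel sheet) rather than only in code.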

What Documentation Do You Make?

I always preface the answer to this question with the caveat that it really depends on the specific scenario. But in a generic sense, there are a few items, which I describe in more detail in my blog post on the topic, that can greatly help unite understanding across stakeholder groups.

Data Dictionary: At the very least, you should have a data dictionary focused on documenting the common variables that are the subject of communication (and potential misunderstanding) among your stakeholder groups. Extensive data dictionaries that include well-documented picklists and crosswalks are encouraged, because they only increase understanding. Here is an example of one component of a data dictionary from a vehicle registration database that I found on Wikimedia Commons.

Here is an example of a data dictionary for a vehicle registration database.
Ethacke1, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons, available here: https://commons.wikimedia.org/wiki/File:Data_Dictionary.png
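If you are starting a data dictionary from scratch, you can even bootstrap a skeleton of it from the data itself and then fill in the human-readable descriptions by hand. Here is a minimal sketch in Python using only the standard library; the field names and values are toy stand-ins for a vehicle registration extract, not real data.

```python
import csv
import io

# Toy rows standing in for a vehicle registration extract;
# the field names and values are invented for illustration.
raw = io.StringIO(
    "plate_no,make,year\n"
    "ABC123,Toyota,2018\n"
    "XYZ789,Ford,2021\n"
)

def dictionary_skeleton(reader):
    """Build a starter data dictionary: one row per variable, with an
    example value and a placeholder description for a human to fill in."""
    rows = list(reader)  # reading the rows also populates fieldnames
    return [
        {
            "variable": name,
            "example": rows[0][name] if rows else "",
            "description": "TODO: describe this variable",
        }
        for name in reader.fieldnames
    ]

skeleton = dictionary_skeleton(csv.DictReader(raw))
```

The output is one entry per variable that you can paste into Excel or Word, where the stakeholders who actually know each variable can replace the placeholder descriptions.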

Well-organized Extract-Transform-Load (ETL) Code and Pipeline Documentation: Many times when I talk to people in the health analytics space about ETL, they indicate they do not know what that is. If you want to learn more about ETL, I encourage you to take my online course in Application Basics. But briefly, ETL is what you do to take raw data you receive from a data provider (such as raw data you download from a SurveyMonkey survey) and turn it into the analytic dataset you use for your research analysis. Below are examples I found on Wikimedia Commons of documentation for data transformation pipelines.

Here is a diagram showing the steps in a pipeline where data are backed up.
Sukari at English Wikipedia, Public domain, via Wikimedia Commons, available here: https://commons.wikimedia.org/wiki/File:Backup-DFD.png?uselang=zh-my
This flow chart shows the steps in data processing for the "COB bucket".
Alexji78, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons, available here: https://upload.wikimedia.org/wikipedia/commons/0/0f/COB_BUCKET_FLOW_CHART.jpg
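To make the extract–transform–load idea concrete, here is a minimal sketch in Python using only the standard library. The raw survey export, its field names (`q1_age`, `q2_smoker`), and the recoding rules are all invented for illustration; a real survey export and a real analytic dataset would look different, but the three-step shape is the same.

```python
import csv
import io

# Stand-in for a raw survey export (e.g., a SurveyMonkey download).
# Field names and values are invented for illustration.
raw_export = io.StringIO(
    "respondent_id,q1_age,q2_smoker\n"
    "1,34,Yes\n"
    "2,not stated,No\n"
)

def extract(source):
    """Extract: read the raw export as-is."""
    return list(csv.DictReader(source))

def transform(rows):
    """Transform: rename fields and recode values for analysis."""
    analytic = []
    for row in rows:
        analytic.append({
            "id": int(row["respondent_id"]),
            # Recode free-text age to an integer; None when missing.
            "age": int(row["q1_age"]) if row["q1_age"].isdigit() else None,
            # Recode Yes/No into a 1/0 indicator.
            "smoker": 1 if row["q2_smoker"] == "Yes" else 0,
        })
    return analytic

def load(rows, sink):
    """Load: write the analytic dataset to its destination."""
    writer = csv.DictWriter(sink, fieldnames=["id", "age", "smoker"])
    writer.writeheader()
    writer.writerows(rows)

analytic_rows = transform(extract(raw_export))
analytic_file = io.StringIO()
load(analytic_rows, analytic_file)
```

Documenting each of these three steps (what comes in, what gets recoded and why, and what goes out) is exactly the pipeline documentation that helps stakeholders trace a variable from collection to analysis.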

Final Thoughts

When you make and share documentation that everyone can understand – using basic tools like Microsoft Excel, Word and PowerPoint – then you can increase understanding across stakeholders about the data without actually changing how you share the data. It also “solves the problem” – the next time the issue comes up, you just have to pull up the documentation of the results of your forensics, and remind everyone of the answer.

So the next time you hear the question, “Where is my variable?!” consider making some of the documentation described in this article to try to help you and your stakeholders figure out the answer – once and for all!

Want to kick your health analytics career up a notch? Then click here to sign up for a 30-minute market research Zoom meeting with me, where I will explain a new, exclusive group mentoring program for health data professionals, and get your feedback.

Monika M. Wahi, MPH, CPH is a LinkedIn Learning author of data science courses, the author of a book on how to design and build SAS data warehouses, and the co-author of many peer-reviewed publications. Sign up for her weekly e-newsletter, and follow her blog and YouTube channel for learning resources!
