“Where is my Variable?!”: Data Documentation and Answering User Questions
Monika Wahi
Epidemiology & Biostatistics Consultant a/k/a Data Scientist | Exclusive and innovative solutions for data science challenges in public health, research and education
*I may be compensated when you click on my links, but you will not be charged.
Have you ever been in a situation where you are aware of some variables in a dataset – but you don’t recognize them when they are shown to you in an analysis? I’ll give you three examples from my experience.
How Do You Solve This Problem?
On one hand, you have the big-picture problem; on the other, you have three little-picture problems. The solution to each little-picture problem above is to do forensic research and create documentation that allows all stakeholders to understand what is going on.
What Documentation Do You Make?
I always preface the answer to this question with the caveat that it really depends on the specific scenario. But in a generic sense, there are a few items that can greatly help unite understanding across stakeholder groups. I describe them in more detail in my blog post on the topic.
Data Dictionary: At the very least, you should have a data dictionary focused on documenting the common variables that are the subject of communication (and potential misunderstanding) among your stakeholder groups. Extensive data dictionaries that include well-documented picklists and crosswalks are encouraged, because these only increase understanding. Here is an example of one component of a data dictionary from a vehicle registration database that I found on Wikimedia Commons.
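To make the idea concrete, here is a minimal sketch in Python of what a small data dictionary with a picklist and a crosswalk could look like. Every name in it (`smoke_status`, `age_years`, `Q1`, `Q7`, and the codes) is hypothetical and invented for illustration; in practice, many teams keep the same information in a spreadsheet.

```python
# Hypothetical data dictionary: every variable name, label, and code below
# is invented for illustration and does not come from a real dataset.
DATA_DICTIONARY = {
    "smoke_status": {
        "label": "Current smoking status",
        "type": "categorical",
        "source_variable": "Q7",  # column name in the raw survey export
        "picklist": {
            1: "Never smoked",
            2: "Former smoker",
            3: "Current smoker",
            9: "Unknown",
        },
    },
    "age_years": {
        "label": "Age in years at time of survey",
        "type": "numeric",
        "source_variable": "Q1",
    },
}


def build_crosswalk(dictionary):
    """Map each raw (source) variable name to its analytic variable name."""
    return {entry["source_variable"]: name for name, entry in dictionary.items()}


def find_variable(raw_name, dictionary):
    """Answer 'Where is my variable?' for a raw column name."""
    analytic_name = build_crosswalk(dictionary).get(raw_name)
    if analytic_name is None:
        return f"{raw_name} is not documented in the data dictionary."
    label = dictionary[analytic_name]["label"]
    return f"{raw_name} is stored as {analytic_name} ({label})."


def decode(variable, code, dictionary):
    """Translate a stored code into its picklist label."""
    picklist = dictionary[variable].get("picklist", {})
    return picklist.get(code, "Undocumented code")


print(find_variable("Q7", DATA_DICTIONARY))
print(decode("smoke_status", 3, DATA_DICTIONARY))
```

The crosswalk is what lets a stakeholder who only knows the survey question number find the renamed variable in the analytic dataset, and the picklist is what lets them read the codes stored in it.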
Well-organized Extract-Transform-Load (ETL) Code and Pipeline Documentation: When I talk to people in the health analytics space about ETL, many of them tell me they do not know what it is. If you want to learn more about ETL, I encourage you to take my online course in Application Basics. But briefly, ETL is what you do to take raw data you receive from a data provider (such as raw data you download from a SurveyMonkey survey) and turn it into the analytic dataset you use for your research analysis. Below are examples I found on Wikimedia Commons of data processing documentation that appears to describe some sort of data transformation pipeline.
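Separately from those diagrams, the extract, transform, and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw export, the column names (`RespondentID`, `Q1`, `Q7`), and the rename map are all hypothetical, and a real pipeline would read from and write to files or a database. The point is that each transformation is written to a log, which becomes documentation you can share.

```python
import csv
import io

# Hypothetical raw export, standing in for a file downloaded from a survey tool.
RAW_CSV = """RespondentID,Q1,Q7
101,34,1
102,,3
103,51,2
"""

# Hypothetical mapping from raw column names to analytic variable names.
RENAME_MAP = {"RespondentID": "respondent_id", "Q1": "age_years", "Q7": "smoke_status"}


def extract(raw_text):
    """Extract: read the raw export into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows, log):
    """Transform: rename columns and convert types, logging every change."""
    cleaned = []
    for row in rows:
        new_row = {RENAME_MAP[old]: value for old, value in row.items()}
        # Blank ages become missing values rather than zero.
        new_row["age_years"] = int(new_row["age_years"]) if new_row["age_years"] else None
        new_row["smoke_status"] = int(new_row["smoke_status"])
        cleaned.append(new_row)
    for old, new in RENAME_MAP.items():
        log.append(f"Renamed {old} to {new}")
    log.append("Converted age_years and smoke_status to integers; blank age_years set to missing")
    return cleaned


def load(rows):
    """Load: write the analytic dataset as CSV text (a stand-in for a file or table)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(RENAME_MAP.values()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_pipeline(raw_text):
    """Run all three steps and return the rows, the CSV output, and the log."""
    log = []
    raw_rows = extract(raw_text)
    log.append(f"Extracted {len(raw_rows)} rows")
    analytic_rows = transform(raw_rows, log)
    output_csv = load(analytic_rows)
    log.append(f"Loaded {len(analytic_rows)} rows into the analytic dataset")
    return analytic_rows, output_csv, log


rows, output_csv, log = run_pipeline(RAW_CSV)
for entry in log:
    print(entry)
```

When someone asks where `Q7` went, the log gives the answer: it was renamed to `smoke_status` during the transform step.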
Final Thoughts
When you make and share documentation that everyone can understand – using basic tools like Microsoft Excel, Word and PowerPoint – then you can increase understanding across stakeholders about the data without actually changing how you share the data. It also “solves the problem” – the next time the issue comes up, you just have to pull up the documentation of the results of your forensics, and remind everyone of the answer.
So the next time you hear the question, “Where is my variable?!” consider making some of the documentation described in this article to try to help you and your stakeholders figure out the answer – once and for all!
Want to kick your health analytics career up a notch? Then click here to sign up for a 30-minute market research Zoom meeting with me, where I will explain a new, exclusive group mentoring program for health data professionals, and get your feedback.
Monika M. Wahi, MPH, CPH is a LinkedIn Learning author of data science courses, a book on how to design and build SAS data warehouses, and the co-author of many peer-reviewed publications. Sign up for her weekly e-newsletter, and follow her blog and YouTube channel for learning resources!