Data exploration for cleaning data!
Dr. Abhishek Kadam
Applying automation, data science, AI and ML to simplify clinical data management.
Hey Data Managers,
Yet another simplification. But this time around I need you to experiment a bit and post the experience in the comments.
This is about visual data exploration. In a data science project, a lot of time is spent on data exploration. This is often because the data belongs to a field of work that the data science team is not familiar with, or because the business team may not be able to provide adequate information on their data. There could be many such reasons.
For clinical data managers, data exploration brings a new way of looking at familiar data. A data manager can easily find patterns, anomalies, outliers, and sentiment in the data they work with daily.
Why take the trouble of exploring data?
Well, I have lived the life of a clinical data manager, and I can tell you that I missed opportunities to deliver value to my stakeholders, i.e. the study team, by not knowing the data beyond the data discrepancies. Exploring data helps provide that value to your study team. Just imagine alerting the study team to a trend developing at a specific site, an emerging trend of under-reporting, or a view of outliers on the data points that matter most. They will value these inputs and look up to a clinical data manager as a clinical data consultant, a clinical data scientist!
How to do this?
Providing this value is simple. If you have the right skills, all it takes is to write and validate a few small pieces of code to explore the data. The presumption here is that the clinical data manager knows how to differentiate categorical and continuous data, and knows from the study protocol which data points are the most critical.
Let's take an example of clinical categorical data. This is typically explored as counts: the frequency of each category is measured. Counts help you find a possible "class imbalance" early, e.g. the data showing a much higher number of females than males taking part in the trial when the trial is designed to have equal numbers in both classes. It is possible that someone in the trial team already checks this. Chances are no one does it proactively.
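As a minimal sketch of the counting idea above, here is how category frequencies could be checked with pandas. The data frame, the column name SEX and the values are invented for illustration; your demography dataset will have its own names.

```python
import pandas as pd

# Hypothetical demography data, made up for illustration only.
demog = pd.DataFrame({
    "SUBJID": ["001", "002", "003", "004", "005", "006"],
    "SEX": ["F", "F", "M", "F", "F", "M"],
})

# Absolute frequency of each category.
counts = demog["SEX"].value_counts()

# Relative frequency (share of subjects) of each category.
shares = demog["SEX"].value_counts(normalize=True).round(2)

# Put both side by side; the two Series align on the category labels.
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

If the protocol expects a roughly even split, a share far from 0.5 is a signal worth raising with the study team.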
Continuous data can be explored and presented well as summaries. I refer to these summaries as a "five point summary". It gives a quick view of a data set in terms of mean, median, standard deviation, minimum and maximum. Just a crisp table of such summaries for the vital data points, refreshed periodically, could be of great value to the study teams.
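A summary table like the one described above could be sketched as follows. The column names SYSBP, DIABP and BMI and all the values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical vitals data, made up for illustration only.
vitals = pd.DataFrame({
    "SYSBP": [118, 122, 130, 125, 160],
    "DIABP": [78, 80, 85, 82, 95],
    "BMI": [22.5, 24.0, 27.3, 23.1, 41.0],
})

# One row per data point, one column per statistic.
summary = vitals.agg(["mean", "median", "std", "min", "max"]).T.round(2)
print(summary)
```

Running the same few lines on each data cut gives the periodic view without any extra effort.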
There are many such interesting ways to view the data.
Here are some commonly used code chunks for data exploration.
Let's explore data
Importing the relevant packages in Python.
Libraries used: Pandas, NumPy, Matplotlib and Seaborn. These should be sufficient to try looking at data differently.
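The imports could look like the sketch below. The aliases pd, np, plt and sns are the community conventions, not something this article prescribes, and the block assumes all four libraries are installed in your environment.

```python
# Conventional aliases for the four libraries named above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Print the installed versions so results can be reproduced later.
print("pandas", pd.__version__)
print("numpy", np.__version__)
print("seaborn", sns.__version__)
```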
Reading the clinical dataset, e.g.
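Since I cannot share a real study file, here is a sketch that reads a tiny made-up CSV from a string. The column names and values are invented; with a real export you would pass the file path to pd.read_csv instead of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical vitals extract, made up for illustration only.
csv_text = """SUBJID,SEX,VISIT,BMI
001,F,SCREENING,22.5
002,M,SCREENING,27.3
003,F,SCREENING,24.0
004,M,SCREENING,41.0
"""

# Read subject IDs as text so that leading zeros are preserved.
vitals = pd.read_csv(io.StringIO(csv_text), dtype={"SUBJID": str})

print(vitals.head())   # first few rows
print(vitals.dtypes)   # data type of each column
```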
Let us now dive into variable-level data exploration. All the code here is illustrative; used with appropriate logic, it can do wonders in identifying trends and peculiarities in the dataset.
vitals["BMI"].describe()
This simple code will give you a summary table of count, mean, standard deviation, min, max, and the 25th, 50th and 75th percentiles. The 50th percentile is the median.
The above can easily be visualised using the following code.
sns.boxplot(x=vitals["BMI"])
This code will show you the outliers in the dataset for BMI values.
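A box plot shows the outliers as points, but not which subjects they belong to. The sketch below lists them using the 1.5 x IQR rule, which is the default whisker length in seaborn's boxplot. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical vitals data, made up for illustration only.
vitals = pd.DataFrame({
    "SUBJID": ["001", "002", "003", "004", "005", "006"],
    "BMI": [22.5, 27.3, 24.0, 41.0, 23.1, 25.2],
})

# Quartiles and interquartile range of BMI.
q1 = vitals["BMI"].quantile(0.25)
q3 = vitals["BMI"].quantile(0.75)
iqr = q3 - q1

# Fences beyond which a value is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows whose BMI falls outside the fences.
outliers = vitals[(vitals["BMI"] < lower) | (vitals["BMI"] > upper)]
print(outliers)
```

An outlier flagged this way is a prompt for review, not proof of an error; whether it deserves a query depends on the protocol.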
If you have a composite data set of vitals and demography, you can split the data by gender. If you have subject characteristics as part of your data set, you can look at the data by the relevant subject characteristics.
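Splitting by gender could be sketched with a pandas groupby, as below. The combined data frame and its column names SEX and BMI are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical combined vitals and demography data, for illustration only.
vitals = pd.DataFrame({
    "SUBJID": ["001", "002", "003", "004", "005", "006"],
    "SEX": ["F", "M", "F", "M", "F", "M"],
    "BMI": [22.5, 27.3, 24.0, 41.0, 23.1, 25.2],
})

# Summary statistics of BMI for each gender.
by_sex = (
    vitals.groupby("SEX")["BMI"]
    .agg(["count", "mean", "median", "min", "max"])
    .round(2)
)
print(by_sex)
```

For the visual version of the same split, sns.boxplot(data=vitals, x="SEX", y="BMI") draws one box per gender. Replace SEX with any other subject characteristic column to slice the data differently.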
I have just scratched the surface; there is more that you can do. You will need a Jupyter notebook to experiment with the code above and play around.
Go ahead, try it out and let me know what else you could explore.
Reskill to transform! Stay Relevant! Lead with empathy