Data exploration for cleaning data!

Dr. Abhishek Kadam

Applying automation, data science, AI and ML to simplify clinical data management.

Published: March 10, 2022

Hey Data Managers,

Yet another simplification. But this time around I need you to experiment a bit and post the experience in the comments.

This is about visual data exploration. In a data science project, a lot of time is spent on data exploration. Often the data belongs to a field the data science team is not familiar with, or the business team cannot provide adequate information about its data. There could be many such reasons.

For clinical data managers, data exploration brings a new way of looking at familiar data. A data manager can easily find patterns, anomalies, outliers and sentiment in the data they work with daily.

Why take the trouble of exploring data?

Well, I have lived the life of a clinical data manager, and I can tell you that I missed opportunities to deliver value to my stakeholders, i.e. the study team, by not knowing the data beyond the data discrepancies. Exploring data helps you provide that value to your study team. Just imagine alerting the study team to a trend developing at a specific site, an emerging trend of under-reporting, or a view of outliers in the data points that matter most. They will value these inputs and look up to a clinical data manager as a clinical data consultant, a clinical data scientist!

How to do this?

Providing this value is simple. If you have the right skills, all it takes is to write and validate a few small pieces of code to explore the data. The presumption here is that the clinical data manager knows how to differentiate categorical and continuous data, and knows from the study protocol which data points are the most critical.

Let's take an example of clinical categorical data. This is typically explored as counts: the frequency of each category is measured. This helps you find out early about a possible "class imbalance", e.g. the data showing many more females than males taking part in the trial when the trial is designed to have equal numbers in both classes. It is possible that someone in the trial team already checks this. Chances are no one does it proactively.
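As a rough illustration of counting categories, here is a minimal sketch. The table, the SUBJID and SEX column names, and the 10-percentage-point tolerance are all made up for this example; your study data and protocol will define the real ones.

```python
import pandas as pd

# Hypothetical demography extract (column names and values are illustrative only)
demog = pd.DataFrame({
    "SUBJID": ["001", "002", "003", "004", "005", "006", "007", "008"],
    "SEX": ["F", "F", "M", "F", "F", "F", "M", "F"],
})

# Frequency of each category, as counts and as proportions
sex_counts = demog["SEX"].value_counts()
sex_share = demog["SEX"].value_counts(normalize=True)

# Assumed rule of thumb: flag if any class is more than 10 percentage
# points away from the planned 50/50 split
tolerance = 0.10
imbalanced = (sex_share - 0.5).abs().max() > tolerance

print(sex_counts)
print(sex_share.round(2))
print("Possible class imbalance:", imbalanced)
```

Run periodically, a check like this could surface an enrolment imbalance long before database lock.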

Continuous data can be explored and presented well as summaries. I refer to these summaries as a "five point summary". It gives a quick view of a dataset in terms of mean, median, standard deviation, min and max. Just a crisp table of such summaries for the vital data points, refreshed periodically, could be of great value to the study teams.

There are many such interesting ways to view the data.
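One possible way to build such a summary table is sketched below. The SYSBP, DIABP and PULSE columns and their values are invented for illustration; substitute the critical data points from your own protocol.

```python
import pandas as pd

# Hypothetical vital signs extract (names and values are illustrative only)
vitals_demo = pd.DataFrame({
    "SYSBP": [118, 122, 130, 141, 119, 125],
    "DIABP": [78, 80, 85, 92, 76, 81],
    "PULSE": [68, 72, 75, 88, 70, 74],
})

# One row per data point, one column per statistic
summary = vitals_demo.agg(["mean", "median", "std", "min", "max"]).T.round(1)

print(summary)
```

Transposing with .T puts each vital sign on its own row, which tends to read better when the table is shared with a study team.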

Here are some commonly used code chunks for data exploration.

Let's explore data

Importing the relevant packages in Python.

Libraries used: pandas, NumPy, Matplotlib and Seaborn. These should be sufficient to try looking at data differently.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as stats

Reading the clinical dataset, e.g.:

Vitals = pd.read_csv('vitals.csv')
Vitals.head() - Specify a number in the brackets to see that many rows. By default, the first five are shown.
Vitals_shape = Vitals.shape - Vitals_shape holds the output (x, y), where x is the number of rows and y is the number of columns.
Vitals.info() - This simple line of code shows the data type of each variable in the Vitals dataset (integer, float, object, etc.). It also gives the count of non-missing values per variable.
Vitals.isnull() - This code marks every missing value in the dataset as True, so you can see whether any values are missing. Append .sum() to get the number of missing values per variable.
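To turn the missing-value check into something you can act on, a sketch like the following may help. It uses a small made-up table instead of vitals.csv, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical extract with a few missing values (illustrative only)
vitals_demo = pd.DataFrame({
    "SUBJID": ["001", "002", "003", "004"],
    "BMI": [22.4, np.nan, 27.1, 31.0],
    "PULSE": [72, 80, np.nan, np.nan],
})

# Number of missing values per variable
missing_per_column = vitals_demo.isnull().sum()

# Records with at least one missing value, e.g. as a starting point for queries
rows_with_missing = vitals_demo[vitals_demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```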

Let us now dive into variable-level data exploration. All the code shown is illustrative; used with appropriate logic, it can do wonders in identifying trends and peculiarities in the dataset.

Vitals["BMI"].describe() - This simple code gives you a summary table of count, mean, standard deviation, min, max, and the 25th, 50th and 75th percentiles. The 50th percentile is the median.

The above can be easily visualised using the following code.

sns.boxplot(x=Vitals["BMI"])

This code draws a box plot of the BMI values. Points plotted beyond the whiskers are the potential outliers.
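If you want the outliers as a list of records rather than dots on a chart, one option is to apply the same 1.5 x IQR rule that a box plot uses by default for its whiskers. The data below is invented, and whether a flagged value is a true error is still a clinical judgement.

```python
import pandas as pd

# Hypothetical BMI values with one suspicious entry (illustrative only)
vitals_demo = pd.DataFrame({
    "SUBJID": ["001", "002", "003", "004", "005", "006", "007", "008"],
    "BMI": [21.5, 23.0, 24.2, 25.1, 26.3, 27.0, 22.8, 48.9],
})

# Interquartile range and the 1.5 x IQR fences
q1 = vitals_demo["BMI"].quantile(0.25)
q3 = vitals_demo["BMI"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Records falling outside the fences
outliers = vitals_demo[(vitals_demo["BMI"] < lower) | (vitals_demo["BMI"] > upper)]

print(outliers)
```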

If you have a composite dataset of vitals and demography, you can split the data by gender. If subject characteristics are part of your dataset, you can look at the data by the relevant subject characteristics.

I have just scratched the surface; there is more that you can do. You will need a Jupyter notebook to experiment with the code above and play around.

Go ahead, try it out and let me know what else you could explore.
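A split by gender could look like the sketch below. It assumes the vitals and demography data have already been combined into one table; the SEX and BMI column names and all values are hypothetical.

```python
import pandas as pd

# Hypothetical combined vitals + demography table (illustrative only)
combined = pd.DataFrame({
    "SUBJID": ["001", "002", "003", "004", "005", "006"],
    "SEX": ["F", "M", "F", "M", "F", "M"],
    "BMI": [22.0, 27.5, 24.0, 29.5, 23.0, 30.0],
})

# Summary of BMI within each gender group
bmi_by_sex = combined.groupby("SEX")["BMI"].agg(["count", "mean", "median", "min", "max"])

print(bmi_by_sex)
```

The same split can be drawn as side-by-side box plots with sns.boxplot(data=combined, x="SEX", y="BMI").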

Reskill to transform! Stay Relevant! Lead with empathy
