Data Cleaning Challenge
Gbenga Obikoya
Data Analysis Ninja: Transforming Raw Data into Actionable Insights Across Industries | Data Analyst with a Clinical Edge| Public Speaker | Content Creator
This is a data cleaning challenge created by GreenData Solutions . I decided to join due to my love for Excel and Power Query, because I want to help others and because I have not participated in a data cleaning challenge. Thanks to Chisom Ibemere for this initiative.
Project Description:
You are hired as a Data Analytics Consultant in a healthcare system. Your primary responsibility is to analyze and clean messy data related to patients' health records. This involves identifying and addressing inconsistencies, missing values, and formatting errors, all of which are commonly encountered in real-world datasets.
Tools Used: Excel & Power Query
Dataset:
The dataset is a healthcare data provided in csv format. The data contains 1,659 rows and 12 columns.
Below is the data dictionary provided:
Here is a snapshot of the original dirty data:
Let's get started.
I take it as a principle to carry out 2 steps before data cleaning proper:
Step 1: Create a back-up of the original dataset
This first thing I did was to create a copy of the original dataset. This creates a backup for the original dataset
Step 2: Convert the dataset to a table
Converting a dataset into a table in Excel before starting data cleaning offers several advantages which include:
Cool right? Let's start cleaning
Note: The scope of this cleaning would be column-wise which I see to be the best technique. Therefore, I would be taking the columns one after the other.
Cleaning ID column
From the data dictionary above. It is stated that the ID is a unique identifier of each indivual or record.
Assumption note: I took each record as unique to each patient. In other words, I assume this to be a data of patients in an healthcare facility.
As a result, the ID is the most important column in this dataset.
From above, there are 286 cases of duplicated IDs. However, after further exploration, I discovered that many of these duplicated IDs have different names, age and other variables. Perhaps they are not true duplicates afterall.
This would only mean that the ID are not unique identifiers of each individual or patient since we now have 2 different individuals (i.e with different demographics) that have the same ID.
In essence, these duplicate IDs does not translate to duplicte individuals.
But still, are there true duplicates? That is, duplicate IDs with the same demographics (eg, Name, age etc)?
I will apply the same formatting to the Name column and filter by color,
BUT we need to clean the Name column first, the format is quite ugly and can confuse Excel, leading to a bit of frustration on my part. I hate to be frustrated :).
Cleaning the Name Column:
As we can see from the dirty data above above that the Name column is pretty ugly.
The goal is have the names in Proper style and slit it into two columns: First Name and Last Name as this helps make easier analysis.
Assumption note: I assume that the first name and the second name in the column are the First and Last name of the individuals.
This step and subsequent cleaning are done in Power Query for easier and faster process.
Below are the following steps I took to clean the Name column:
Assumption Note:
Cleaning the Education Column:
There are 5 unique educational levels in the dataset: High school, Associate, Bachelors, Masters and PhD in ascending order.
I therefore transformed all values to their corresponding levels.
领英推荐
Cleaning the Age Column:
This column is quite funny. It shows ages from 0,1,2 all the way to maximum age of 101.
It is evident that there are clear inputation errrors in this column with indivuals aged 0, 1, 2... years less than 10yrs having Masters and PhD as highest level of education.
In a real world situation: I would approach the owners of the data and request for an explanation or perhaps the correct ages.
However, in this case, here are the steps I carried out:
Identify the Issue: Filtered the dataset by the Age column to identify implausible entries (e.g., ages under 18 paired with advanced education levels like Masters or PhD). I used the Education column to determine or predict the right age of the individuals.
Assumption note:
I will carry out this operation with Excel formula and therefore I need to make an adjusment which is to replace blanks and O's with 1.
This would help prevent error such as DIV/0!.
Here is the formula I used to predict the ages based on Education:
=IF([@Age]<17, IF([@Education]= "Unknown", AVERAGE([Age]),AVERAGEIFS([Age],[Education],[@Education])),[@Age])
And there you have it. No blanks, No abnormal age.
Cleaning the Gender Column:
Cleaning the City Column:
There are 3 unique cities in this dataset. I assume that the rest are simply abbreviations.
No missing values.
I would replace abbreviations with full city names for consistency.
Cleaning the Blood Type Column:
The Blood type column has all unique blood types with no inconsistencies. However, it contins 8 missing values.
Assumption note: I assume that missing data here shows individuals who are not aware of their blood group.
This information can give insight into the level of health awareness of individuals. I therefore would not remove the blanks but replace with "Unknown".
Cleaning the Employment Status Column:
The values here are a bit too granulr with an employment status having extra informtion such as Employed(Contract), Student(Part-time).
My goal is to categorize the data into 4 main groups:
Cleaning the Salary Column:
The Salary column contain figures, some with the salary type in bracket.
Assumption note: I assume that the values in the bracket states the source of the income.
My goal here is to seperate the salary from the source of income.
The Source of Income column now has blanks, I replaced them with "Unspecified".
I removed he $ signs from salary, replaced blanks with 0 and then reset the datatype to currency ($).
The date of admission Column:
Most of the dat here are in number format while few are text formatted.
All that was needed was to move the table to Power Query and format as date. This formatted all the values in the column as date.
Phew!!
It's a ride so far.
Here is a snapshot of the cleaned data, now with 13 columns and 1,656 rows.
This data is hereby declared clean and well formatted for indepth analysis.
Thanks for following through.
Gracias :)
| Data Analyst | Proficient in Excel, SQL, Power BI, SPSS | Research & Data-Driven Insights Expert|
1 个月Love this Thanks for sharing
Statistics | SPSS | Excel | Data Analyst.
1 个月Well done brother, I so much in love with your presentation, all points are well understood. I am also a participant in this challenge, and I really enjoyed it.
Data Analyst | Financial analyst | BI | Excel & Power BI Developer | Tableau | SQL
1 个月Insightful. Love the presentation. It's top-notch
Data Analyst |Data Cleaning & Exploratory with Excel, Power BI, SQL | Data Visualization | IT Operations & Management
1 个月Got hooked on your story telling. Thank you