Data Cleaning Challenge

Data Cleaning Challenge

This is a data cleaning challenge created by GreenData Solutions . I decided to join due to my love for Excel and Power Query, because I want to help others and because I have not participated in a data cleaning challenge. Thanks to Chisom Ibemere for this initiative.

Project Description:

You are hired as a Data Analytics Consultant in a healthcare system. Your primary responsibility is to analyze and clean messy data related to patients' health records. This involves identifying and addressing inconsistencies, missing values, and formatting errors, all of which are commonly encountered in real-world datasets.

Tools Used: Excel & Power Query

Dataset:

The dataset is a healthcare data provided in csv format. The data contains 1,659 rows and 12 columns.

Below is the data dictionary provided:

Image I

Here is a snapshot of the original dirty data:

Image II
Let's get started.

I take it as a principle to carry out 2 steps before data cleaning proper:

Step 1: Create a back-up of the original dataset

This first thing I did was to create a copy of the original dataset. This creates a backup for the original dataset

Step 2: Convert the dataset to a table

Converting a dataset into a table in Excel before starting data cleaning offers several advantages which include:

  • Structured Data: Tables help organize data into a structured format, making it easier to work with and analyze.
  • Auto-Formatting: Excel tables automatically format data, applying consistent font, alignment, and border styles. This naturally makes me feel good throughout the process.
  • Easy Filtering and Sorting: Tables enable quick filtering and sorting of data, allowing you to focus on specific subsets or identify patterns. I bet to say this is the most important to data cleaning.

Table Format of Dirty Data
Cool right? Let's start cleaning

Note: The scope of this cleaning would be column-wise which I see to be the best technique. Therefore, I would be taking the columns one after the other.


Cleaning ID column

From the data dictionary above. It is stated that the ID is a unique identifier of each indivual or record.

Assumption note: I took each record as unique to each patient. In other words, I assume this to be a data of patients in an healthcare facility.

As a result, the ID is the most important column in this dataset.

  • Confirm the uniqueness of the ID: First step is to confirm the uniqueness of the ID as stated in the data dictionary. I did this by conditionally formatting the duplicates by color as seen below:

Image 1.0

From above, there are 286 cases of duplicated IDs. However, after further exploration, I discovered that many of these duplicated IDs have different names, age and other variables. Perhaps they are not true duplicates afterall.

This would only mean that the ID are not unique identifiers of each individual or patient since we now have 2 different individuals (i.e with different demographics) that have the same ID.

In essence, these duplicate IDs does not translate to duplicte individuals.

But still, are there true duplicates? That is, duplicate IDs with the same demographics (eg, Name, age etc)?

I will apply the same formatting to the Name column and filter by color,

BUT we need to clean the Name column first, the format is quite ugly and can confuse Excel, leading to a bit of frustration on my part. I hate to be frustrated :).

Cleaning the Name Column:

As we can see from the dirty data above above that the Name column is pretty ugly.

The goal is have the names in Proper style and slit it into two columns: First Name and Last Name as this helps make easier analysis.

Assumption note: I assume that the first name and the second name in the column are the First and Last name of the individuals.

This step and subsequent cleaning are done in Power Query for easier and faster process.

Below are the following steps I took to clean the Name column:

Image 2.0- Capitalizing Name Column

  1. columnTransform the column to Proper Case (Transform > Capitalize in Power Query). (Image 2.0)
  2. Split the column at every occurence of the delimiter (space). It splits into 4. (Image 2.1)
  3. The four splits occurs due to presence of some titles like Dr., Ms, Mr., Jnr, Phd, Ii etc which we will remove by replacing them with nothing. It is important to mtch entire cell content.
  4. Merge the columns into one and Trim.
  5. Checking True duplicate records: Back to the IDs. Now that we have the name cleaned. I formatted both columns, filter by color and sorted Names.
  6. From Image 2.2, There are 3 true duplicate IDs.
  7. I removed the duplicates and Splitted the column once again by delimiter.
  8. I then renamed the columns as First Name and Last Name.

Image 2.1- Columns splits into 4
Image 2.2- 3 True duplicate IDs are pesent

Assumption Note:

  1. I therefore make a strong assumption that the IDs are not unique identifiers of indivuals or records in this dataset.
  2. From the image below, the first split contains numerous duplicate names and there, I assume that the first column is the last name (Family Name) while the second is the first name as it is much more unique.

Repeting Surnames for possibly family members

Cleaning the Education Column:

There are 5 unique educational levels in the dataset: High school, Associate, Bachelors, Masters and PhD in ascending order.

I therefore transformed all values to their corresponding levels.

Cleaning the Age Column:

This column is quite funny. It shows ages from 0,1,2 all the way to maximum age of 101.

It is evident that there are clear inputation errrors in this column with indivuals aged 0, 1, 2... years less than 10yrs having Masters and PhD as highest level of education.

In a real world situation: I would approach the owners of the data and request for an explanation or perhaps the correct ages.

However, in this case, here are the steps I carried out:

Identify the Issue: Filtered the dataset by the Age column to identify implausible entries (e.g., ages under 18 paired with advanced education levels like Masters or PhD). I used the Education column to determine or predict the right age of the individuals.

Assumption note:

  1. First, I assume that this dataset is for US citizens.
  2. Since the Lowest level of education in the dataset is High School, I assume that the lowest age allowed to be in the dataset should 17yrs, the High School grduating age in the United States.

I will carry out this operation with Excel formula and therefore I need to make an adjusment which is to replace blanks and O's with 1.

This would help prevent error such as DIV/0!.

Here is the formula I used to predict the ages based on Education:

=IF([@Age]<17, IF([@Education]= "Unknown", AVERAGE([Age]),AVERAGEIFS([Age],[Education],[@Education])),[@Age])        

And there you have it. No blanks, No abnormal age.

Cleaning the Gender Column:

  • The gender column has 1,626 missing values out of 1,657 which is over 90% missing data.
  • I would therefore remove this column.

Cleaning the City Column:

There are 3 unique cities in this dataset. I assume that the rest are simply abbreviations.

No missing values.

I would replace abbreviations with full city names for consistency.

Cleaning the Blood Type Column:

The Blood type column has all unique blood types with no inconsistencies. However, it contins 8 missing values.

Assumption note: I assume that missing data here shows individuals who are not aware of their blood group.

This information can give insight into the level of health awareness of individuals. I therefore would not remove the blanks but replace with "Unknown".

Cleaning the Employment Status Column:

The values here are a bit too granulr with an employment status having extra informtion such as Employed(Contract), Student(Part-time).

My goal is to categorize the data into 4 main groups:

  • Employed
  • Unemployed
  • Retired
  • Student

Cleaning the Salary Column:

The Salary column contain figures, some with the salary type in bracket.

Assumption note: I assume that the values in the bracket states the source of the income.

My goal here is to seperate the salary from the source of income.

The Source of Income column now has blanks, I replaced them with "Unspecified".

I removed he $ signs from salary, replaced blanks with 0 and then reset the datatype to currency ($).

  • The Health Condition Column has few blanks. I replaced them with Unknown.
  • Credit Score Column: I corrected typos and replaced None with N/A.


The date of admission Column:

Most of the dat here are in number format while few are text formatted.

All that was needed was to move the table to Power Query and format as date. This formatted all the values in the column as date.

Well formatted date

Phew!!

It's a ride so far.

Here is a snapshot of the cleaned data, now with 13 columns and 1,656 rows.

This data is hereby declared clean and well formatted for indepth analysis.

Thanks for following through.

Gracias :)


Chika Mbah

| Data Analyst | Proficient in Excel, SQL, Power BI, SPSS | Research & Data-Driven Insights Expert|

1 个月

Love this Thanks for sharing

回复
Taiwo Tijani

Statistics | SPSS | Excel | Data Analyst.

1 个月

Well done brother, I so much in love with your presentation, all points are well understood. I am also a participant in this challenge, and I really enjoyed it.

回复
James Isaac

Data Analyst | Financial analyst | BI | Excel & Power BI Developer | Tableau | SQL

1 个月

Insightful. Love the presentation. It's top-notch

Wasem J. Anosike

Data Analyst |Data Cleaning & Exploratory with Excel, Power BI, SQL | Data Visualization | IT Operations & Management

1 个月

Got hooked on your story telling. Thank you

要查看或添加评论,请登录

Gbenga Obikoya的更多文章

社区洞察

其他会员也浏览了