How to Clean Your Data

I’m sure we’ve all heard it: data is the new oil. But should we use data exactly as we find it? Let’s talk about how to approach raw data and get the most out of it.

Dirty data often leads to poor business decisions, and poor decisions result in losses, the very thing we are trying to avoid.

According to IBM estimates, poor data quality results in losses of up to $3.1 trillion per year in the US alone.

Extrapolate that to the rest of the world and the losses are astronomical! Cleaning data, therefore, becomes an inescapable part of the process. I say it’s the most important part. It is definitely the most boring of your data tasks, and yet without it you will suffer down the line.

What is Data Cleaning?

Whenever we come across raw data, it likely contains faulty information. This could be the result of human error or of scraping data from the web. Sometimes, combining data from different sources also produces dirty data. Take, for example, a form filled out online. A person, just for the heck of it, might enter the letter ‘J’ in the gender field. If the form only expects Male or Female, ‘J’ falls into neither category. Using such an observation would distort our results, so it is important to deal with issues like this at the very beginning.
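As a quick illustration, here is a minimal pandas sketch of spotting entries that fall outside the expected categories; the column name ‘gender’ and the allowed values are assumptions made for this example, not part of any real dataset.

```python
import pandas as pd

# Hypothetical survey responses; the 'gender' column and its allowed
# values are invented purely for this illustration.
df = pd.DataFrame({"gender": ["Male", "Female", "J", "Female", "Male"]})

allowed = {"Male", "Female"}

# Flag any response that is not one of the expected categories.
invalid = df[~df["gender"].isin(allowed)]
print(invalid)
```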

Data cleaning is the process of identifying inconsistencies in a dataset and dealing with them accordingly. There is no single standard way to clean data, but there are steps that apply to almost all datasets. Let’s get into them.

Step One: Remove Duplicate Observations

Assuming your dataset is in CSV format, you might have rows that are duplicated. Your first task is to identify these rows and remove them. Duplicates usually arise when you combine data from different places; for instance, a user might have taken a survey twice. To maintain accuracy, we cannot treat these as two different observations. Whether you are using Excel, Python or any other data analysis software, removing duplicates should be the first step in data cleaning.
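In pandas this is a one-liner. Here is a minimal sketch, where ‘survey.csv’ is just a placeholder file name:

```python
import pandas as pd

# Load the raw data; 'survey.csv' is a placeholder file name.
df = pd.read_csv("survey.csv")

# Count how many fully identical rows exist before removing them.
print(f"Duplicate rows: {df.duplicated().sum()}")

# Keep the first occurrence of each duplicated row and drop the rest.
df = df.drop_duplicates(keep="first")
```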

While you are getting rid of duplicate observations, it is wise to get rid of anomalies as well. Anomalies are observations in the dataset that do not fit within the parameters you are studying. For instance, if your data is about animals but you find an observation describing a fridge, you have to remove it, because it is not consistent with the problem under study.
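A small, assumed example: if the dataset had a ‘category’ column (hypothetical, invented for this sketch), out-of-scope rows could be filtered out like this:

```python
import pandas as pd

# Hypothetical data mixing relevant and irrelevant observations.
df = pd.DataFrame({
    "name": ["cat", "dog", "fridge", "horse"],
    "category": ["animal", "animal", "appliance", "animal"],
})

# Keep only the observations that belong to the problem under study.
df = df[df["category"] == "animal"]
print(df)
```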

Step Two: Deal with Structural Errors

It’s important that your data is consistent. For instance, if we are measuring weight in kilograms, an observation recorded in grams will throw us off our game. At this stage we deal with problems such as misspellings, mislabelled classes and inconsistent naming conventions. Dates are especially important here; they should follow one uniform format.
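As a rough sketch of this kind of standardisation (all column names, units and dates here are invented for the example), pandas makes it straightforward:

```python
import pandas as pd

# Hypothetical measurements with inconsistent labels, units and date strings.
df = pd.DataFrame({
    "unit":   ["kg", "KG ", "g", "kg"],
    "weight": [70.0, 65.0, 80000.0, 72.5],
    "date":   ["05/01/2022", "06/01/2022", "07/01/2022", "08/01/2022"],
})

# Standardise the label text: strip whitespace and lower-case it.
df["unit"] = df["unit"].str.strip().str.lower()

# Convert grams to kilograms so every weight uses the same unit.
grams = df["unit"] == "g"
df.loc[grams, "weight"] = df.loc[grams, "weight"] / 1000
df.loc[grams, "unit"] = "kg"

# Parse dates into proper datetimes using one explicit format.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
print(df)
```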

Incomplete observations should also be dealt with if optimum results are to be achieved; handling missing values in detail is the subject of Step Four below.

Step Three: Deal with Outliers

This is a tricky one to handle. Outliers are observations that seem to be way off the normal in a dataset. For instance, if we are dealing with teachers’ salaries in a certain state, where the average is $80,000, one observation could read $800,000. Now, it could be that there is an actual teacher in that state who earns that amount, or it could be a typo. It could even be someone joking around. Outliers have to be handled on a case-by-case basis. Sometimes there are legitimate reasons to keep them, while in other cases it’s best to get rid of them. Perhaps you can keep them in the EDA phase of the analysis but drop them in the modelling phase. This way, you can explore why the outlier is there without letting it ruin your model.
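One common way to flag candidates for review (a sketch, not the only approach; the salary figures here are made up) is the interquartile-range rule of thumb:

```python
import pandas as pd

# Hypothetical teacher salaries, including one suspicious value.
salaries = pd.Series([78000, 80000, 81000, 79500, 82000, 800000], name="salary")

# Flag values far outside the interquartile range (a common rule of thumb).
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]

# Inspect these during EDA, then decide case by case whether to keep them.
print(outliers)
```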

Step Four: Handle Missing Data

As a data analyst you will always find missing data, and working with missing data is bound to be hard, especially when modelling. There are many ways to deal with missing values, and the right choice varies from case to case. The first and easiest is to drop all observations that have missing values. But sometimes there are too many missing values, and deleting all of those observations will change the results. For example, if a ‘Gender’ column has many missing values and we drop every affected row (or the column itself), we will never know the effect of gender on the dataset. For this reason, deletion might not always work.
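A minimal sketch of the deletion approach, using a tiny invented DataFrame (the column names ‘gender’ and ‘score’ are assumptions for the example):

```python
import pandas as pd
import numpy as np

# Hypothetical responses with gaps in both columns.
df = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "score":  [67, 72, np.nan, 90],
})

# Simplest option: drop every row that has any missing value.
dropped = df.dropna()
print(dropped)
```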

The second option is to impute missing values. This is often preferable since no data is lost. In some cases we impute zero; in others, the mean, median or mode. Sometimes we forward fill or backward fill, that is, replace a missing value with the previous or the next value respectively. As a data analyst, you have to decide which method best suits your data.
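Here is a sketch of a few imputation options on the same kind of invented data as above; which one is appropriate depends entirely on your dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical data with missing entries, as in the previous sketch.
df = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "score":  [67, 72, np.nan, 90],
})

# Impute the numeric column with its mean and the categorical one with its mode.
df["score"] = df["score"].fillna(df["score"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Alternatively, carry neighbouring values across the gaps:
# df = df.ffill()   # forward fill: replace a gap with the previous value
# df = df.bfill()   # backward fill: replace a gap with the next value
print(df)
```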

Once all these steps are completed, your data is pretty much ready for use. As you clean and interact with your data, keep asking whether it makes sense and whether it is fit for the field it comes from. Always think critically as you work with your dataset. At this point, you can start exploring and looking for insights.

Good luck on your journey!!
