How to Clean Your Data

I’m sure we’ve all heard it: data is the new oil. But should we use data exactly as we find it? Let’s talk about how to approach raw data and get the most out of it.

Dirty data often leads to poor business decisions, and poor decisions result in losses, the very thing we are trying to avoid.

According to IBM estimates, poor data quality results in losses of up to $3.1 trillion per year in the US alone.

Extrapolate that to the rest of the world and the losses are astronomical! Cleaning data, therefore, becomes an inescapable part of the process. I say it’s the most important part. It is definitely the most boring of your data tasks, and yet without it you will suffer down the line.

What is Data Cleaning?

Whenever we come across raw data, it likely contains faulty information. This could be the result of human error or of scraping data from the web. Sometimes, combining data from different sources also produces dirty data. Take, for example, a form filled out online. A person, just for the heck of it, might enter the letter ‘J’ in the gender field. If the form only expects Male or Female, ‘J’ falls into neither category. Using such an observation would distort our results, so it is important to deal with issues like this at the very beginning.
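As a quick illustration, here is a minimal pandas sketch of spotting entries that fall outside the expected categories; the column name ‘gender’ and the allowed values are assumptions made for this example, not part of any real dataset.

```python
import pandas as pd

# Hypothetical survey responses; the 'gender' column and its allowed
# values are invented purely for this illustration.
df = pd.DataFrame({"gender": ["Male", "Female", "J", "Female", "Male"]})

allowed = {"Male", "Female"}

# Flag any response that is not one of the expected categories.
invalid = df[~df["gender"].isin(allowed)]
print(invalid)
```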

Data cleaning is the process of identifying inconsistencies in a dataset and dealing with them accordingly. There is no single standard way to clean data, but there are steps that apply to almost all datasets. Let’s get into them.

Step One: Remove Duplicate Observations

Assuming your dataset is in CSV format, you might have rows that are duplicated. Your first task is to identify these rows and remove them. Duplicates usually arise when you combine data from different places; for instance, a user might have taken a survey twice. To maintain accuracy, we cannot treat these as two different observations. Whether you are using Excel, Python or any other data analysis software, removing duplicates should be the first step in data cleaning.
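In pandas this is a one-liner. Here is a minimal sketch, where ‘survey.csv’ is just a placeholder file name:

```python
import pandas as pd

# Load the raw data; 'survey.csv' is a placeholder file name.
df = pd.read_csv("survey.csv")

# Count how many fully identical rows exist before removing them.
print(f"Duplicate rows: {df.duplicated().sum()}")

# Keep the first occurrence of each duplicated row and drop the rest.
df = df.drop_duplicates(keep="first")
```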

While you are getting rid of duplicate observations, it is wise to get rid of anomalies as well. Anomalies are observations in the dataset that do not fit within the parameters you are studying. For instance, if your data is about animals but you find an observation describing a fridge, you have to remove it, because it is not consistent with the problem under study.
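A small, assumed example: if the dataset had a ‘category’ column (hypothetical, invented for this sketch), out-of-scope rows could be filtered out like this:

```python
import pandas as pd

# Hypothetical data mixing relevant and irrelevant observations.
df = pd.DataFrame({
    "name": ["cat", "dog", "fridge", "horse"],
    "category": ["animal", "animal", "appliance", "animal"],
})

# Keep only the observations that belong to the problem under study.
df = df[df["category"] == "animal"]
print(df)
```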

Step Two: Deal with Structural Errors

It’s important that your data is consistent. For instance, if we are measuring weight in kilograms, an observation recorded in grams will throw us off our game. At this stage we deal with problems such as misspellings, mislabelled classes and inconsistent naming conventions. Dates are especially important here; they should follow one uniform format.
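As a rough sketch of this kind of standardisation (all column names, units and dates here are invented for the example), pandas makes it straightforward:

```python
import pandas as pd

# Hypothetical measurements with inconsistent labels, units and date strings.
df = pd.DataFrame({
    "unit":   ["kg", "KG ", "g", "kg"],
    "weight": [70.0, 65.0, 80000.0, 72.5],
    "date":   ["05/01/2022", "06/01/2022", "07/01/2022", "08/01/2022"],
})

# Standardise the label text: strip whitespace and lower-case it.
df["unit"] = df["unit"].str.strip().str.lower()

# Convert grams to kilograms so every weight uses the same unit.
grams = df["unit"] == "g"
df.loc[grams, "weight"] = df.loc[grams, "weight"] / 1000
df.loc[grams, "unit"] = "kg"

# Parse dates into proper datetimes using one explicit format.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
print(df)
```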

Incomplete observations should also be dealt with if optimum results are to be achieved; handling missing values in detail is the subject of Step Four below.

Step Three: Deal with Outliers

This is a tricky one to handle. Outliers are observations that seem to be way off the normal in a dataset. For instance, if we are dealing with teachers’ salaries in a certain state, where the average is $80,000, one observation could read $800,000. Now, it could be that there is an actual teacher in that state who earns that amount, or it could be a typo. It could even be someone joking around. Outliers have to be handled on a case-by-case basis. Sometimes there are legitimate reasons to keep them, while in other cases it’s best to get rid of them. Perhaps you can keep them in the EDA phase of the analysis but drop them in the modelling phase. This way, you can explore why the outlier is there without letting it ruin your model.
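One common way to flag candidates for review (a sketch, not the only approach; the salary figures here are made up) is the interquartile-range rule of thumb:

```python
import pandas as pd

# Hypothetical teacher salaries, including one suspicious value.
salaries = pd.Series([78000, 80000, 81000, 79500, 82000, 800000], name="salary")

# Flag values far outside the interquartile range (a common rule of thumb).
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]

# Inspect these during EDA, then decide case by case whether to keep them.
print(outliers)
```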

Step Four: Handle Missing Data

As a data analyst you will always find missing data, and working with missing data is bound to be hard, especially when modelling. There are many ways to deal with missing values, and the right choice varies from case to case. The first and easiest is to drop all observations that have missing values. But sometimes there are too many missing values, and deleting all of those observations will change the results. For example, if a ‘Gender’ column has many missing values and we drop every affected row (or the column itself), we will never know the effect of gender on the dataset. For this reason, deletion might not always work.
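A minimal sketch of the deletion approach, using a tiny invented DataFrame (the column names ‘gender’ and ‘score’ are assumptions for the example):

```python
import pandas as pd
import numpy as np

# Hypothetical responses with gaps in both columns.
df = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "score":  [67, 72, np.nan, 90],
})

# Simplest option: drop every row that has any missing value.
dropped = df.dropna()
print(dropped)
```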

The second option is to impute missing values. This is often preferable since no data is lost. In some cases we impute zero; in others, the mean, median or mode. Sometimes we forward fill or backward fill, that is, replace a missing value with the previous or the next value respectively. As a data analyst, you have to decide which method best suits your data.
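Here is a sketch of a few imputation options on the same kind of invented data as above; which one is appropriate depends entirely on your dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical data with missing entries, as in the previous sketch.
df = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "score":  [67, 72, np.nan, 90],
})

# Impute the numeric column with its mean and the categorical one with its mode.
df["score"] = df["score"].fillna(df["score"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Alternatively, carry neighbouring values across the gaps:
# df = df.ffill()   # forward fill: replace a gap with the previous value
# df = df.bfill()   # backward fill: replace a gap with the next value
print(df)
```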

Once all these steps are completed, your data is pretty much ready for use. As you clean and interact with your data, keep asking whether it makes sense and whether it is fit for the field it comes from. Always think critically as you work with your dataset. At this point, you can start exploring and looking for insights.

Good luck on your journey!!
