WHY CLEAN YOUR DATA?

Knowing how to clean your data is advantageous for many reasons.

  • It prevents you from wasting time on shaky or even faulty analysis.
  • It prevents you from drawing the wrong conclusions, which would make you look bad!
  • It makes your analysis run faster. Correct, properly cleaned and formatted data speed up computation in advanced algorithms.

DATA CLEANING IS A 3-STEP PROCESS

  1. FIND THE DIRT
  2. SCRUB THE DIRT
  3. RINSE AND REPEAT


STEP 1: FIND THE DIRT

Start data cleaning by determining what is wrong with your data.

Look for the following:

  • Are there rows with empty values? Entire columns with no data? Which data is missing and why?
  • How is data distributed? Remember, visualizations are your friends. Plot outliers. Check distributions to see which groups or ranges are more heavily represented in your dataset.
  • Keep an eye out for the weird: are there impossible values, like “date of birth: male” or “address: -1234”?
  • Is your data consistent? Why are the same product names sometimes written in uppercase and other times in camelCase?
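
To make the checklist above concrete, here is a minimal sketch in plain Python. The sample records, the column names, and the helper functions (count_missing, find_impossible, find_inconsistent_spellings) are all made up for illustration, and the 0 to 120 age range is an assumption you would replace with whatever makes sense for your own data.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
rows = [
    {"product": "WIDGET", "price": 9.99, "age": 34},
    {"product": "widget", "price": None, "age": 41},
    {"product": "Gadget", "price": 12.50, "age": -5},
    {"product": "gadget", "price": 11.00, "age": None},
]

def count_missing(rows):
    """Count missing (None) values per column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] += 1
    return dict(missing)

def find_impossible(rows, column, low, high):
    """Return the indices of rows whose value falls outside [low, high]."""
    return [
        i for i, row in enumerate(rows)
        if row[column] is not None and not (low <= row[column] <= high)
    ]

def find_inconsistent_spellings(rows, column):
    """Group values that differ only by letter case."""
    variants = {}
    for row in rows:
        value = row[column]
        if value is not None:
            variants.setdefault(value.lower(), set()).add(value)
    # Keep only the names that show up in more than one spelling.
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}

missing = count_missing(rows)
impossible_ages = find_impossible(rows, "age", 0, 120)  # assumed valid range
spellings = find_inconsistent_spellings(rows, "product")
```

Each helper answers one question from the list: what is missing, what is impossible, and what is inconsistent. None of them changes the data yet; that is the job of Step 2.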

STEP 2: SCRUB THE DIRT

Knowing the problem is half the battle.

The other half is solving it.

How do you solve it, though?

One ring might rule them all, but one approach is not going to cut it with all your data cleaning problems.

Depending on the type of data dirt you’re facing, you’ll need different cleaning techniques.

Step 2 is broken down into eight parts:

  • Missing Data
  • Outliers
  • Contaminated Data
  • Inconsistent Data
  • Invalid Data
  • Duplicate Data
  • Data Type Issues
  • Structural Errors
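
As a rough illustration, here is how four of those eight parts (inconsistent data, data type issues, duplicate data, and missing data) might be handled in plain Python. The records and the scrub function are hypothetical, and filling gaps with the median is just one possible choice, not the only right answer.

```python
from statistics import median

# Hypothetical raw records: prices arrive as strings, names vary in case,
# one price is missing, and one row is an exact duplicate.
raw = [
    {"product": "WIDGET", "price": "10.00"},
    {"product": "widget", "price": None},
    {"product": "Gadget", "price": "12.00"},
    {"product": "Gadget", "price": "12.00"},
]

def scrub(rows):
    # Inconsistent data and data type issues:
    # one spelling per product, numeric prices instead of strings.
    cleaned = [
        {
            "product": row["product"].strip().lower(),
            "price": float(row["price"]) if row["price"] is not None else None,
        }
        for row in rows
    ]

    # Duplicate data: keep only the first occurrence of each record.
    seen = set()
    unique = []
    for row in cleaned:
        key = (row["product"], row["price"])
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # Missing data: fill gaps with the median of the known prices
    # (an assumption made for this sketch; dropping the row is another option).
    fill_value = median(r["price"] for r in unique if r["price"] is not None)
    for row in unique:
        if row["price"] is None:
            row["price"] = fill_value
    return unique

cleaned = scrub(raw)
```

Order matters here: duplicates are dropped before the median is computed, so a repeated row does not get counted twice when filling the gaps.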

STEP 3: RINSE AND REPEAT

Once you have cleaned your data, repeat steps 1 and 2.

This is helpful for three reasons:

  1. You might have missed something. Repeating the cleaning process helps you catch those pesky hidden issues.
  2. Through cleaning, you discover new issues. For example, once you remove outliers from your dataset, you might notice that the data is no longer bell shaped and needs reshaping before you can analyze it.
  3. You learn more about your data. Every time you sweep through your dataset and look at the distributions of values, you learn more about your data, which gives you hunches as to what to analyze.
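
Here is a small sketch of why the second pass matters, using outliers as the example. The readings are invented, the function names are hypothetical, and the outlier test (anything beyond 1.5 times the interquartile range from the quartiles) is one common rule of thumb, assumed here for illustration.

```python
from statistics import quantiles

def find_outliers(values):
    """Step 1: flag values more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

def clean_until_stable(values, max_passes=5):
    """Repeat find-and-scrub until a pass finds nothing new."""
    passes = 0
    while passes < max_passes:
        dirt = find_outliers(values)
        if not dirt:
            break
        # Step 2: scrub whatever this pass found.
        values = [v for v in values if v not in dirt]
        passes += 1
    return values, passes

# Hypothetical readings: 900 and 1000 are so extreme that they hide 30.
readings = [10, 11, 12, 12, 13, 13, 14, 30, 900, 1000]
cleaned, passes = clean_until_stable(readings)
```

With this sample, the first pass removes only 900 and 1000. The value 30 looks fine next to them and gets flagged only on the second pass, once the extreme readings are gone. The max_passes cap is there so the loop cannot keep trimming forever.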

As the old machine learning wisdom goes: Garbage in, garbage out...
