WHY CLEAN YOUR DATA?

Malaya Ranjan Aich

|Data Analyst| -- |Connecting & Shaping The Data|

Published: January 17, 2022

Knowing how to clean your data is advantageous for many reasons.

It prevents you from wasting time on shaky or outright faulty analysis.
It prevents you from drawing the wrong conclusions, which would make you look bad!
It makes your analysis run faster. Correct, properly cleaned and formatted data speeds up computation in advanced algorithms.

DATA CLEANING IS A 3-STEP PROCESS

FIND THE DIRT
SCRUB THE DIRT
RINSE AND REPEAT

STEP 1: FIND THE DIRT

Start data cleaning by determining what is wrong with your data.

Look for the following:

Are there rows with empty values? Entire columns with no data? Which data is missing and why?
How is data distributed? Remember, visualizations are your friends. Plot outliers. Check distributions to see which groups or ranges are more heavily represented in your dataset.
Keep an eye out for the weird: are there impossible values? Like “date of birth: male”, “address: -1234”.
Is your data consistent? Why are the same product names sometimes written in uppercase and other times in camelCase?
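The checks above can be sketched in a few lines of plain Python. This is only an illustration, not a complete profiling tool: the sample records, the column names, the 0 to 120 age range and the helper functions are all made up for the example.

```python
from collections import Counter

# Hypothetical sample records; None or "" stands for a missing value.
rows = [
    {"product": "Widget", "price": 9.99, "age": 34},
    {"product": "WIDGET", "price": None, "age": 29},
    {"product": "gadget", "price": 19.50, "age": -5},
    {"product": "Gadget", "price": 21.00, "age": 41},
]


def count_missing(rows):
    """Count missing (None or empty-string) values per column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    return dict(missing)


def find_impossible(rows, column, low, high):
    """Return the indexes of rows whose value falls outside [low, high]."""
    return [
        index
        for index, row in enumerate(rows)
        if row[column] is not None and not (low <= row[column] <= high)
    ]


def find_inconsistent_spellings(rows, column):
    """Group spellings of the same value that differ only by letter case."""
    variants = {}
    for row in rows:
        value = row[column]
        if value:
            variants.setdefault(value.lower(), set()).add(value)
    return {key: sorted(found) for key, found in variants.items() if len(found) > 1}


print(count_missing(rows))                           # which columns have gaps
print(find_impossible(rows, "age", 0, 120))          # rows with an impossible age
print(find_inconsistent_spellings(rows, "product"))  # same product, different case
```

Each helper answers one question from the list, so you can run them separately and decide what to fix first.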

STEP 2: SCRUB THE DIRT

Knowing the problem is half the battle.

The other half is solving it.

How do you solve it, though?

One ring might rule them all, but one approach is not going to cut it with all your data cleaning problems.

Depending on the type of data dirt you’re facing, you’ll need different cleaning techniques.

Step 2 is broken down into eight parts:

Missing Data
Outliers
Contaminated Data
Inconsistent Data
Invalid Data
Duplicate Data
Data Type Issues
Structural Errors
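As a rough sketch of how a few of these parts might look in practice, the snippet below handles inconsistent data (letter case and stray spaces), data type issues (prices stored as text), missing data and duplicate data. The sample records and the scrub helper are hypothetical, and filling gaps with the median is only one possible choice; dropping the row or asking the data owner can be better depending on why the value is missing.

```python
from statistics import median

# Hypothetical raw records, as they might arrive from a CSV export.
raw = [
    {"product": "Widget ", "price": "9.99"},
    {"product": "WIDGET", "price": "9.99"},
    {"product": "gadget", "price": ""},
    {"product": "Gizmo", "price": "30"},
    {"product": "Gadget", "price": "20"},
]


def scrub(rows):
    """Return a cleaned copy of rows; the input is left untouched."""
    cleaned = []
    for row in rows:
        # Inconsistent data: trim whitespace and use one letter case.
        name = row["product"].strip().lower()
        # Data type issues: turn text prices into floats, keep gaps as None.
        price = float(row["price"]) if row["price"] else None
        cleaned.append({"product": name, "price": price})

    # Missing data: fill gaps with the median of the known prices (an assumption).
    known_prices = [r["price"] for r in cleaned if r["price"] is not None]
    fill_value = median(known_prices)
    for record in cleaned:
        if record["price"] is None:
            record["price"] = fill_value

    # Duplicate data: keep the first occurrence of each (product, price) pair.
    seen = set()
    unique = []
    for record in cleaned:
        key = (record["product"], record["price"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


for record in scrub(raw):
    print(record)
```

Note that the two "Widget" rows only become recognisable as duplicates after the names are normalised, so the order of the steps matters.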

STEP 3: RINSE AND REPEAT

Once cleaned, you repeat steps 1 and 2.

This is helpful for three reasons:

You might have missed something. Repeating the cleaning process helps you catch those pesky hidden issues.
Through cleaning, you discover new issues. For example, once you remove outliers from your dataset, you may notice that the data is no longer bell-shaped and needs reshaping before you can analyze it.
You learn more about your data. Every time you sweep through your dataset and look at the distributions of values, you learn more about your data, which gives you hunches as to what to analyze.
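Here is a small sketch of why a second pass can pay off. It assumes a simple z-score rule with a threshold of 2, and both the threshold and the sample values are made up for illustration. In the first pass the extreme value 500 inflates the standard deviation so much that 40 looks normal; only after 500 is removed does 40 stand out.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold (illustrative rule)."""
    if len(values) < 2:
        return []
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - center) / spread > threshold]


def clean_until_stable(values, max_passes=10):
    """Repeat find-and-scrub until a pass finds nothing new."""
    current = list(values)
    passes = 0
    while passes < max_passes:
        outliers = find_outliers(current)   # Step 1: find the dirt
        if not outliers:
            break
        current = [v for v in current if v not in outliers]  # Step 2: scrub it
        passes += 1                         # Step 3: rinse and repeat
    return current, passes


# Hypothetical measurements with two suspicious values.
measurements = [10, 11, 12, 11, 10, 12, 13, 11, 40, 500]
cleaned, passes = clean_until_stable(measurements)
print(cleaned, passes)
```

The max_passes limit is a safety net so the loop cannot keep trimming your data forever; if you hit it, look at the data by hand instead.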

As the old machine learning wisdom goes: Garbage in, garbage out...
