Data sanity checks: Let's deep dive
Me: wandering dive sites for interesting macro data

A quick summary with oceanic examples (I love the ocean!)

1. Range Checks

Ensure gathered data values fall within expected limits.

Example:

  • Sea Temperature: -2°C to 35°C (since seawater can be just below freezing due to salinity and up to tropical temperatures).
  • Depth: 0 meters to 11,000 meters (Mariana Trench).
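
A minimal pandas sketch of a range check; the column names (sea_temperature, depth_m) and the bounds are just the illustrative values from above:

```python
import pandas as pd

# Hypothetical survey records; column names are illustrative.
df = pd.DataFrame({
    "sea_temperature": [12.5, 28.0, 41.0, -1.5],  # degrees C
    "depth_m": [30, 250, 11_500, 5],              # metres
})

# Values outside the expected physical limits fail the check.
temp_ok = df["sea_temperature"].between(-2, 35)
depth_ok = df["depth_m"].between(0, 11_000)

print(df[~(temp_ok & depth_ok)])  # flags the 41 degC / 11,500 m row
```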

2. Type Checks

Confirm data values are of the correct type.

Example:

  • Numerical: Check if salinity contains only numbers.
  • Date: Ensure coral sampling_date has valid dates.
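
A quick pandas sketch: coerce each column to its expected type so invalid entries surface as NaN/NaT (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "salinity": ["35.1", "34.9", "high", "36.2"],  # should be numeric
    "sampling_date": ["2024-03-01", "2024-13-40", "2024-03-05", None],
})

# Coercion turns anything that is not the expected type into NaN/NaT.
salinity = pd.to_numeric(df["salinity"], errors="coerce")
dates = pd.to_datetime(df["sampling_date"], format="%Y-%m-%d", errors="coerce")

print(df[salinity.isna()])  # 'high' fails the numeric check
print(df[dates.isna()])     # '2024-13-40' and the missing date fail
```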

3. Format Checks

Ensure data follows specific formats.

Example:

  • GPS Coordinates: Validate they follow ±DD.DDDD, ±DDD.DDDD.
  • Sample ID: Ensure sample IDs follow the format STN-XX-YY (an illustrative format, not an actual standard); see the regex sketch below.
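
One way to express both rules as regular expressions; the patterns below are my reading of the formats above (including the made-up STN-XX-YY sample ID), not real standards:

```python
import re

# Latitude/longitude as "+/-DD.DDDD, +/-DDD.DDDD".
GPS_RE = re.compile(r"^[+-]?\d{1,2}\.\d{4},\s*[+-]?\d{1,3}\.\d{4}$")
# The illustrative STN-XX-YY sample-ID format from this article.
SAMPLE_ID_RE = re.compile(r"^STN-\d{2}-\d{2}$")

print(bool(GPS_RE.match("-12.3456, 145.6789")))  # True
print(bool(GPS_RE.match("12.34, 145.67")))       # False: too few decimals
print(bool(SAMPLE_ID_RE.match("STN-04-17")))     # True
```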

4. Uniqueness Checks

Ensure fields contain unique values.

Example:

  • Dive Site ID: Each dive site should be assigned a unique ID.
  • Survey ID: Verify each survey_id is unique.
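
A short pandas sketch that surfaces repeated identifiers (survey_id as in the example above):

```python
import pandas as pd

df = pd.DataFrame({"survey_id": ["SVY-01", "SVY-02", "SVY-02", "SVY-03"]})

# keep=False marks every occurrence of a repeated ID, not just the extras.
dupes = df[df["survey_id"].duplicated(keep=False)]
print(dupes)  # both SVY-02 rows violate uniqueness
```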

5. Consistency Checks

Ensure data values are logically consistent.

Example:

  • Date Consistency: end_date should not be before start_date.
  • Geographical Consistency: Check that latitude and longitude fall within valid bounds and point to oceanic locations.
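
A minimal pandas sketch of both rules. Truly confirming an "oceanic location" would need a land/sea mask or a geospatial lookup; this sketch only checks that the coordinates are within valid bounds:

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-10", "2024-02-01"]),
    "end_date": pd.to_datetime(["2024-01-20", "2024-01-15"]),
    "latitude": [-10.5, 95.0],   # 95 deg is not a valid latitude
    "longitude": [142.3, 150.0],
})

dates_ok = df["end_date"] >= df["start_date"]
coords_ok = df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)

print(df[~(dates_ok & coords_ok)])  # the second row fails both checks
```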

6. Completeness Checks

Ensure all required fields are filled.

Example:

  • Required Fields: Verify sampling_date, station_id, and sea_temperature are not empty.
  • Null Checks: No required fields should be null.
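
A small pandas sketch flagging rows where any required field is missing (field names taken from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "sampling_date": ["2024-03-01", None, "2024-03-03"],
    "station_id": ["STN-01", "STN-02", None],
    "sea_temperature": [24.1, 25.3, 26.0],
})

REQUIRED = ["sampling_date", "station_id", "sea_temperature"]

# A row fails the completeness check if any required field is null.
incomplete = df[df[REQUIRED].isna().any(axis=1)]
print(incomplete)  # rows with a missing date or station_id
```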

7. Validity Checks

Confirm data values meet predefined rules.

Example:

  • Species Codes: Match marine species codes to a recognized registry.
  • Ocean Current Directions: Should be a valid compass bearing between 0° and 360°, or a recognized compass point (e.g., N, NE, E).
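
A minimal sketch of both validity rules. The species set is a hard-coded, hypothetical stand-in; a real pipeline would load codes from a recognized registry such as WoRMS:

```python
# Stand-in for a recognized species registry (hypothetical codes).
VALID_SPECIES = {"CHELONIA_MYDAS", "MANTA_BIROSTRIS", "AMPHIPRION_OCELLARIS"}

def valid_bearing(degrees: float) -> bool:
    """A current direction must be a compass bearing in [0, 360)."""
    return 0 <= degrees < 360

observations = ["CHELONIA_MYDAS", "NEMO", "MANTA_BIROSTRIS"]
print([c for c in observations if c not in VALID_SPECIES])  # ['NEMO']
print(valid_bearing(275.0), valid_bearing(400.0))           # True False
```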

8. Duplicate Checks

Identify and handle duplicate records.

Example:

  • Duplicate Samples: Check for duplicates in samples based on station_id and sampling_date.
  • Voyage IDs: Ensure no duplicate voyage_id exists.
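
A short pandas sketch treating (station_id, sampling_date) as the duplicate key, as in the first bullet:

```python
import pandas as pd

df = pd.DataFrame({
    "station_id": ["STN-01", "STN-01", "STN-02"],
    "sampling_date": ["2024-03-01", "2024-03-01", "2024-03-01"],
    "sea_temperature": [24.1, 24.1, 25.3],
})

# Rows sharing station_id and sampling_date are duplicates; keep the first.
dup_mask = df.duplicated(subset=["station_id", "sampling_date"], keep="first")
print(df[dup_mask])       # the repeated STN-01 sample
df_clean = df[~dup_mask]  # deduplicated frame
```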

Why Do Data Sanity Checks Matter?

  1. Better Data Quality: Reduces errors and enhances analysis reliability.
  2. Data Integrity: Maintains trustworthiness and accuracy.
  3. Compliance: Meets industry standards and regulations.
  4. Efficiency: Saves time on data cleaning and processing.
  5. Informed Decisions: High-quality data leads to better insights and outcomes.

Conclusion

Understanding data pre-processing is of utmost importance. Clean, functional data enables faster, more scalable, and more economical outcomes. Let's keep data tidy!
