Data sanity checks: Let's deep dive
Me: wandering dive sites for interesting macro data

A quick summary with oceanic examples (I love the ocean!)

1. Range Checks

Ensure gathered data values fall within expected limits.

Example:

  • Sea Temperature: -2°C to 35°C (since seawater can be just below freezing due to salinity and up to tropical temperatures).
  • Depth: 0 meters to 11,000 meters (Mariana Trench).
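
A minimal pandas sketch of a range check; the column names (sea_temperature, depth_m) and the bounds are just the illustrative values from above:

```python
import pandas as pd

# Hypothetical survey records; column names are illustrative.
df = pd.DataFrame({
    "sea_temperature": [12.5, 28.0, 41.0, -1.5],  # degrees C
    "depth_m": [30, 250, 11_500, 5],              # metres
})

# Values outside the expected physical limits fail the check.
temp_ok = df["sea_temperature"].between(-2, 35)
depth_ok = df["depth_m"].between(0, 11_000)

print(df[~(temp_ok & depth_ok)])  # flags the 41 degC / 11,500 m row
```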

2. Type Checks

Confirm data values are of the correct type.

Example:

  • Numerical: Check if salinity contains only numbers.
  • Date: Ensure coral sampling_date has valid dates.
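
A quick pandas sketch: coerce each column to its expected type so invalid entries surface as NaN/NaT (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "salinity": ["35.1", "34.9", "high", "36.2"],  # should be numeric
    "sampling_date": ["2024-03-01", "2024-13-40", "2024-03-05", None],
})

# Coercion turns anything that is not the expected type into NaN/NaT.
salinity = pd.to_numeric(df["salinity"], errors="coerce")
dates = pd.to_datetime(df["sampling_date"], format="%Y-%m-%d", errors="coerce")

print(df[salinity.isna()])  # 'high' fails the numeric check
print(df[dates.isna()])     # '2024-13-40' and the missing date fail
```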

3. Format Checks

Ensure data follows specific formats.

Example:

  • GPS Coordinates: Validate they follow ±DD.DDDD, ±DDD.DDDD.
  • Sample ID: Ensure sample IDs follow the format STN-XX-YY (an illustrative format, not an actual standard); see the regex sketch below.
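
One way to express both rules as regular expressions; the patterns below are my reading of the formats above (including the made-up STN-XX-YY sample ID), not real standards:

```python
import re

# Latitude/longitude as "+/-DD.DDDD, +/-DDD.DDDD".
GPS_RE = re.compile(r"^[+-]?\d{1,2}\.\d{4},\s*[+-]?\d{1,3}\.\d{4}$")
# The illustrative STN-XX-YY sample-ID format from this article.
SAMPLE_ID_RE = re.compile(r"^STN-\d{2}-\d{2}$")

print(bool(GPS_RE.match("-12.3456, 145.6789")))  # True
print(bool(GPS_RE.match("12.34, 145.67")))       # False: too few decimals
print(bool(SAMPLE_ID_RE.match("STN-04-17")))     # True
```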

4. Uniqueness Checks

Ensure fields contain unique values.

Example:

  • Dive Site ID: Each dive site should be assigned a unique ID.
  • Survey ID: Verify each survey_id is unique.
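
A short pandas sketch that surfaces repeated identifiers (survey_id as in the example above):

```python
import pandas as pd

df = pd.DataFrame({"survey_id": ["SVY-01", "SVY-02", "SVY-02", "SVY-03"]})

# keep=False marks every occurrence of a repeated ID, not just the extras.
dupes = df[df["survey_id"].duplicated(keep=False)]
print(dupes)  # both SVY-02 rows violate uniqueness
```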

5. Consistency Checks

Ensure data values are logically consistent.

Example:

  • Date Consistency: end_date should not be before start_date.
  • Geographical Consistency: Check that latitude and longitude fall within valid bounds and point to oceanic locations.
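
A minimal pandas sketch of both rules. Truly confirming an "oceanic location" would need a land/sea mask or a geospatial lookup; this sketch only checks that the coordinates are within valid bounds:

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-10", "2024-02-01"]),
    "end_date": pd.to_datetime(["2024-01-20", "2024-01-15"]),
    "latitude": [-10.5, 95.0],   # 95 deg is not a valid latitude
    "longitude": [142.3, 150.0],
})

dates_ok = df["end_date"] >= df["start_date"]
coords_ok = df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)

print(df[~(dates_ok & coords_ok)])  # the second row fails both checks
```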

6. Completeness Checks

Ensure all required fields are filled.

Example:

  • Required Fields: Verify sampling_date, station_id, and sea_temperature are not empty.
  • Null Checks: No required fields should be null.
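
A small pandas sketch flagging rows where any required field is missing (field names taken from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "sampling_date": ["2024-03-01", None, "2024-03-03"],
    "station_id": ["STN-01", "STN-02", None],
    "sea_temperature": [24.1, 25.3, 26.0],
})

REQUIRED = ["sampling_date", "station_id", "sea_temperature"]

# A row fails the completeness check if any required field is null.
incomplete = df[df[REQUIRED].isna().any(axis=1)]
print(incomplete)  # rows with a missing date or station_id
```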

7. Validity Checks

Confirm data values meet predefined rules.

Example:

  • Species Codes: Match marine species codes to a recognized registry.
  • Ocean Current Directions: Should be a valid compass bearing between 0° and 360°, or a recognized compass point (e.g., N, NE, E).
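
A minimal sketch of both validity rules. The species set is a hard-coded, hypothetical stand-in; a real pipeline would load codes from a recognized registry such as WoRMS:

```python
# Stand-in for a recognized species registry (hypothetical codes).
VALID_SPECIES = {"CHELONIA_MYDAS", "MANTA_BIROSTRIS", "AMPHIPRION_OCELLARIS"}

def valid_bearing(degrees: float) -> bool:
    """A current direction must be a compass bearing in [0, 360)."""
    return 0 <= degrees < 360

observations = ["CHELONIA_MYDAS", "NEMO", "MANTA_BIROSTRIS"]
print([c for c in observations if c not in VALID_SPECIES])  # ['NEMO']
print(valid_bearing(275.0), valid_bearing(400.0))           # True False
```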

8. Duplicate Checks

Identify and handle duplicate records.

Example:

  • Duplicate Samples: Check for duplicates in samples based on station_id and sampling_date.
  • Voyage IDs: Ensure no duplicate voyage_id exists.
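
A short pandas sketch treating (station_id, sampling_date) as the duplicate key, as in the first bullet:

```python
import pandas as pd

df = pd.DataFrame({
    "station_id": ["STN-01", "STN-01", "STN-02"],
    "sampling_date": ["2024-03-01", "2024-03-01", "2024-03-01"],
    "sea_temperature": [24.1, 24.1, 25.3],
})

# Rows sharing station_id and sampling_date are duplicates; keep the first.
dup_mask = df.duplicated(subset=["station_id", "sampling_date"], keep="first")
print(df[dup_mask])       # the repeated STN-01 sample
df_clean = df[~dup_mask]  # deduplicated frame
```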

Why Do Data Sanity Checks Matter?

  1. Better Data Quality: Reduces errors and enhances analysis reliability.
  2. Data Integrity: Maintains trustworthiness and accuracy.
  3. Compliance: Meets industry standards and regulations.
  4. Efficiency: Saves time on data cleaning and processing.
  5. Informed Decisions: High-quality data leads to better insights and outcomes.

Conclusion

Understanding data pre-processing is of utmost importance. Clean, functional data enables faster, more scalable, and more economical outcomes. Let's keep data tidy!
