A Look at ETL Testing: Importance, Process, and Types

ETL stands for Extract, Transform and Load. It's a fundamental approach used by data integration tools and BI platforms to convert raw data into valuable insights.

Here's how it works:

1. Extraction: Gather historical or real-time data from multiple systems (ERP, CRM, third-party sources) in various formats.

2. Transformation: Place the extracted data in a staging area and reformat it to a standard model (e.g., values recorded as $34.5, 0.9 cents, and $01,65 are standardized to $34.5, $0.9, and $1.65); a minimal sketch of this step follows the list.

3. Loading: Load the structured, formatted data into the target database or data warehouse.
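
As a rough illustration, the transformation and loading steps can often be expressed as a single SQL statement. This is only a sketch: the staging table stg_orders, the target table dw_orders, and their columns are hypothetical, and the cleanup expression only covers the simple dollar-sign and decimal-comma cases.

    -- Minimal Transform + Load sketch (table and column names are hypothetical).
    -- Standardizes raw text amounts such as '$34.5' or '$01,65' into a numeric
    -- dollar value, then loads the cleaned rows into the warehouse table.
    INSERT INTO dw_orders (order_id, customer_id, amount_usd)
    SELECT
        order_id,
        customer_id,
        CAST(REPLACE(REPLACE(raw_amount, '$', ''), ',', '.') AS DECIMAL(10, 2))
    FROM stg_orders
    WHERE raw_amount IS NOT NULL;  -- rows failing this basic check stay in staging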

What is ETL testing and why do we need it?

ETL testing ensures data integrity during the transfer from source systems to the data warehouse. It mitigates risks arising from:

  • Multiple sources with varying formats.
  • Large and growing data volumes.
  • Error-prone data mapping processes, leading to duplicates and quality issues.

Common ETL Testing Errors:

  • Invalid source values: Resulting in missing data at the destination.
  • Dirty data: Not conforming to mapping rules (a sample check follows this list).
  • Inconsistent formats: Between source and target databases.
  • Input/output bugs: Accepting invalid values, rejecting valid ones.
  • Performance issues: When handling multiple users or large data volumes.
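
To make the "dirty data" and "input/output bug" categories concrete, testers often run simple profiling queries against the target. A minimal sketch, assuming a hypothetical dw_orders table whose amount_usd must be positive and whose status must come from a fixed set of codes:

    -- Flag target rows that violate the assumed mapping rules
    -- (the table, columns, and status codes are illustrative only).
    SELECT order_id, amount_usd, status
    FROM dw_orders
    WHERE amount_usd IS NULL
       OR amount_usd <= 0
       OR status NOT IN ('NEW', 'PAID', 'SHIPPED', 'CANCELLED');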

So how do we ensure that data is safely mapped, transformed, and delivered to its destination?

The ETL Testing Process:

Understanding Business Requirements:

Designing an effective ETL testing process requires understanding your organization's business requirements. This involves examining its data models, business workflows, reports, sources and destinations, and data pipelines.

Data Source Identification and Validation:

Identify source data and perform initial schema checks and table validation to ensure the ETL process aligns with business requirements.
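
One way to perform the initial schema check is to read the column definitions straight from the database catalog and compare them with the mapping document. A sketch, assuming the database exposes the standard INFORMATION_SCHEMA views and a hypothetical target table named dw_orders:

    -- List column names, data types, and nullability for the target table
    -- so they can be compared against the mapping document.
    SELECT column_name, data_type, is_nullable, character_maximum_length
    FROM information_schema.columns
    WHERE table_name = 'dw_orders'
    ORDER BY ordinal_position;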

Creating and Executing Test Cases:

Source-to-target mapping and test case design are the next steps:

  • Check that all expected data is loaded into the target database.
  • Compare the number of records between source and target tables.
  • Check whether any records were rejected.
  • Check that the data is displayed in full in the target database.
  • Perform boundary value analysis.
  • Compare unique values of key fields between source and target tables.
  • Write ETL test cases in SQL, with queries for both source and target data extraction (see the sketch below).
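
A minimal sketch of two such test cases, using the same hypothetical staging (stg_orders) and warehouse (dw_orders) tables as earlier; on Oracle, EXCEPT would be written as MINUS:

    -- Test case 1: record counts should match between source and target.
    SELECT
        (SELECT COUNT(*) FROM stg_orders) AS source_count,
        (SELECT COUNT(*) FROM dw_orders)  AS target_count;

    -- Test case 2: key values present in the source but missing from the target.
    SELECT order_id FROM stg_orders
    EXCEPT
    SELECT order_id FROM dw_orders;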

Data Extraction and Reporting:

Extract data based on business requirements and use cases. During test case execution, identify the different kinds of errors or defects, try to reproduce them, and log them with adequate detail and screenshots.

Applying Transformations:

Ensure that transformations match the destination data warehouse schema, validate data flow, check data thresholds, and confirm that data types align with the mapping documents.
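
Threshold and data type checks can be written as queries that should return zero rows. A sketch with hypothetical columns and an assumed business rule that order amounts stay below $100,000:

    -- Threshold check: should return no rows if the assumed rule holds.
    SELECT order_id, amount_usd
    FROM dw_orders
    WHERE amount_usd < 0 OR amount_usd > 100000;

    -- Date check: order_date must be a real, non-future date after transformation.
    SELECT order_id, order_date
    FROM dw_orders
    WHERE order_date IS NULL OR order_date > CURRENT_DATE;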

Loading Data into the Data Warehouse:

Perform record count checks before and after data movement, verify that invalid data is rejected, and confirm that default values are applied where expected.
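
A sketch of such post-load checks, assuming rejected records are diverted to a hypothetical err_orders table and that a missing country value defaults to 'UNKNOWN':

    -- Loaded plus rejected rows should account for every source row.
    SELECT
        (SELECT COUNT(*) FROM stg_orders) AS source_rows,
        (SELECT COUNT(*) FROM dw_orders)  AS loaded_rows,
        (SELECT COUNT(*) FROM err_orders) AS rejected_rows;

    -- Default values: source rows with no country should carry the default.
    SELECT s.order_id
    FROM stg_orders s
    JOIN dw_orders d ON d.order_id = s.order_id
    WHERE s.country IS NULL AND d.country <> 'UNKNOWN';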

Re-Testing Bugs (Regression Testing)

Retest fixed bugs in the staging environment to ensure no traces remain and confirm no new defects have been introduced.

Summary Report and Test Closure

Prepare a detailed summary report of the testing process, defects, and test cases. Test the report's options, filters, layout, and export functionality. Inform stakeholders of any incomplete steps.

Types of Testing

Production Validation and Reconciliation: Validates the order and logic of the data in production.

Source-to-target Validation: Ensure data count matches between source and destination.

Metadata Testing: Check the data types, indexes, lengths, constraints, schemas, and values between the source and target systems.

Completeness Testing: This verifies that all source data is loaded into the destination system without duplication, repetition or loss.
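
A typical completeness check for duplicates, assuming order_id should be unique in the hypothetical dw_orders table:

    -- Keys loaded more than once; an empty result means no duplication.
    SELECT order_id, COUNT(*) AS copies
    FROM dw_orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;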

Transformation Testing: Confirm consistent data transformations.

Accuracy Testing: Ensure data content remains unchanged despite format/schema changes.

Data Quality Testing: Focuses on data quality, identifying invalid characters, precision issues, nulls, and unexpected patterns, and reports any invalid data.
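
Data quality checks often reduce to null and pattern profiling. A minimal sketch against a hypothetical dw_customers table; LIKE is used here because regular expression syntax varies between databases:

    -- Rows whose email is missing or does not resemble an address.
    SELECT customer_id, email
    FROM dw_customers
    WHERE email IS NULL
       OR email NOT LIKE '%_@_%._%';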

Report Testing: Checks the data in the summary report, determines whether the layout and functionality are appropriate, and performs calculations for additional analytical requirements.

Application Migration Testing: Verifies whether the ETL application functions properly after migration to a new platform or server.

Data and Constraint Checks: This technique checks the data type, length, index, and constraints of the data.

Conclusion:

  • ETL testing is crucial for ensuring data integrity and quality during the ETL process. It addresses risks like data loss, corruption, and inconsistency from multiple sources, large volumes, and complex mappings.
  • By understanding business requirements, validating data sources, and executing detailed test cases, organizations ensure accurate data extraction, transformation, and loading.
  • Effective ETL testing results in reliable data warehouses that support accurate business analytics and decision-making.
