Mastering Incremental ETL with DeltaStreamer and SQL-Based Transformer

In the realm of data engineering, Incremental Extract, Transform, Load (ETL) processes have become indispensable for businesses dealing with large volumes of data. These processes enable organizations to efficiently handle data updates and additions without reprocessing entire datasets. In this blog post, we'll walk through Incremental ETL using DeltaStreamer and the SQL-Based Transformer, focusing on joining Hudi tables with other Glue tables to build denormalized Hudi tables.

Understanding DeltaStreamer and Its Ease of Setup

DeltaStreamer is a powerful tool that simplifies the process of ingesting data into Hudi tables from various sources such as Parquet, JSON, CSV, Kafka, etc. Its ease of setup and flexibility make it an ideal choice for implementing Incremental ETL pipelines. With DeltaStreamer, developers can effortlessly configure ingestion pipelines tailored to their specific use cases.
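To give a feel for that ease of setup, here is a minimal, hedged sketch of a DeltaStreamer job that ingests raw Parquet files into a Hudi table. The paths, field names, and bundle version are placeholders rather than values from the labs below; the entry class shown is the Hudi 0.14.x name (older releases use org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer).

```bash
# Minimal sketch: ingest raw Parquet files from S3 into a COPY_ON_WRITE Hudi table.
# All bucket names, paths, and field names below are illustrative placeholders.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  /path/to/hudi-utilities-bundle_2.12-0.14.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field updated_at \
  --target-base-path s3://<your-bucket>/bronze/customers \
  --target-table customers \
  --hoodie-conf "hoodie.datasource.write.recordkey.field=customer_id" \
  --hoodie-conf "hoodie.datasource.write.precombine.field=updated_at" \
  --hoodie-conf "hoodie.streamer.source.dfs.root=s3://<your-bucket>/raw/customers/"
```

Pointing --source-class at a different source class (JsonDFSSource, CsvDFSSource, a Kafka source, and so on) is usually all it takes to switch input formats, which is what makes the tool quick to wire up.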

The Significance of Incremental ETL

Before diving into the technical details, it's crucial to understand the importance of Incremental ETL. Traditional ETL processes involve extracting, transforming, and loading entire datasets, which can be time-consuming and resource-intensive, especially for large datasets. Incremental ETL, on the other hand, only processes the changes or additions to the data, significantly reducing processing time and resource consumption. This approach is particularly beneficial for real-time analytics, where timely insights are paramount.

Case Study: Uber's Lakehouse Architecture

A notable example of the importance of Incremental ETL can be found in Uber's Lakehouse architecture. Uber leverages Incremental ETL processes to efficiently manage and analyze vast amounts of data generated by its ride-sharing platform. By implementing Incremental ETL pipelines, Uber can continuously update its data lake with minimal latency, enabling real-time analytics and decision-making.

Building Denormalized Hudi Tables with DeltaStreamer

To illustrate the concept of Incremental ETL with DeltaStreamer and the SQL-Based Transformer, let's consider a hypothetical scenario involving two tables: customers and orders.


Video Guide with Labs

https://www.youtube.com/watch?v=Jh81BOGjUY0


Use the code in the repository below to create the two sample Hudi tables (customers and orders), with Glue as the Hive metastore:

https://github.com/soumilshah1995/DeltaHudiTransformations
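The repository above contains the complete setup. Purely as an illustration of what it does, and not the repository's exact code, two such bronze tables could be created through Spark SQL's Hudi DDL support along these lines; the column names, S3 locations, Glue settings, and bundle versions are assumptions:

```bash
# Hypothetical sketch: create two sample Hudi tables with Glue as the Hive
# metastore, using the Spark SQL CLI and Hudi's SQL DDL support.
spark-sql \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
  -e "
    -- assumed columns and locations; the real lab data may differ
    CREATE TABLE IF NOT EXISTS customers (
      customer_id STRING,
      name        STRING,
      email       STRING,
      created_at  TIMESTAMP
    ) USING hudi
    LOCATION 's3://<your-bucket>/bronze/customers/'
    TBLPROPERTIES (primaryKey = 'customer_id', preCombineField = 'created_at');

    CREATE TABLE IF NOT EXISTS orders (
      order_id    STRING,
      customer_id STRING,
      amount      DOUBLE,
      order_date  TIMESTAMP
    ) USING hudi
    LOCATION 's3://<your-bucket>/bronze/orders/'
    TBLPROPERTIES (primaryKey = 'order_id', preCombineField = 'order_date');
  "
```

On EMR, the metastore client factory setting is what points Hive metastore calls at the Glue Data Catalog, so the tables show up as Glue tables.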

In this scenario, we aim to incrementally fetch orders and join them with customer data to build denormalized Hudi tables.

Implementation Using DeltaStreamer

How the data will be joined

The join is driven by a handful of DeltaStreamer command-line options and Hudi configurations, described below; a sketch of the full spark-submit command follows this walkthrough.

  1. --source-class: Specifies the source class responsible for providing data to the DeltaStreamer. In this case, we're using org.apache.hudi.utilities.sources.HoodieIncrSource, which indicates that we're fetching data from a Hudi table incrementally. This means that DeltaStreamer will only process new or updated records since the last checkpoint.
  2. --source-ordering-field: Defines the field used for ordering the data during ingestion. In this example, the data will be ordered based on the order_date field.
  3. --target-base-path: Specifies the base path where the output data will be stored. Here, the output will be stored in the s3://datalake-hudi-test-buckets/silver/ directory.
  4. --target-table: Defines the target table name where the data will be stored. In this case, the data will be stored in the orders table.
  5. --transformer-class: Specifies the transformer class responsible for transforming the data before writing it to the target table. Here, we're using org.apache.hudi.utilities.transform.SqlQueryBasedTransformer, indicating that the data will be transformed using an SQL query.
  6. --table-type: Specifies the type of table to be created. In this case, COPY_ON_WRITE indicates that a copy-on-write Hudi table will be created.
  7. --hoodie-conf: Passes individual Hudi configuration properties as key=value pairs; the key settings used in this pipeline are described below.


hoodie.streamer.source.hoodieincr.path: Specifies the path to the Hudi table from which incremental data will be fetched. This path points to the Hudi bronze table where new data is ingested.

hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy: Specifies the strategy for handling a missing checkpoint. Here, READ_UPTO_LATEST_COMMIT means that when no checkpoint is found, the streamer reads the source table up to its latest commit.

hoodie.deltastreamer.transformer.sql: Specifies the SQL query used to transform the data before it is written to the target table. The query joins the incoming data with the customers table on customer_id; within the query, the incoming batch is referenced through the <SRC> placeholder, which DeltaStreamer substitutes at runtime.
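Putting the options and configurations above together, the full job looks roughly like the sketch below. The bronze-table path, record key field, joined column names, and bundle version are assumptions for illustration rather than the exact lab values, and the Glue metastore setting assumes the job's Spark session can resolve the Glue-registered customers table.

```bash
# Hedged sketch of the incremental join job described above (Hudi 0.14.x names).
# It reads only new commits from the bronze orders table, joins them with the
# customers table in the Glue catalog, and upserts the result into the silver zone.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
  /path/to/hudi-utilities-bundle_2.12-0.14.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.HoodieIncrSource \
  --source-ordering-field order_date \
  --target-base-path s3://datalake-hudi-test-buckets/silver/orders \
  --target-table orders \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf "hoodie.datasource.write.recordkey.field=order_id" \
  --hoodie-conf "hoodie.datasource.write.precombine.field=order_date" \
  --hoodie-conf "hoodie.streamer.source.hoodieincr.path=s3://datalake-hudi-test-buckets/bronze/orders" \
  --hoodie-conf "hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy=READ_UPTO_LATEST_COMMIT" \
  --hoodie-conf "hoodie.deltastreamer.transformer.sql=SELECT o.*, c.name, c.email FROM <SRC> o JOIN customers c ON o.customer_id = c.customer_id"
```

On the next run, DeltaStreamer resumes from the checkpoint stored in the previous commit's metadata, so only orders written to the bronze table since then are read, joined with customers, and upserted into the silver table.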


Hands-on Labs with Exercise Files

https://github.com/soumilshah1995/DeltaHudiTransformations/blob/main/README.md

Conclusion

Incremental ETL plays a pivotal role in modern data engineering, enabling organizations to efficiently manage and analyze large volumes of data. By leveraging tools like DeltaStreamer and SQL-Based Transformer, businesses can implement robust Incremental ETL pipelines tailored to their specific needs. As demonstrated in this blog post, Incremental ETL with DeltaStreamer empowers organizations to build denormalized Hudi tables seamlessly, facilitating real-time analytics and decision-making.

Incorporating Incremental ETL practices, as exemplified by DeltaStreamer, is not just a technical necessity but a strategic imperative for businesses striving to harness the full potential of their data.


References

https://www.uber.com/en-US/blog/ubers-lakehouse-architecture/

https://www.youtube.com/watch?v=PLYgUOzTnJ8

https://www.dhirubhai.net/pulse/efficiently-managing-ride-late-arriving-tips-data-incremental-shah/



