Mastering Incremental ETL with DeltaStreamer and SQL-Based Transformer

In the realm of data engineering, Incremental Extract, Transform, Load (ETL) processes have become indispensable for businesses dealing with large volumes of data. These processes enable organizations to efficiently handle data updates and additions without reprocessing entire datasets. In this blog post, we'll walk through Incremental ETL using DeltaStreamer and the SQL-Based Transformer, focusing on joining Hudi tables with other Glue tables to build denormalized Hudi tables.

Understanding DeltaStreamer and Its Ease of Setup

DeltaStreamer is a powerful tool that simplifies the process of ingesting data into Hudi tables from various sources such as Parquet, JSON, CSV, Kafka, etc. Its ease of setup and flexibility make it an ideal choice for implementing Incremental ETL pipelines. With DeltaStreamer, developers can effortlessly configure ingestion pipelines tailored to their specific use cases.
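To give a feel for that ease of setup, here is a minimal, hedged sketch of a DeltaStreamer job that ingests raw Parquet files into a Hudi table. The paths, field names, and bundle version are placeholders rather than values from the labs below; the entry class shown is the Hudi 0.14.x name (older releases use org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer).

```bash
# Minimal sketch: ingest raw Parquet files from S3 into a COPY_ON_WRITE Hudi table.
# All bucket names, paths, and field names below are illustrative placeholders.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  /path/to/hudi-utilities-bundle_2.12-0.14.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field updated_at \
  --target-base-path s3://<your-bucket>/bronze/customers \
  --target-table customers \
  --hoodie-conf "hoodie.datasource.write.recordkey.field=customer_id" \
  --hoodie-conf "hoodie.datasource.write.precombine.field=updated_at" \
  --hoodie-conf "hoodie.streamer.source.dfs.root=s3://<your-bucket>/raw/customers/"
```

Pointing --source-class at a different source class (JsonDFSSource, CsvDFSSource, a Kafka source, and so on) is usually all it takes to switch input formats, which is what makes the tool quick to wire up.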

The Significance of Incremental ETL

Before diving into the technical details, it's crucial to understand the importance of Incremental ETL. Traditional ETL processes involve extracting, transforming, and loading entire datasets, which can be time-consuming and resource-intensive, especially for large datasets. Incremental ETL, on the other hand, only processes the changes or additions to the data, significantly reducing processing time and resource consumption. This approach is particularly beneficial for real-time analytics, where timely insights are paramount.

Case Study: Uber's Lakehouse Architecture

A notable example of the importance of Incremental ETL can be found in Uber's Lakehouse architecture. Uber leverages Incremental ETL processes to efficiently manage and analyze vast amounts of data generated by its ride-sharing platform. By implementing Incremental ETL pipelines, Uber can continuously update its data lake with minimal latency, enabling real-time analytics and decision-making.

Building Denormalized Hudi Tables with DeltaStreamer

To illustrate the concept of Incremental ETL with DeltaStreamer and the SQL-Based Transformer, let's consider a hypothetical scenario involving two tables: customers and orders.


Video Guide with Labs

https://www.youtube.com/watch?v=Jh81BOGjUY0


Use the code in the repository below to create the two sample Hudi tables (customers and orders), with Glue as the Hive metastore:

https://github.com/soumilshah1995/DeltaHudiTransformations
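The repository above contains the complete setup. Purely as an illustration of what it does, and not the repository's exact code, two such bronze tables could be created through Spark SQL's Hudi DDL support along these lines; the column names, S3 locations, Glue settings, and bundle versions are assumptions:

```bash
# Hypothetical sketch: create two sample Hudi tables with Glue as the Hive
# metastore, using the Spark SQL CLI and Hudi's SQL DDL support.
spark-sql \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
  -e "
    -- assumed columns and locations; the real lab data may differ
    CREATE TABLE IF NOT EXISTS customers (
      customer_id STRING,
      name        STRING,
      email       STRING,
      created_at  TIMESTAMP
    ) USING hudi
    LOCATION 's3://<your-bucket>/bronze/customers/'
    TBLPROPERTIES (primaryKey = 'customer_id', preCombineField = 'created_at');

    CREATE TABLE IF NOT EXISTS orders (
      order_id    STRING,
      customer_id STRING,
      amount      DOUBLE,
      order_date  TIMESTAMP
    ) USING hudi
    LOCATION 's3://<your-bucket>/bronze/orders/'
    TBLPROPERTIES (primaryKey = 'order_id', preCombineField = 'order_date');
  "
```

On EMR, the metastore client factory setting is what points Hive metastore calls at the Glue Data Catalog, so the tables show up as Glue tables.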

In this scenario, we aim to incrementally fetch orders and join them with customer data to build denormalized Hudi tables.

Implementation Using DeltaStreamer

How the data will be joined

The join is driven by a handful of DeltaStreamer command-line options and Hudi configurations, described below; a sketch of the full spark-submit command follows this walkthrough.

  1. --source-class: Specifies the source class responsible for providing data to the DeltaStreamer. In this case, we're using org.apache.hudi.utilities.sources.HoodieIncrSource, which indicates that we're fetching data from a Hudi table incrementally. This means that DeltaStreamer will only process new or updated records since the last checkpoint.
  2. --source-ordering-field: Defines the field used for ordering the data during ingestion. In this example, the data will be ordered based on the order_date field.
  3. --target-base-path: Specifies the base path where the output data will be stored. Here, the output will be stored in the s3://datalake-hudi-test-buckets/silver/ directory.
  4. --target-table: Defines the target table name where the data will be stored. In this case, the data will be stored in the orders table.
  5. --transformer-class: Specifies the transformer class responsible for transforming the data before writing it to the target table. Here, we're using org.apache.hudi.utilities.transform.SqlQueryBasedTransformer, indicating that the data will be transformed using an SQL query.
  6. --table-type: Specifies the type of table to be created. In this case, COPY_ON_WRITE indicates that a copy-on-write Hudi table will be created.
  7. --hoodie-conf: Passes individual Hudi configuration properties as key=value pairs; the key settings used in this pipeline are described below.


hoodie.streamer.source.hoodieincr.path: Specifies the path to the Hudi table from which incremental data will be fetched. This path points to the Hudi bronze table where new data is ingested.

hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy: Specifies the strategy for handling a missing checkpoint. Here, READ_UPTO_LATEST_COMMIT means that when no checkpoint is found, the streamer reads the source table up to its latest commit.

hoodie.deltastreamer.transformer.sql: Specifies the SQL query used to transform the data before it is written to the target table. The query joins the incoming data with the customers table on customer_id; within the query, the incoming batch is referenced through the <SRC> placeholder, which DeltaStreamer substitutes at runtime.
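Putting the options and configurations above together, the full job looks roughly like the sketch below. The bronze-table path, record key field, joined column names, and bundle version are assumptions for illustration rather than the exact lab values, and the Glue metastore setting assumes the job's Spark session can resolve the Glue-registered customers table.

```bash
# Hedged sketch of the incremental join job described above (Hudi 0.14.x names).
# It reads only new commits from the bronze orders table, joins them with the
# customers table in the Glue catalog, and upserts the result into the silver zone.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
  /path/to/hudi-utilities-bundle_2.12-0.14.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.HoodieIncrSource \
  --source-ordering-field order_date \
  --target-base-path s3://datalake-hudi-test-buckets/silver/orders \
  --target-table orders \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf "hoodie.datasource.write.recordkey.field=order_id" \
  --hoodie-conf "hoodie.datasource.write.precombine.field=order_date" \
  --hoodie-conf "hoodie.streamer.source.hoodieincr.path=s3://datalake-hudi-test-buckets/bronze/orders" \
  --hoodie-conf "hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy=READ_UPTO_LATEST_COMMIT" \
  --hoodie-conf "hoodie.deltastreamer.transformer.sql=SELECT o.*, c.name, c.email FROM <SRC> o JOIN customers c ON o.customer_id = c.customer_id"
```

On the next run, DeltaStreamer resumes from the checkpoint stored in the previous commit's metadata, so only orders written to the bronze table since then are read, joined with customers, and upserted into the silver table.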


Hands-on Labs with Exercise Files

https://github.com/soumilshah1995/DeltaHudiTransformations/blob/main/README.md

Conclusion

Incremental ETL plays a pivotal role in modern data engineering, enabling organizations to efficiently manage and analyze large volumes of data. By leveraging tools like DeltaStreamer and SQL-Based Transformer, businesses can implement robust Incremental ETL pipelines tailored to their specific needs. As demonstrated in this blog post, Incremental ETL with DeltaStreamer empowers organizations to build denormalized Hudi tables seamlessly, facilitating real-time analytics and decision-making.

Incorporating Incremental ETL practices, as exemplified by DeltaStreamer, is not just a technical necessity but a strategic imperative for businesses striving to harness the full potential of their data.


References

https://www.uber.com/en-US/blog/ubers-lakehouse-architecture/

https://www.youtube.com/watch?v=PLYgUOzTnJ8

https://www.dhirubhai.net/pulse/efficiently-managing-ride-late-arriving-tips-data-incremental-shah/



