
Learn How to Use Hudi DeltaStreamer with Hudi 0.14 on AWS Glue: A Seamless Data Ingestion Guide

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Apache Hudi Specialist | Spark & AWS Glue| Data Lake Specialist | YouTuber

Published: February 29, 2024

In today's data-driven world, the ability to efficiently ingest and process data from various sources is paramount for businesses aiming to stay competitive. Apache Hudi (Hadoop Upserts Deletes and Incrementals) has emerged as a powerful tool for managing large datasets with incremental data ingestion capabilities. With the release of Hudi 0.14, coupled with AWS Glue's robust data transformation and ingestion capabilities, leveraging the Hudi DeltaStreamer becomes even more seamless.

Understanding the Power of DeltaStreamer

DeltaStreamer, a key component of Apache Hudi, simplifies the process of ingesting data into Hudi tables. Its ability to efficiently handle real-time data streams and perform upserts, deletes, and incremental data ingestion makes it a valuable asset for data engineers and architects. By leveraging DeltaStreamer, organizations can streamline their data pipelines and ensure data consistency and reliability.
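
To make the idea of an upsert concrete, here is a small conceptual sketch in plain Scala. It is not Hudi's implementation and uses no Hudi APIs; the `Record` type and its field names are hypothetical. It only illustrates the rule that, for each record key, the version with the highest ordering value is kept.

```scala
// Conceptual sketch only: this is NOT how Hudi is implemented internally.
// Record, recordKey and orderingValue are hypothetical names for illustration.
final case class Record(recordKey: String, orderingValue: Long, payload: String)

object UpsertSketch {
  // Merge an incoming batch into the existing rows.
  // For each record key, the version with the highest ordering value wins,
  // so new keys are inserted and existing keys are updated.
  def upsert(existing: Seq[Record], incoming: Seq[Record]): Map[String, Record] =
    (existing ++ incoming)
      .groupBy(_.recordKey)
      .map { case (key, versions) => key -> versions.maxBy(_.orderingValue) }

  def main(args: Array[String]): Unit = {
    val table = Seq(Record("inv-1", 100L, "draft"), Record("inv-2", 100L, "draft"))
    val batch = Seq(Record("inv-1", 200L, "paid"), Record("inv-3", 150L, "draft"))
    // Print the merged rows ordered by key.
    upsert(table, batch).toSeq.sortBy(_._1).foreach { case (_, r) => println(r) }
  }
}
```

In this sketch `inv-1` is updated, `inv-2` is left unchanged, and `inv-3` is inserted, which is the behavior the ordering field is meant to guarantee during ingestion.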

Setting Up the Environment

Before diving into the intricacies of using Hudi DeltaStreamer with AWS Glue, let's ensure our environment is set up correctly. Follow these steps to get started:

Step 1: Download Dataset and Upload to S3

Grab the dataset from the provided link and upload it to your desired S3 bucket. This dataset will serve as our sample data for ingestion.

https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link

Step 2: Download the Required Jar Files

Ensure you have the following JAR files, which are required to run Hudi DeltaStreamer:

jcommander-1.78.jar

hudi-spark3.3-bundle_2.12-0.14.0.jar

hudi-utilities-slim-bundle_2.12-0.14.0.jar

Download the JAR files:

https://drive.google.com/drive/folders/1Rs9243i-D-jmFHPivlwdtODBRQpG1nys?usp=share_link

Step 3: Upload the Code to AWS Glue

Create a Scala job in AWS Glue and use the provided Scala code as its script. This code initializes the DeltaStreamer and orchestrates the ingestion process.

Make sure to add the S3 paths of the JAR files from Step 2 to the job configuration.
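
As a rough illustration, the JAR paths can be supplied through the AWS Glue `--extra-jars` job parameter, which takes a comma-separated list of S3 paths. The sketch below only builds that value in plain Scala; the bucket name, the prefix, and the `GlueApp` entry-point name are assumptions for illustration, not values from this guide.

```scala
object GlueJobParams {
  // JAR file names listed in Step 2.
  val jarNames: Seq[String] = Seq(
    "jcommander-1.78.jar",
    "hudi-spark3.3-bundle_2.12-0.14.0.jar",
    "hudi-utilities-slim-bundle_2.12-0.14.0.jar"
  )

  // Build the comma-separated value expected by the --extra-jars parameter.
  def extraJars(s3Prefix: String): String =
    jarNames.map(name => s"${s3Prefix.stripSuffix("/")}/$name").mkString(",")

  def main(args: Array[String]): Unit = {
    val params = Seq(
      "--job-language" -> "scala",
      "--class"        -> "GlueApp",                          // hypothetical entry-point object
      "--extra-jars"   -> extraJars("s3://my-bucket/jars/")   // hypothetical bucket and prefix
    )
    // Print each parameter and its value, one per line.
    params.foreach { case (key, value) => println(s"$key $value") }
  }
}
```

The same value can be pasted into the job's dependent JARs path field in the Glue console if you prefer not to set job parameters by hand.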

Ingesting Data with Hudi DeltaStreamer on AWS Glue

The provided Scala code snippet initializes the DeltaStreamer and configures it to ingest data from a specified source to an Apache Hudi table. Here's a brief overview of the key components:

Source Configuration: Specify the source class and ordering field to ensure data consistency during ingestion.
Target Configuration: Define the target base path, table name, and table type (e.g., COPY_ON_WRITE).
Hudi Configuration: Set up Hudi-specific configurations, such as key generator, record key field, partition path field, and source DFS root.
Spark Context Initialization: Initialize the Spark context for executing the data ingestion job.
Glue Context Setup: Set up the Glue context for seamless integration with AWS Glue services.
DeltaStreamer Execution: Execute the DeltaStreamer to synchronize the data from the source to the target Hudi table.
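
The configuration pieces listed above can be sketched as a DeltaStreamer argument list. This is a minimal, dependency-free sketch and not the script from the repository: it assumes the source files are Parquet (hence `ParquetDFSSource`) and a simple key generator, and every path, table name, and field name passed in `main` is a hypothetical placeholder.

```scala
object DeltaStreamerArgs {
  // Assemble DeltaStreamer command-line style arguments.
  // The source class and key generator below are assumptions (Parquet input, single-field key).
  def build(
      sourceRoot: String,
      targetBasePath: String,
      tableName: String,
      recordKey: String,
      partitionField: String,
      orderingField: String
  ): Array[String] = {
    // Hudi-specific settings, each passed as its own --hoodie-conf key=value pair.
    val hoodieConf = Seq(
      "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator",
      s"hoodie.datasource.write.recordkey.field=$recordKey",
      s"hoodie.datasource.write.partitionpath.field=$partitionField",
      s"hoodie.deltastreamer.source.dfs.root=$sourceRoot"
    )
    Array(
      "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
      "--source-ordering-field", orderingField,
      "--target-base-path", targetBasePath,
      "--target-table", tableName,
      "--table-type", "COPY_ON_WRITE"
    ) ++ hoodieConf.flatMap(kv => Seq("--hoodie-conf", kv))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical placeholder values; replace them with your own bucket, table and columns.
    val cfg = build(
      sourceRoot = "s3://my-bucket/raw/invoices/",
      targetBasePath = "s3://my-bucket/hudi/invoices/",
      tableName = "invoices",
      recordKey = "invoice_id",
      partitionField = "destination_state",
      orderingField = "updated_at"
    )
    // Print flag/value pairs, one pair per line.
    cfg.grouped(2).foreach(pair => println(pair.mkString(" ")))
  }
}
```

Inside the Glue job, an array like this would be handed to the DeltaStreamer entry point shipped in the hudi-utilities bundle, after which the Spark and Glue contexts are created and the sync is run; the exact wiring is in the repository linked below.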

GitHub: https://github.com/soumilshah1995/deltastreamer-on-glue

Conclusion

Incorporating Hudi DeltaStreamer with Hudi 0.14 on AWS Glue provides a streamlined approach to ingesting and managing large datasets. By following the steps outlined in this guide, data engineers and architects can harness the power of DeltaStreamer to ensure efficient and reliable data ingestion from various sources. Embrace the capabilities of Apache Hudi and AWS Glue to propel your data pipelines towards success.

References

With the seamless integration of Hudi DeltaStreamer with AWS Glue, the possibilities for efficient data management and analytics are endless. Start leveraging these powerful tools today to unlock the full potential of your data infrastructure. Happy ingesting!

Special Thanks

Thank you, Aditya Goenkar, for all your help.

Note:

If you're considering utilizing Hudi DeltaStreamer via the Glue connector, it's worth noting that an AWS Blogs post referenced an older version of Hudi. For a smoother experience, I recommend following a guide that incorporates the latest version of Hudi.

Want to Master Apache Hudi DeltaStreamer?

A complete series of 15+ videos with hands-on guides, all for free:

https://www.youtube.com/watch?v=s42-mGktIpg&list=PLL2hlSFBmWwz2lp7K8dMk8hWV1p_SNeBQ
