Understanding Apache Hudi's MERGE INTO Command with Minio and HiveMetaStore

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Apache Hudi Specialist | Spark & AWS Glue| Data Lake Specialist | YouTuber

Published: July 31, 2024

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a data management framework for handling large-scale datasets on Hadoop-compatible file systems and object stores such as Amazon S3. One of its most useful features is the ability to perform upserts and incremental data processing efficiently. In this blog post, we will explore the MERGE INTO command in Apache Hudi, which is particularly useful for Change Data Capture (CDC) scenarios.

What is MERGE INTO?

The MERGE INTO command in Apache Hudi is used to merge data from a source table or view into a target Hudi table. This command allows you to handle different types of changes in the source data, such as inserts, updates, and deletes, and apply them to the target table. This is crucial for maintaining an up-to-date and accurate dataset, especially in scenarios where data is constantly changing.
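
To make the three kinds of change concrete, here is a minimal plain-Python sketch of the semantics described above. It is only a model of the behavior, not how Hudi implements the command: the `id` key, the `name` column, and the `op` flag (`I`, `U`, `D`) are hypothetical names chosen for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_into(target: Dict[int, Row], source: List[Row]) -> Dict[int, Row]:
    """Apply source change rows to a target keyed on the hypothetical `id` column."""
    merged = dict(target)  # work on a copy so the input is left untouched
    for row in source:
        key = row["id"]
        data = {k: v for k, v in row.items() if k != "op"}  # drop the CDC flag
        if key in merged:
            if row["op"] == "D":
                # WHEN MATCHED AND op = 'D' THEN DELETE
                del merged[key]
            else:
                # WHEN MATCHED THEN UPDATE
                merged[key] = data
        elif row["op"] != "D":
            # WHEN NOT MATCHED THEN INSERT
            merged[key] = data
    return merged


target = {1: {"id": 1, "name": "Alice"}, 2: {"id": 2, "name": "Bob"}}
changes = [
    {"id": 1, "name": "Alice Smith", "op": "U"},  # update an existing key
    {"id": 2, "name": "Bob", "op": "D"},          # delete an existing key
    {"id": 3, "name": "Carol", "op": "I"},        # insert a new key
]
result = merge_into(target, changes)
print(result)
```

After the merge, key 1 carries the updated name, key 2 is gone, and key 3 has been added, which is the outcome a MERGE INTO with update, delete, and insert clauses is meant to produce on the target table.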

When to Use MERGE INTO?

You should use the MERGE INTO command in the following scenarios:

Change Data Capture (CDC): When you need to capture and apply changes (inserts, updates, and deletes) from a source system to a target table.
Data Warehousing / Lakehouse: When you need to keep your data warehouse or lakehouse tables up to date with changes from source systems.
Incremental Data Processing: When you need to process and merge new data into existing datasets efficiently, without rewriting the entire dataset.

Labs

Step 1: Spin up Stack

The Docker Compose file is available at the link below:

https://github.com/soumilshah1995/hudi-mergeinto-labs/blob/main/README.md

Step 2: Start the Spark SQL Shell
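
The exact launch command for this lab is in the README linked above. As a rough sketch of what such a command usually involves, the helper below assembles a `spark-sql` invocation with the Hudi session settings plus Minio (S3A) and Hive Metastore options. The package versions, endpoint, credentials, and metastore URI are all assumptions for illustration, so substitute the values from your own stack.

```python
import shlex
from typing import Dict, List


def spark_sql_command(packages: List[str], conf: Dict[str, str]) -> List[str]:
    """Build the argument list for launching the spark-sql shell."""
    cmd = ["spark-sql", "--packages", ",".join(packages)]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    return cmd


# Assumed versions: match the bundle to your Spark and Scala build.
packages = [
    "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0",
    "org.apache.hadoop:hadoop-aws:3.3.4",
]

conf = {
    # Hudi settings for Spark SQL
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "spark.kryo.registrator": "org.apache.spark.HoodieSparkKryoRegistrar",
    # Minio through the S3A connector (assumed endpoint and credentials)
    "spark.hadoop.fs.s3a.endpoint": "http://localhost:9000",
    "spark.hadoop.fs.s3a.access.key": "admin",
    "spark.hadoop.fs.s3a.secret.key": "password",
    "spark.hadoop.fs.s3a.path.style.access": "true",
    # Hive Metastore (assumed URI)
    "spark.hadoop.hive.metastore.uris": "thrift://localhost:9083",
}

command = spark_sql_command(packages, conf)
print(shlex.join(command))
```

The printed line can be pasted into a terminal that has Spark on its path.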

Creating the Target Hudi Table: First, we create a Hudi table named customer_target. This table will store customer information.
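
The original DDL lives in the lab README. The sketch below generates a comparable `CREATE TABLE ... USING hudi` statement that you could paste into the Spark SQL shell. Only the table name `customer_target` comes from this post. The columns, the choice of `id` as the record key and `ts` as the precombine field, the copy-on-write table type, and the bucket path are assumptions.

```python
from typing import Dict


def create_hudi_table_ddl(table: str, columns: Dict[str, str], primary_key: str,
                          precombine_field: str, location: str) -> str:
    """Build a CREATE TABLE statement for a copy-on-write Hudi table."""
    column_sql = ",\n".join(f"  {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n{column_sql}\n) USING hudi\n"
        f"TBLPROPERTIES (\n"
        f"  type = 'cow',\n"
        f"  primaryKey = '{primary_key}',\n"
        f"  preCombineField = '{precombine_field}'\n"
        f")\n"
        f"LOCATION '{location}'"
    )


# Hypothetical customer schema and Minio bucket path.
ddl = create_hudi_table_ddl(
    table="customer_target",
    columns={"id": "INT", "name": "STRING", "email": "STRING", "ts": "TIMESTAMP"},
    primary_key="id",
    precombine_field="ts",
    location="s3a://warehouse/customer_target",
)
print(ddl)
```

The `primaryKey` property is what MERGE INTO later matches on, and `preCombineField` tells Hudi which column decides the winner when two records share the same key.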

Create Mock Source Data

Performing the Initial Merge: We use the MERGE INTO command to merge the initial CDC data into the customer_target table.
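
A merge of this kind can be sketched as follows. The helper prints a MERGE INTO statement that deletes, updates, or inserts depending on a CDC flag. The source name `customer_source`, the column list, and the `op` flag with `'D'` marking deletes are hypothetical, so check the lab README for the statement the author actually runs.

```python
from typing import List


def merge_statement(target: str, source: str, key: str,
                    columns: List[str], op_column: str = "op") -> str:
    """Build a MERGE INTO statement that handles deletes, updates, and inserts."""
    set_clause = ", ".join(f"{c} = s.{c}" for c in columns)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND s.{op_column} = 'D' THEN DELETE\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED AND s.{op_column} != 'D' THEN "
        f"INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


stmt = merge_statement(
    target="customer_target",
    source="customer_source",  # hypothetical source table or view
    key="id",
    columns=["id", "name", "email", "ts"],  # hypothetical columns
)
print(stmt)
```

The same statement can be rerun unchanged each time a new batch of changes lands in the source, because the ON clause decides per row whether a change becomes an update, a delete, or an insert.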

New Data Arriving

Performing the Merge with Updated CDC Data: We use the MERGE INTO command again to merge the updated CDC data into the customer_target table.

Verify

By following these steps, you can effectively use the MERGE INTO command in Apache Hudi to manage your datasets efficiently, ensuring that your target tables are always up-to-date with the latest changes from your source data.

Exercise Labs

https://github.com/soumilshah1995/hudi-mergeinto-labs/blob/main/README.md

Conclusion

In conclusion, Apache Hudi's MERGE INTO command is a powerful tool for handling CDC scenarios, making it easier to maintain accurate and up-to-date datasets in your data lake. Whether you are dealing with data warehousing, incremental data processing, or real-time analytics, MERGE INTO can help you achieve your goals efficiently.
