Building a Transactional Data Lake with Hudi and Glue PySpark (Insert | Read | Write | Update | Time Travel | Snapshots | Schema Evolution | Incremental Query)
What is Apache Hudi?
Apache Hudi (pronounced “hoodie”) is the next generation streaming data lake platform. Apache Hudi brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency all while keeping your data in open source file formats.
Not only is Apache Hudi great for streaming workloads, but it also allows you to create efficient incremental batch pipelines. Read the docs for more use case descriptions, and check out who's using Hudi to see how some of the largest data lakes in the world, including Uber, Amazon, ByteDance, Robinhood, and more, are transforming their production data lakes with Hudi.
Apache Hudi can easily be used on any cloud storage platform. Hudi's advanced performance optimizations make analytical workloads faster with any of the popular query engines, including Apache Spark, Flink, Presto, Trino, and Hive.
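Since efficient upserts are Hudi's headline feature, here is a minimal PySpark sketch of what an upsert looks like. The table name, schema, and S3 path below are illustrative placeholders, not part of this lab:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         # Hudi requires Kryo serialization
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Minimal Hudi write options: the record key identifies a row,
# and the precombine field picks the latest version on conflict.
hudi_options = {
    "hoodie.table.name": "employees",
    "hoodie.datasource.write.recordkey.field": "emp_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame(
    [(1, "John", "Sales", 90000, 1671658829)],
    ["emp_id", "employee_name", "department", "salary", "ts"],
)

# mode("append") with operation=upsert updates existing keys in place
df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://your-bucket/hudi/employees"  # hypothetical path
)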
Video-Based Tutorials
Labs
Step 1: Create Glue Connector
Create the Apache Hudi connection using the AWS Glue Custom Connector. To create your AWS Glue job with an AWS Glue Custom Connector, complete the following steps:
Go to the AWS Glue Studio Console, search for AWS Glue Connector for Apache Hudi, and choose the AWS Glue Connector for Apache Hudi link.
Choose Continue to Subscribe.
Review the Terms and Conditions and choose the Accept Terms button to continue.
Make sure that the subscription is complete and you see the Effective date populated next to the product, then choose the Continue to Configuration button.
As of writing this blog, 0.10.1 is the latest version of the AWS Glue Connector for Apache Hudi. Make sure that 0.10.1 is selected in the Software Version dropdown and Activate in AWS Glue Studio is selected in the Delivery Method dropdown. Choose the Continue to Launch button.
Under Launch this software, choose Usage Instructions and then choose Activate the Glue connector for Apache Hudi in AWS Glue Studio.
You’re redirected to AWS Glue Studio.
For Name, enter a name for your connection (for example, hudi-connection).
For Description, enter a description.
Choose Create connection and activate connector.
A message appears that the connection was successfully created, and the connection is now visible on the AWS Glue Studio console.
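With the connection activated, a Glue Studio notebook can attach it before the session starts. A rough sketch of the first cells, assuming the connection was named hudi-connection as above (the magics follow AWS Glue interactive sessions):

# Cell 1: Glue interactive-session magics, run before any Spark code
%connections hudi-connection
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2

# Cell 2: Spark settings Hudi generally needs on Glue
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.sql.hive.convertMetastoreParquet", "false")
         .getOrCreate())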
Step 2: Download and upload the Glue Notebook
Step 3: Explore the Notebook by executing the cells
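At a high level, the cells cover the operations promised in the title. A condensed sketch, reusing the spark session from the previous step (the S3 path and commit time come from the table DDL shown below; option names may vary slightly across Hudi versions):

path = "s3://soumil-dms-learn/hudi/hudi_table"

# Snapshot read: the latest committed state of the table
df = spark.read.format("hudi").load(path)
df.select("emp_id", "employee_name", "salary").show()

# Update: upsert a new version of an existing record
updates = spark.createDataFrame(
    [(1, "John", "Sales", "NY", 95000, 31, 6000, 1671658900)],
    ["emp_id", "employee_name", "department", "state",
     "salary", "age", "bonus", "ts"],
)
(updates.write.format("hudi")
    .option("hoodie.table.name", "hudi_table")
    .option("hoodie.datasource.write.recordkey.field", "emp_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(path))

# Time travel: read the table as of an earlier commit
(spark.read.format("hudi")
    .option("as.of.instant", "20221221215109388")
    .load(path)
    .show())

# Incremental query: only records that changed after a given commit
(spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20221221215109388")
    .load(path)
    .show())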
Changing the Table Name
To register the same Hudi data under a new table name, start from the existing DDL (the output of SHOW CREATE TABLE in Athena):
CREATE EXTERNAL TABLE `hudi_table`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `emp_id` bigint COMMENT '',
  `employee_name` string COMMENT '',
  `department` string COMMENT '',
  `state` string COMMENT '',
  `salary` bigint COMMENT '',
  `age` bigint COMMENT '',
  `bonus` bigint COMMENT '',
  `ts` bigint COMMENT '',
  `newfield` bigint COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://soumil-dms-learn/hudi/hudi_table'
TBLPROPERTIES (
  'last_commit_time_sync'='20221221215109388',
  'numFiles'='3',
  'totalSize'='1308895',
  'transient_lastDdlTime'='1671658829')
Remove the 'numFiles'='3' and 'totalSize'='1308895' lines from TBLPROPERTIES and change the table name.
The query now looks like this:
CREATE EXTERNAL TABLE `new_hudi_table`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `emp_id` bigint COMMENT '',
  `employee_name` string COMMENT '',
  `department` string COMMENT '',
  `state` string COMMENT '',
  `salary` bigint COMMENT '',
  `age` bigint COMMENT '',
  `bonus` bigint COMMENT '',
  `ts` bigint COMMENT '',
  `newfield` bigint COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://soumil-dms-learn/hudi/hudi_table'
TBLPROPERTIES (
  'last_commit_time_sync'='20221221215109388',
  'transient_lastDdlTime'='1671658829')
Note that the LOCATION still points to the same S3 path, so both table definitions read the same underlying Hudi files.
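As a quick sanity check, here is a sketch of querying the renamed table from the notebook, assuming the Glue Data Catalog is wired up as the metastore (in Athena a plain SELECT works the same way):

# Both DDLs point at the same LOCATION, so the new table name
# should return the same rows as the original hudi_table.
spark.sql("SELECT emp_id, employee_name, state FROM new_hudi_table LIMIT 5").show()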
Learn More About Hudi