Building a Transactional Data Lake with Hudi and Glue PySpark (Insert | Read | Write | Update | Time Travel | Snapshots | Schema Evolution | Incremental Query)


What is Apache Hudi?

Apache Hudi (pronounced “hoodie”) is the next generation streaming data lake platform. Apache Hudi brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency all while keeping your data in open source file formats.

Not only is Apache Hudi great for streaming workloads, but it also allows you to create efficient incremental batch pipelines. Read the docs for more use case descriptions and check out who's using Hudi, to see how some of the largest data lakes in the world including Uber, Amazon, ByteDance, Robinhood and more are transforming their production data lakes with Hudi.

Apache Hudi can easily be used on any cloud storage platform. Hudi’s advanced performance optimizations make analytical workloads faster with any of the popular query engines, including Apache Spark, Flink, Presto, Trino, and Hive.
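The upsert path is the heart of what Hudi adds on top of plain Parquet files. As a minimal sketch (pure Python, so it runs without Spark; the field names `emp_id` and `ts` come from the employee table used later in this article), this is the kind of option map you would pass to a `df.write.format("hudi")` call in a Glue PySpark job:

```python
def hudi_write_options(table_name, record_key, precombine_field, operation="upsert"):
    """Build the option map for a Hudi write, intended for use as:
    df.write.format("hudi").options(**opts).mode("append").save(path)

    operation can be "upsert", "insert", "bulk_insert", or "delete";
    the precombine field decides which record wins when two updates
    share the same record key.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }

opts = hudi_write_options("hudi_table", "emp_id", "ts")
```

In a real Glue job these options are passed together with the connector connection created in Step 1; the exact write call depends on how the notebook sets up the Spark session.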

Video-Based Tutorials


Labs

Step 1: Create Glue Connector

We create the Apache Hudi connection using an AWS Glue Custom Connector. To create your AWS Glue job with an AWS Glue Custom Connector, complete the following steps:

Go to the AWS Glue Studio console, search for AWS Glue Connector for Apache Hudi, and choose the AWS Glue Connector for Apache Hudi link.


Choose Continue to Subscribe.


Review the Terms and Conditions and choose the Accept Terms button to continue.


Make sure that the subscription is complete and you see the Effective date populated next to the product, then choose the Continue to Configuration button.


As of this writing, 0.10.1 is the latest version of the AWS Glue Connector for Apache Hudi. Make sure that 0.10.1 is selected in the Software Version dropdown and Activate in AWS Glue Studio is selected in the Delivery Method dropdown, then choose the Continue to Launch button.


Under Launch this software, choose Usage Instructions and then choose Activate the Glue connector for Apache Hudi in AWS Glue Studio.


You’re redirected to AWS Glue Studio.

For Name, enter a name for your connection (for example, hudi-connection).

For Description, enter a description.


Choose Create connection and activate connector.

A message appears that the connection was successfully created, and the connection is now visible on the AWS Glue Studio console.
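Beyond checking the console, the new connection can also be verified programmatically. A minimal sketch (the helper name is ours; the client is passed in as a parameter so the function can be exercised with a stub, but in practice you would pass `boto3.client("glue")`, whose `get_connections()` call returns a `ConnectionList`):

```python
def connection_exists(glue_client, name):
    """Return True if a Glue connection with the given name exists.

    glue_client is any object exposing get_connections() with the
    same response shape as the boto3 Glue client, e.g.
    boto3.client("glue"). Only the first page of results is checked.
    """
    resp = glue_client.get_connections()
    return any(c.get("Name") == name for c in resp.get("ConnectionList", []))
```

For accounts with many connections you would also follow the `NextToken` pagination, which is omitted here for brevity.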


Step 2: Download and Upload the Glue Notebook

Notebook link: https://github.com/soumilshah1995/Insert-Update-Read-Write-SnapShot-Time-Travel-incremental-Query-on-APache-Hudi-transacti/archive/refs/heads/main.zip



Step 3: Explore the Notebooks

Explore the notebooks by executing the cells.

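The notebook cells above walk through snapshot reads, incremental queries, and time travel. The difference between the three comes down to the options passed to `spark.read.format("hudi")`. A pure-Python sketch of the three option maps (the commit timestamp below is the one from this article's table properties, used only as a placeholder):

```python
def hudi_read_options(mode, instant=None):
    """Option maps for spark.read.format("hudi").options(**opts).load(path).

    mode:
      "snapshot"    - latest view of the table (the default query type)
      "incremental" - only records committed after `instant`
      "time_travel" - the table as it looked at `instant`
    """
    if mode == "snapshot":
        return {"hoodie.datasource.query.type": "snapshot"}
    if mode == "incremental":
        return {
            "hoodie.datasource.query.type": "incremental",
            "hoodie.datasource.read.begin.instanttime": instant,
        }
    if mode == "time_travel":
        # time travel is a snapshot read pinned to a past instant
        return {"as.of.instant": instant}
    raise ValueError(f"unknown mode: {mode}")

inc_opts = hudi_read_options("incremental", "20221221215109388")
```

Hudi commit instants are timestamps of the form `yyyyMMddHHmmssSSS`; passing an old instant as `begin.instanttime` is how the incremental cells in the notebook pull only the rows changed since that commit.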

Changing the Table Name



CREATE EXTERNAL TABLE `hudi_table`(
  `_hoodie_commit_time` string COMMENT '', 
  `_hoodie_commit_seqno` string COMMENT '', 
  `_hoodie_record_key` string COMMENT '', 
  `_hoodie_partition_path` string COMMENT '', 
  `_hoodie_file_name` string COMMENT '', 
  `emp_id` bigint COMMENT '', 
  `employee_name` string COMMENT '', 
  `department` string COMMENT '', 
  `state` string COMMENT '', 
  `salary` bigint COMMENT '', 
  `age` bigint COMMENT '', 
  `bonus` bigint COMMENT '', 
  `ts` bigint COMMENT '', 
  `newfield` bigint COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://soumil-dms-learn/hudi/hudi_table'
TBLPROPERTIES (
  'last_commit_time_sync'='20221221215109388', 
  'numFiles'='3', 
  'totalSize'='1308895', 
  'transient_lastDdlTime'='1671658829')

Remove the line

  'numFiles'='3',

change the table name, and the query then looks like this:


CREATE EXTERNAL TABLE `new_hudi_table`(
  `_hoodie_commit_time` string COMMENT '', 
  `_hoodie_commit_seqno` string COMMENT '', 
  `_hoodie_record_key` string COMMENT '', 
  `_hoodie_partition_path` string COMMENT '', 
  `_hoodie_file_name` string COMMENT '', 
  `emp_id` bigint COMMENT '', 
  `employee_name` string COMMENT '', 
  `department` string COMMENT '', 
  `state` string COMMENT '', 
  `salary` bigint COMMENT '', 
  `age` bigint COMMENT '', 
  `bonus` bigint COMMENT '', 
  `ts` bigint COMMENT '', 
  `newfield` bigint COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://soumil-dms-learn/hudi/hudi_table'
TBLPROPERTIES (
  'last_commit_time_sync'='20221221215109388', 
  'transient_lastDdlTime'='1671658829')

Note that the S3 path is still the same: only the table name in the catalog changes, not the underlying data.
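The two manual edits above (dropping the numFiles property and swapping the table name) can also be scripted if you rename tables often. A minimal sketch using only string handling (the function name is ours, not from the article):

```python
def rename_hudi_ddl(ddl, old_name, new_name):
    """Rewrite a SHOW CREATE TABLE statement: swap the backticked
    table name and drop the stale 'numFiles' table property, while
    leaving the LOCATION (and therefore the S3 data) untouched."""
    out = []
    for line in ddl.splitlines():
        if "'numFiles'" in line:
            continue  # drop the stale file-count property
        out.append(line.replace(f"`{old_name}`", f"`{new_name}`"))
    return "\n".join(out)
```

The rewritten statement can then be run as-is in Athena or Hive to register the renamed table.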




Learn More About Hudi




