Building a Transactional Data Lake with Hudi and Glue PySpark (Insert | Read | Write | Update | Time Travel | Snapshots | Schema Evolution | Incremental Query)
What is Apache Hudi?
Apache Hudi (pronounced “hoodie”) is the next generation streaming data lake platform. Apache Hudi brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency all while keeping your data in open source file formats.
Not only is Apache Hudi great for streaming workloads, but it also allows you to create efficient incremental batch pipelines. Read the docs for more use case descriptions, and check out who's using Hudi to see how some of the largest data lakes in the world, including Uber, Amazon, ByteDance, Robinhood, and more, are transforming their production data lakes with Hudi.
Apache Hudi can easily be used on any cloud storage platform. Hudi's advanced performance optimizations make analytical workloads faster with any of the popular query engines, including Apache Spark, Flink, Presto, Trino, and Hive.
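Since efficient upserts are Hudi's headline feature, here is a minimal PySpark sketch of what an upsert looks like. The table name, schema, and S3 path below are illustrative placeholders, not part of this lab:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         # Hudi requires Kryo serialization
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Minimal Hudi write options: the record key identifies a row,
# and the precombine field picks the latest version on conflict.
hudi_options = {
    "hoodie.table.name": "employees",
    "hoodie.datasource.write.recordkey.field": "emp_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame(
    [(1, "John", "Sales", 90000, 1671658829)],
    ["emp_id", "employee_name", "department", "salary", "ts"],
)

# mode("append") with operation=upsert updates existing keys in place
df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://your-bucket/hudi/employees"  # hypothetical path
)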
Video-Based Tutorials
Labs
Step 1: Create Glue Connector
Create the Apache Hudi connection using the AWS Glue Custom Connector. To create your AWS Glue job with an AWS Glue Custom Connector, complete the following steps:
Go to the AWS Glue Studio Console, search for AWS Glue Connector for Apache Hudi, and choose the AWS Glue Connector for Apache Hudi link.
Choose Continue to Subscribe.
Review the Terms and Conditions and choose the Accept Terms button to continue.
Make sure that the subscription is complete and you see the Effective date populated next to the product, then choose the Continue to Configuration button.
As of writing this blog, 0.10.1 is the latest version of the AWS Glue Connector for Apache Hudi. Make sure that 0.10.1 is selected in the Software Version dropdown and Activate in AWS Glue Studio is selected in the Delivery Method dropdown. Choose the Continue to Launch button.
Under Launch this software, choose Usage Instructions and then choose Activate the Glue connector for Apache Hudi in AWS Glue Studio.
You’re redirected to AWS Glue Studio.
For Name, enter a name for your connection (for example, hudi-connection).
For Description, enter a description.
Choose Create connection and activate connector.
A message appears that the connection was successfully created, and the connection is now visible on the AWS Glue Studio console.
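With the connection activated, a Glue Studio notebook can attach it before the session starts. A rough sketch of the first cells, assuming the connection was named hudi-connection as above (the magics follow AWS Glue interactive sessions):

# Cell 1: Glue interactive-session magics, run before any Spark code
%connections hudi-connection
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2

# Cell 2: Spark settings Hudi generally needs on Glue
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.sql.hive.convertMetastoreParquet", "false")
         .getOrCreate())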
Step 2: Download and upload the Glue Notebook
Step 3: Explore the Notebook by executing the cells
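At a high level, the cells cover the operations promised in the title. A condensed sketch, reusing the spark session from the previous step (the S3 path and commit time come from the table DDL shown below; option names may vary slightly across Hudi versions):

path = "s3://soumil-dms-learn/hudi/hudi_table"

# Snapshot read: the latest committed state of the table
df = spark.read.format("hudi").load(path)
df.select("emp_id", "employee_name", "salary").show()

# Update: upsert a new version of an existing record
updates = spark.createDataFrame(
    [(1, "John", "Sales", "NY", 95000, 31, 6000, 1671658900)],
    ["emp_id", "employee_name", "department", "state",
     "salary", "age", "bonus", "ts"],
)
(updates.write.format("hudi")
    .option("hoodie.table.name", "hudi_table")
    .option("hoodie.datasource.write.recordkey.field", "emp_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(path))

# Time travel: read the table as of an earlier commit
(spark.read.format("hudi")
    .option("as.of.instant", "20221221215109388")
    .load(path)
    .show())

# Incremental query: only records that changed after a given commit
(spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20221221215109388")
    .load(path)
    .show())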
Changing the Table Name
To register the same Hudi data under a new table name, start from the existing DDL (the output of SHOW CREATE TABLE in Athena):
CREATE EXTERNAL TABLE `hudi_table`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `emp_id` bigint COMMENT '',
  `employee_name` string COMMENT '',
  `department` string COMMENT '',
  `state` string COMMENT '',
  `salary` bigint COMMENT '',
  `age` bigint COMMENT '',
  `bonus` bigint COMMENT '',
  `ts` bigint COMMENT '',
  `newfield` bigint COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://soumil-dms-learn/hudi/hudi_table'
TBLPROPERTIES (
  'last_commit_time_sync'='20221221215109388',
  'numFiles'='3',
  'totalSize'='1308895',
  'transient_lastDdlTime'='1671658829')
Remove the 'numFiles'='3' and 'totalSize'='1308895' lines from TBLPROPERTIES and change the table name.
The query now looks like this:
CREATE EXTERNAL TABLE `new_hudi_table`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `emp_id` bigint COMMENT '',
  `employee_name` string COMMENT '',
  `department` string COMMENT '',
  `state` string COMMENT '',
  `salary` bigint COMMENT '',
  `age` bigint COMMENT '',
  `bonus` bigint COMMENT '',
  `ts` bigint COMMENT '',
  `newfield` bigint COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://soumil-dms-learn/hudi/hudi_table'
TBLPROPERTIES (
  'last_commit_time_sync'='20221221215109388',
  'transient_lastDdlTime'='1671658829')
Note that the LOCATION still points to the same S3 path, so both table definitions read the same underlying Hudi files.
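As a quick sanity check, here is a sketch of querying the renamed table from the notebook, assuming the Glue Data Catalog is wired up as the metastore (in Athena a plain SELECT works the same way):

# Both DDLs point at the same LOCATION, so the new table name
# should return the same rows as the original hudi_table.
spark.sql("SELECT emp_id, employee_name, state FROM new_hudi_table LIMIT 5").show()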
Learn More About Hudi