Developer Guide: How to Submit Hudi PySpark(Python) Jobs to EMR Serverless (7.1.0) with AWS Glue Hive MetaStore

Apache Hudi is a powerful data management framework that brings ACID transactions to data lakes. Running Hudi on Amazon EMR Serverless pairs Hudi's table management with on-demand, serverless Spark capacity. This step-by-step guide walks you through submitting a Hudi PySpark job to an EMR Serverless (release 7.1.0) application that uses the AWS Glue Data Catalog as its Hive metastore.

Step 1: Create an EMR Serverless Application

Before running any jobs, you need to create an EMR Serverless application. This is a one-time setup step:

  1. Open the AWS Management Console and navigate to EMR Serverless.
  2. Create a new application:
     - Select Spark as the framework.
     - Define your application name.
     - Configure network settings and logging (e.g., an S3 bucket for logs).


Click Create application.

Verify the application and copy its application ID.
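The console steps above can also be scripted. The sketch below builds the CreateApplication request for the EMR Serverless API; the application name is a placeholder of my choosing, and the (commented) boto3 call requires AWS credentials:

```python
def create_application_request(name: str) -> dict:
    """Build a CreateApplication payload for a Spark app on EMR 7.1.0."""
    return {
        "name": name,
        "releaseLabel": "emr-7.1.0",
        "type": "SPARK",
    }

# To actually create the application (requires AWS credentials):
# import boto3
# emr = boto3.client("emr-serverless")
# resp = emr.create_application(**create_application_request("hudi-demo-app"))
# print(resp["applicationId"])  # copy this ID for the submit step
```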

Step 2: Write Your Hudi PySpark Job
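The original script is not reproduced in this copy of the post; below is a minimal sketch of a Hudi upsert job that syncs its table definition to the AWS Glue Data Catalog via Hive sync. The table name, record key, partition field, sample rows, and S3 path are all placeholders you should replace:

```python
# hudi_job.py -- minimal Hudi PySpark job (all names and paths are placeholders).

TABLE_NAME = "customers"
BASE_PATH = "s3://YOUR-BUCKET/hudi/customers"

# Hudi write options: upsert into a Copy-on-Write table and sync the table
# to the Glue Data Catalog through Hive sync in HMS mode.
HUDI_OPTIONS = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "state",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": TABLE_NAME,
}

def main():
    from pyspark.sql import SparkSession  # provided by the EMR runtime

    spark = (
        SparkSession.builder.appName("hudi-emr-serverless-demo")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    # Placeholder sample data; replace with your real source.
    df = spark.createDataFrame(
        [
            (1, "John", "CA", "2024-01-01 10:00:00"),
            (2, "Jane", "NY", "2024-01-01 11:00:00"),
        ],
        ["customer_id", "name", "state", "updated_at"],
    )

    df.write.format("hudi").options(**HUDI_OPTIONS).mode("append").save(BASE_PATH)
    spark.stop()

# Uncomment when deploying; pyspark is only available on the EMR runtime:
# if __name__ == "__main__":
#     main()
```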

Upload the File to S3
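One way to get the script into S3 is with boto3; the bucket and key below are placeholders, and the upload itself needs AWS credentials:

```python
def script_uri(bucket: str, key: str) -> str:
    """Return the s3:// URI that the job submission will reference."""
    return f"s3://{bucket}/{key}"

def upload_script(local_path: str, bucket: str, key: str) -> str:
    """Upload the job script to S3 (requires AWS credentials)."""
    import boto3
    boto3.client("s3").upload_file(local_path, bucket, key)
    return script_uri(bucket, key)

# Example (placeholder bucket name):
# print(upload_script("hudi_job.py", "YOUR-BUCKET", "scripts/hudi_job.py"))
```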

Export ENV variables
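The original shell exports were not preserved here. One Python equivalent is to stage the same values as environment variables so the submit step can read them; all three values are placeholders to substitute:

```python
import os

# Placeholders -- substitute the application ID from Step 1, your job
# execution role ARN, and your bucket.
os.environ["APPLICATION_ID"] = "YOUR-APPLICATION-ID"
os.environ["EXECUTION_ROLE_ARN"] = "arn:aws:iam::111122223333:role/YOUR-EMR-SERVERLESS-ROLE"
os.environ["S3_BUCKET"] = "YOUR-BUCKET"
```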


Submit Job
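The submission command itself is missing from this copy. Below is a boto3 sketch of StartJobRun: it puts the Hudi bundle that ships with the EMR release on the classpath and points Hive at the Glue Data Catalog. The script path is a placeholder, and the IDs are read from the environment variables exported earlier:

```python
def start_job_run_request(application_id: str, role_arn: str, entry_point: str) -> dict:
    """Build a StartJobRun payload that loads the EMR-bundled Hudi jar and
    uses the Glue Data Catalog as the Hive metastore."""
    spark_params = " ".join(
        [
            "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar",
            "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
            "--conf spark.hadoop.hive.metastore.client.factory.class="
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        ]
    )
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }

# Actual submission (requires AWS credentials and the exported variables):
# import boto3, os
# emr = boto3.client("emr-serverless")
# resp = emr.start_job_run(**start_job_run_request(
#     os.environ["APPLICATION_ID"],
#     os.environ["EXECUTION_ROLE_ARN"],
#     f"s3://{os.environ['S3_BUCKET']}/scripts/hudi_job.py",
# ))
# print(resp["jobRunId"])
```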


Output


Query Athena
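Once Hive sync has registered the table in the Glue Data Catalog, it is queryable from Athena. The database, table, and results location below are placeholders; the (commented) boto3 call requires AWS credentials:

```python
DATABASE = "default"  # placeholder: the database used by Hive sync
QUERY = "SELECT customer_id, name, state FROM customers LIMIT 10"

# Running the query (requires AWS credentials and an S3 results location):
# import boto3
# athena = boto3.client("athena")
# resp = athena.start_query_execution(
#     QueryString=QUERY,
#     QueryExecutionContext={"Database": DATABASE},
#     ResultConfiguration={"OutputLocation": "s3://YOUR-BUCKET/athena-results/"},
# )
# print(resp["QueryExecutionId"])
```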


Exercise Files

https://github.com/soumilshah1995/emr-hudi-getting-started/blob/main/README.md


Conclusion

Running Hudi jobs on EMR Serverless simplifies the management of large-scale data lakes by leveraging serverless computing. By following this guide, you’ve set up and run a Hudi PySpark job on EMR Serverless, taking advantage of both the flexibility of PySpark and the robustness of Hudi.

Feel free to customize the script and explore more Hudi functionalities to suit your data processing needs.
