Create EMR Transient Cluster, Submit PySpark Job with Async Callback, and Auto-Terminate the Cluster

In this blog, we'll walk through creating and managing an EMR (Elastic MapReduce) cluster on EC2 to run PySpark jobs using AWS Step Functions. We'll use an asynchronous callback pattern so that the cluster is terminated automatically after the job completes.


Overview

The steps involved include:

  1. Creating a Step Function to manage the lifecycle of an EMR cluster.
  2. Uploading necessary files, including bootstrap scripts and PySpark job scripts, to S3.
  3. Configuring the cluster and running the job.
  4. Automatically terminating the EMR cluster upon job completion.

Step 1: Create a Step Function

Below is the JSON definition for the Step Function:


Step Function JSON definition: LINK
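The full definition is linked above; the sketch below shows the overall shape, assuming the managed EMR service integrations. Note that the `.sync` integration pattern blocks until EMR reports completion (the title's callback behavior can alternatively be built with `.waitForTaskToken`); the release label, instance types, IAM role names, and input fields (`ClusterName`, `BootstrapS3Path`, `JobS3Path`) are placeholders to adapt.

```json
{
  "Comment": "Sketch: create transient EMR cluster, run PySpark step, always terminate",
  "StartAt": "CreateCluster",
  "States": {
    "CreateCluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
      "Parameters": {
        "Name.$": "$.ClusterName",
        "ReleaseLabel": "emr-6.15.0",
        "Applications": [{ "Name": "Spark" }],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "Instances": {
          "InstanceGroups": [
            { "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1 },
            { "InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2 }
          ],
          "KeepJobFlowAliveWhenNoSteps": true
        },
        "BootstrapActions": [
          { "Name": "install-deps", "ScriptBootstrapAction": { "Path.$": "$.BootstrapS3Path" } }
        ]
      },
      "ResultPath": "$.Cluster",
      "Next": "SubmitStep"
    },
    "SubmitStep": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.Cluster.ClusterId",
        "Step": {
          "Name": "pyspark-job",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args.$": "States.Array('spark-submit', '--deploy-mode', 'cluster', $.JobS3Path)"
          }
        }
      },
      "ResultPath": "$.StepResult",
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "ResultPath": "$.Error", "Next": "TerminateCluster" }
      ],
      "Next": "TerminateCluster"
    },
    "TerminateCluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
      "Parameters": { "ClusterId.$": "$.Cluster.ClusterId" },
      "End": true
    }
  }
}
```

The `Catch` on the step submission routes failures to `TerminateCluster` as well, which is what guarantees the transient cluster is torn down whether the job succeeds or fails.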

Step 2: Upload Files to S3

To prepare for the EMR job, upload the necessary bootstrap and job scripts to an S3 bucket.

Bootstrap Script

Below is a sample bootstrap script to install dependencies on the cluster:
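A minimal sketch of such a script — the package list here is illustrative; install whatever your job actually depends on:

```sh
#!/bin/bash
# bootstrap.sh — runs on every cluster node before EMR services start.
# Fail fast and echo each command for easier debugging in the bootstrap logs.
set -euxo pipefail

# Illustrative dependencies; adjust to your PySpark job's needs.
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install boto3 pandas pyarrow
```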

Sample PySpark Script
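Below is an illustrative `test_job.py`. The transformation is a toy example and the (commented-out) output path is a placeholder; the point is simply a self-contained job that `spark-submit` can run to completion.

```python
# test_job.py — minimal PySpark job (illustrative).

def label_parity(n: int) -> str:
    """Pure helper used in the Spark transformation below."""
    return "even" if n % 2 == 0 else "odd"

def main():
    # Imported inside main() so the helper above can be reused
    # without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("test_job").getOrCreate()
    rows = spark.sparkContext.parallelize(range(10)).map(
        lambda n: (n, label_parity(n))
    )
    df = spark.createDataFrame(rows, ["n", "parity"])
    df.show()
    # Hypothetical output location; replace with your own bucket:
    # df.write.mode("overwrite").parquet("s3://YOUR_BUCKET/output/")
    spark.stop()

if __name__ == "__main__":
    main()
```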


Upload both of these scripts to S3.
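For example, with the AWS CLI (`YOUR_BUCKET` is a placeholder for your bucket name):

```sh
aws s3 cp bootstrap.sh s3://YOUR_BUCKET/scripts/bootstrap.sh
aws s3 cp test_job.py  s3://YOUR_BUCKET/scripts/test_job.py
```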


Step 3: Verify S3 Files

Ensure that the files have been successfully uploaded to your S3 bucket. Navigate to the AWS S3 Console and check the scripts directory for the bootstrap.sh and test_job.py files.
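You can also verify from the command line:

```sh
aws s3 ls s3://YOUR_BUCKET/scripts/
```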

Step 4: Submit the EMR Job

Here is a sample payload for submitting the EMR job through Step Functions:
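A sketch of the execution input — the field names here are placeholders and must match whatever your state machine definition reads from `$`:

```json
{
  "ClusterName": "transient-pyspark-cluster",
  "BootstrapS3Path": "s3://YOUR_BUCKET/scripts/bootstrap.sh",
  "JobS3Path": "s3://YOUR_BUCKET/scripts/test_job.py"
}
```

This payload can be passed via the console's "Start execution" dialog or with `aws stepfunctions start-execution --input file://payload.json`.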


Conclusion

By leveraging AWS Step Functions to manage EMR clusters, you can automate the lifecycle of your PySpark jobs on AWS. The setup ensures that resources are utilized efficiently, with the cluster being terminated as soon as the job completes, reducing costs and operational overhead.

