How to Use External Python Packages in a PySpark Job on EMR Serverless: A Beginner’s Guide

Introduction

Running PySpark jobs on AWS EMR Serverless provides a scalable and cost-efficient way to process large datasets without needing to manage clusters. However, many data engineers and developers often need to use external Python packages that are not natively available in PySpark. In this blog, we will walk you through the steps to use external Python packages in a PySpark job on EMR Serverless.

We will use boto3 and Faker packages as examples, showing you how to package them in a virtual environment, upload them to S3, and configure your PySpark job to use these external libraries.


Video Guides

<TBD>


Create an EMR Serverless Application with Release Version 7.0.0


Create the Application and Copy the Application ID into a Notepad


Launch AWS CloudShell

Start by opening AWS CloudShell, the in-browser terminal built into the AWS Console. It lets you interact with AWS services through the AWS CLI and perform tasks without having to set anything up on your local machine.


Step 1: Package the External Libraries and Upload Them to S3

I will showcase how to use the Faker library in a Spark job: we will generate dummy data and then write it out as Parquet files. First, let’s create the virtual environment package and deploy it to S3.
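Below is a minimal sketch of the packaging commands, assuming you run them in CloudShell; the environment name pyspark_venv and the artifacts/ prefix are placeholder choices, and you should replace <your-s3-bucket> with your own bucket.

```bash
# Create and activate a virtual environment (CloudShell runs Amazon Linux,
# the same OS family as the EMR Serverless 7.x runtime)
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate

# Install the external packages plus venv-pack
pip install boto3 faker venv-pack

# Package the environment into a portable tar.gz archive
venv-pack -f -o pyspark_venv.tar.gz

# Upload the archive to S3 (bucket and prefix are placeholders)
aws s3 cp pyspark_venv.tar.gz s3://<your-s3-bucket>/artifacts/pyspark_venv.tar.gz

deactivate
```

Building the environment in CloudShell rather than on a local macOS or Windows machine keeps the packaged Python compatible with the Amazon Linux image that EMR Serverless runs on.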


Key Points

  • Replace <your-s3-bucket> with the name of your S3 bucket where the virtual environment will be stored.
  • The venv-pack utility simplifies packaging the virtual environment into a portable archive (pyspark_venv.tar.gz).

Step 2: Write a PySpark Job That Uses the External Libraries

Now that we’ve packaged and uploaded the virtual environment, we can write a simple PySpark job that utilizes the external libraries (boto3 and Faker). This example will generate synthetic data using Faker and save it to S3 in Parquet format.

Here’s a sample PySpark script: spark_job.py
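The script below is a minimal sketch of such a job; the output bucket, prefix, and row count are placeholders rather than values from the original post, and it only exercises Faker (boto3 is installed in the same environment for jobs that call other AWS services).

```python
# spark_job.py -- generate synthetic records with Faker and write them to S3 as Parquet
from faker import Faker
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("faker-synthetic-data").getOrCreate()

    fake = Faker()

    # Build a batch of fake customer records (placeholder row count)
    records = [
        (fake.name(), fake.address(), fake.email(), fake.date_of_birth().isoformat())
        for _ in range(1000)
    ]
    df = spark.createDataFrame(records, schema=["name", "address", "email", "date_of_birth"])

    # Write the synthetic data to S3 in Parquet format (replace the bucket/prefix)
    df.write.mode("overwrite").parquet("s3://<your-s3-bucket>/output/fake_customers/")

    spark.stop()


if __name__ == "__main__":
    main()
```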

Upload the Spark job to S3
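
Once the script is saved, copy it to S3 so EMR Serverless can reach it (the path below is a placeholder):

```bash
aws s3 cp spark_job.py s3://<your-s3-bucket>/scripts/spark_job.py
```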

Step 3: Submit the PySpark Job to EMR Serverless

Once the script is ready, it’s time to submit it to EMR Serverless using the AWS CLI. We need to ensure that the packaged virtual environment archive is available to the Spark job. To do this, we reference the S3 location where we uploaded the pyspark_venv.tar.gz and configure the necessary environment variables for Spark to use the Python interpreter inside the archive.

3.1. Run the PySpark Job on EMR Serverless

Use the following command to submit the job:
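The command below is a sketch of that submission; the application ID, execution role ARN, and S3 paths are placeholders to replace with your own values.

```bash
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --name faker-parquet-job \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<your-s3-bucket>/scripts/spark_job.py",
      "sparkSubmitParameters": "--conf spark.archives=s3://<your-s3-bucket>/artifacts/pyspark_venv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }'
```

The #environment suffix on spark.archives tells Spark to unpack the archive into a directory named environment, which is the path the PYSPARK_PYTHON settings point at.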


Explanation:

  • application-id: The ID of your EMR Serverless application.
  • entryPoint: The location of your PySpark script in S3.
  • spark.archives: This specifies the virtual environment archive we uploaded to S3. Spark unpacks this into the ./environment directory.
  • PYSPARK_PYTHON: Points to the Python executable inside the unpacked environment.

Exercise files

https://github.com/soumilshah1995/pyspark-emr-serverless-packages/blob/main/README.md

Conclusion

In this guide, you learned how to use external Python packages in a PySpark job on AWS EMR Serverless. By following these steps, you can install and package any Python libraries, upload them to S3, and configure your PySpark jobs to use those external dependencies, extending PySpark with libraries like Faker or boto3 for more advanced workflows.

By leveraging EMR Serverless, you avoid the overhead of managing clusters and only pay for the resources used during your job execution. Whether you are generating synthetic data, processing complex datasets, or interacting with other AWS services, using external Python packages in your PySpark jobs has never been easier.
