How to Use External Python Packages in a PySpark Job on EMR Serverless: A Beginner’s Guide

Introduction

Running PySpark jobs on AWS EMR Serverless provides a scalable and cost-efficient way to process large datasets without needing to manage clusters. However, many data engineers and developers often need to use external Python packages that are not natively available in PySpark. In this blog, we will walk you through the steps to use external Python packages in a PySpark job on EMR Serverless.

We will use boto3 and Faker packages as examples, showing you how to package them in a virtual environment, upload them to S3, and configure your PySpark job to use these external libraries.


Video Guides

<TBD>


Create an EMR Serverless Application with Release Version 7.0.0


Create the Application and Copy the Application ID into a Notepad


Launch AWS CloudShell

Start by opening AWS CloudShell, the in-browser terminal built into the AWS Console. It lets you interact with AWS services through the AWS CLI and perform tasks without having to set anything up on your local machine.


Step 1: Package the External Libraries and Upload Them to S3

I will showcase how to use the Faker library in a Spark job: we will generate dummy data and then write it out as Parquet files. First, let’s create the virtual environment package and deploy it to S3.
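Below is a minimal sketch of the packaging commands, assuming you run them in CloudShell; the environment name pyspark_venv and the artifacts/ prefix are placeholder choices, and you should replace <your-s3-bucket> with your own bucket.

```bash
# Create and activate a virtual environment (CloudShell runs Amazon Linux,
# the same OS family as the EMR Serverless 7.x runtime)
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate

# Install the external packages plus venv-pack
pip install boto3 faker venv-pack

# Package the environment into a portable tar.gz archive
venv-pack -f -o pyspark_venv.tar.gz

# Upload the archive to S3 (bucket and prefix are placeholders)
aws s3 cp pyspark_venv.tar.gz s3://<your-s3-bucket>/artifacts/pyspark_venv.tar.gz

deactivate
```

Building the environment in CloudShell rather than on a local macOS or Windows machine keeps the packaged Python compatible with the Amazon Linux image that EMR Serverless runs on.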


Key Points

  • Replace <your-s3-bucket> with the name of your S3 bucket where the virtual environment will be stored.
  • The venv-pack utility simplifies packaging the virtual environment into a portable archive (pyspark_venv.tar.gz).

Step 2: Write a PySpark Job That Uses the External Libraries

Now that we’ve packaged and uploaded the virtual environment, we can write a simple PySpark job that utilizes the external libraries (boto3 and Faker). This example will generate synthetic data using Faker and save it to S3 in Parquet format.

Here’s a sample PySpark script: spark_job.py
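The script below is a minimal sketch of such a job; the output bucket, prefix, and row count are placeholders rather than values from the original post, and it only exercises Faker (boto3 is installed in the same environment for jobs that call other AWS services).

```python
# spark_job.py -- generate synthetic records with Faker and write them to S3 as Parquet
from faker import Faker
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("faker-synthetic-data").getOrCreate()

    fake = Faker()

    # Build a batch of fake customer records (placeholder row count)
    records = [
        (fake.name(), fake.address(), fake.email(), fake.date_of_birth().isoformat())
        for _ in range(1000)
    ]
    df = spark.createDataFrame(records, schema=["name", "address", "email", "date_of_birth"])

    # Write the synthetic data to S3 in Parquet format (replace the bucket/prefix)
    df.write.mode("overwrite").parquet("s3://<your-s3-bucket>/output/fake_customers/")

    spark.stop()


if __name__ == "__main__":
    main()
```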

Upload the Spark job to S3
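
Once the script is saved, copy it to S3 so EMR Serverless can reach it (the path below is a placeholder):

```bash
aws s3 cp spark_job.py s3://<your-s3-bucket>/scripts/spark_job.py
```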

Step 3: Submit the PySpark Job to EMR Serverless

Once the script is ready, it’s time to submit it to EMR Serverless using the AWS CLI. We need to ensure that the packaged virtual environment archive is available to the Spark job. To do this, we reference the S3 location where we uploaded the pyspark_venv.tar.gz and configure the necessary environment variables for Spark to use the Python interpreter inside the archive.

3.1. Run the PySpark Job on EMR Serverless

Use the following command to submit the job:
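The command below is a sketch of that submission; the application ID, execution role ARN, and S3 paths are placeholders to replace with your own values.

```bash
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --name faker-parquet-job \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<your-s3-bucket>/scripts/spark_job.py",
      "sparkSubmitParameters": "--conf spark.archives=s3://<your-s3-bucket>/artifacts/pyspark_venv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }'
```

The #environment suffix on spark.archives tells Spark to unpack the archive into a directory named environment, which is the path the PYSPARK_PYTHON settings point at.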


Explanation:

  • application-id: The ID of your EMR Serverless application.
  • entryPoint: The location of your PySpark script in S3.
  • spark.archives: This specifies the virtual environment archive we uploaded to S3. Spark unpacks this into the ./environment directory.
  • PYSPARK_PYTHON: Points to the Python executable inside the unpacked environment.

Exercise files

https://github.com/soumilshah1995/pyspark-emr-serverless-packages/blob/main/README.md

Conclusion

In this guide, you learned how to use external Python packages in a PySpark job on AWS EMR Serverless. By following these steps, you can install and package any Python libraries, upload them to S3, and configure your PySpark jobs to use those external dependencies, extending PySpark with libraries like Faker or boto3 for more advanced workflows.

By leveraging EMR Serverless, you avoid the overhead of managing clusters and only pay for the resources used during your job execution. Whether you are generating synthetic data, processing complex datasets, or interacting with other AWS services, using external Python packages in your PySpark jobs has never been easier.
