How to Use External Python Packages in a PySpark Job on EMR Serverless: A Beginner’s Guide
Introduction
Running PySpark jobs on AWS EMR Serverless provides a scalable and cost-efficient way to process large datasets without managing clusters. However, data engineers and developers often need external Python packages that are not bundled with PySpark out of the box. In this blog, we will walk through the steps to use external Python packages in a PySpark job on EMR Serverless.
We will use the boto3 and Faker packages as examples, showing how to package them in a virtual environment, upload them to S3, and configure your PySpark job to use these external libraries.
Video Guides
<TBD>
Create an EMR Serverless Application on Release Version 7.0.0
Create the Application and Copy the Application ID to a Notepad
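If you prefer the AWS CLI over the console, a command along these lines creates the application; the application name is a placeholder, and the applicationId returned in the response is the value to note down for later:

```bash
# Create an EMR Serverless Spark application on release 7.0.0.
# The --name value is a placeholder; use your own naming convention.
aws emr-serverless create-application \
  --release-label emr-7.0.0 \
  --type "SPARK" \
  --name external-packages-demo

# The response contains an "applicationId" field; keep it handy,
# it is required when submitting the job in Step 3.
```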
Launch AWS CloudShell
Start by opening AWS CloudShell, an in-browser terminal in the AWS Console. It lets you interact with AWS services through the AWS CLI without having to set anything up on your local machine.
Step 1: Package the Virtual Environment and Upload It to S3
I will showcase how to use the Faker library in a Spark job: we will generate dummy data and write it out as Parquet files. Let's start by creating the package and deploying it to S3, as sketched below.
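A minimal sketch of the packaging commands, run from CloudShell; the bucket name and prefix are placeholders. It assumes the venv-pack tool for bundling the environment, and that the Python interpreter used to build it is compatible with the EMR 7.0.0 runtime:

```bash
# Create and activate a virtual environment for the job's dependencies.
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate

# Install the external packages plus venv-pack, which bundles the environment.
pip install boto3 faker venv-pack

# Pack the environment into a relocatable archive.
venv-pack -o pyspark_venv.tar.gz

# Upload the archive to S3 (bucket name is a placeholder).
aws s3 cp pyspark_venv.tar.gz s3://your-bucket/artifacts/pyspark_venv.tar.gz
```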
Key Points
Build the virtual environment with a Python version compatible with the EMR Serverless 7.0.0 runtime, and pack it with venv-pack.
Upload the resulting pyspark_venv.tar.gz archive, along with the job script, to S3.
Reference the archive when submitting the job and point PYSPARK_PYTHON at the interpreter inside it, so the driver and executors can import boto3 and Faker.
Step 2: Write a PySpark Job That Uses the External Libraries
Now that we’ve packaged and uploaded the virtual environment, we can write a simple PySpark job that utilizes the external libraries (boto3 and Faker). This example will generate synthetic data using Faker and save it to S3 in Parquet format.
Here’s a sample PySpark script: spark_job.py
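The original script is not reproduced here, so the following is a hedged sketch of what spark_job.py could look like: Faker generates a batch of fake person records, Spark turns them into a DataFrame, and the result is written to S3 as Parquet. The output path and column names are placeholders.

```python
# spark_job.py -- illustrative sketch, not the original script.
# Generates synthetic records with Faker and writes them to S3 as Parquet.
import boto3
from faker import Faker
from pyspark.sql import SparkSession

OUTPUT_PATH = "s3://your-bucket/output/fake_people/"  # placeholder path


def generate_rows(n=1000):
    """Build n fake (name, address, email, job) records with Faker."""
    fake = Faker()
    return [
        (fake.name(), fake.address(), fake.email(), fake.job())
        for _ in range(n)
    ]


def main():
    spark = SparkSession.builder.appName("faker-parquet-demo").getOrCreate()

    # Sanity check that boto3 from the packaged environment is importable.
    print("boto3 version:", boto3.__version__)

    # Create a DataFrame from the generated rows and write it as Parquet.
    df = spark.createDataFrame(
        generate_rows(), ["name", "address", "email", "job"]
    )
    df.write.mode("overwrite").parquet(OUTPUT_PATH)

    spark.stop()


if __name__ == "__main__":
    main()
```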
Upload the Spark job to S3
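For example, from CloudShell (bucket and prefix are placeholders):

```bash
# Copy the job script to S3 so EMR Serverless can use it as the entry point.
aws s3 cp spark_job.py s3://your-bucket/scripts/spark_job.py
```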
Step 3: Submit the PySpark Job to EMR Serverless
Once the script is ready, it’s time to submit it to EMR Serverless using the AWS CLI. We need to ensure that the packaged virtual environment archive is available to the Spark job. To do this, we reference the S3 location where we uploaded the pyspark_venv.tar.gz and configure the necessary environment variables for Spark to use the Python interpreter inside the archive.
3.1. Run the PySpark Job on EMR Serverless
Use the following command to submit the job:
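A sketch of the submission command, following the pattern AWS documents for custom Python environments on EMR Serverless. The application ID (copied earlier to your notepad), execution role ARN, bucket names, and log prefix are placeholders:

```bash
aws emr-serverless start-job-run \
  --application-id <your-application-id> \
  --execution-role-arn arn:aws:iam::123456789012:role/EMRServerlessExecutionRole \
  --name faker-parquet-job \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://your-bucket/scripts/spark_job.py",
      "sparkSubmitParameters": "--conf spark.archives=s3://your-bucket/artifacts/pyspark_venv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://your-bucket/logs/"
      }
    }
  }'
```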
Explanation:
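In the sketch above (the names are placeholders, not fixed values):
spark.archives points Spark at the pyspark_venv.tar.gz archive in S3 and unpacks it under the alias environment on the driver and executors.
The PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON settings tell Spark to run Python from ./environment/bin/python, i.e. the interpreter that has boto3 and Faker installed.
entryPoint is the S3 location of spark_job.py uploaded in the previous step.
The execution role must have permission to read the script and archive from S3 and to write the Parquet output and job logs.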
Exercise files
Conclusion
With this guide, you have learned how to use external Python packages in a PySpark job on AWS EMR Serverless. By following these steps, you can package Python libraries in a virtual environment, upload them to S3, and configure your PySpark jobs to use those external dependencies. This lets you extend PySpark with libraries such as Faker or boto3 for more advanced workflows.
By leveraging EMR Serverless, you avoid the overhead of managing clusters and only pay for the resources used during your job execution. Whether you are generating synthetic data, processing complex datasets, or interacting with other AWS services, using external Python packages in your PySpark jobs has never been easier.