Running PySpark Locally with Docker Before Deploying on AWS Glue
Just wrapped up a cool project using AWS Glue with PySpark, all while leveraging Docker!
Here’s a sneak peek of the workflow:
Why Docker and Glue for Spark Code?
To test and execute Spark code locally, we run it inside a Docker container built from the official AWS Glue Docker image. This is especially helpful for Spark work, because Spark development is inherently iterative: running every test run directly on the cloud can cost a lot of money and doesn’t make sense until development is complete. Ideally, testing should happen in your local environment, and Docker makes that possible, enabling a smoother development cycle without running up your cloud costs.
Let's walk through the Spark job glue-pyspark.py, which demonstrates this workflow in action.
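Here’s a minimal sketch of what a glue-pyspark.py job could look like; the S3 paths and the filter column are illustrative placeholders, not the actual project code:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Bootstrap a GlueContext on top of a standard SparkContext -- the same
# entry point Glue uses when the job runs in the cloud.
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Placeholder transformation: read CSV data from S3, keep the "active" rows,
# and write the result back as Parquet. Bucket and column names are
# assumptions for illustration only.
df = spark.read.option("header", "true").csv("s3://my-example-bucket/input/")
active = df.filter(df["status"] == "active")
active.write.mode("overwrite").parquet("s3://my-example-bucket/output/")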
Key Highlights:
- Using Docker for easy environment setup and portability
- Running Spark jobs within the AWS Glue ecosystem
- Working with the AWS CLI for seamless profile and credentials management
- Exposing Spark’s web UI on ports 4040 and 18080 for live job monitoring
Here’s the docker run command that ties it all together (PowerShell syntax, using the Glue 4.0 image):
docker run -it -v ${env:USERPROFILE}\.aws:/home/glue_user/.aws -v ${WORKSPACE_LOCATION}:/home/glue_user/workspace/ `
-e AWS_PROFILE=default -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 `
--name glue_spark_submit amazon/aws-glue-libs:glue_libs_4.0.0_image_01 spark-submit /home/glue_user/workspace/${SCRIPT_FILE_NAME}
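A few notes on the command, assuming a PowerShell session: $WORKSPACE_LOCATION should point to the local folder containing your script and $SCRIPT_FILE_NAME to the script itself (here, glue-pyspark.py). Mounting your .aws folder and setting AWS_PROFILE lets the containerized job reuse your AWS CLI credentials, and because ports 4040 and 18080 are published, you can watch the job live in the Spark UI at http://localhost:4040 while it runs.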