Running PySpark Locally with Docker Before Deploying on AWS Glue
Just wrapped up a cool project using AWS Glue with PySpark, all while leveraging Docker!
Here’s a sneak peek of the workflow:
Why Docker and Glue for Spark Code?
To test and execute Spark code locally, we run it inside a Docker container built from the official AWS Glue Docker image. This is especially helpful for Spark work, because Spark development is inherently iterative: running every test run directly on the cloud can cost a lot of money and doesn’t make sense until development is complete. Ideally, testing should happen in your local environment, and Docker makes that possible, enabling a smoother development cycle without running up your cloud costs.
Let's walk through the Spark job glue-pyspark.py, which demonstrates this workflow in action.
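Here’s a minimal sketch of what a glue-pyspark.py job could look like; the S3 paths and the filter column are illustrative placeholders, not the actual project code:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Bootstrap a GlueContext on top of a standard SparkContext -- the same
# entry point Glue uses when the job runs in the cloud.
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Placeholder transformation: read CSV data from S3, keep the "active" rows,
# and write the result back as Parquet. Bucket and column names are
# assumptions for illustration only.
df = spark.read.option("header", "true").csv("s3://my-example-bucket/input/")
active = df.filter(df["status"] == "active")
active.write.mode("overwrite").parquet("s3://my-example-bucket/output/")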
Key Highlights:
- Using Docker for easy environment setup and portability
- Running Spark jobs within the AWS Glue ecosystem
- Working with the AWS CLI for seamless profile and credentials management
- Exposing Spark’s web UI on ports 4040 and 18080 for live job monitoring
Here’s the docker run command that ties it all together (PowerShell syntax, using the Glue 4.0 image):
docker run -it -v ${env:USERPROFILE}\.aws:/home/glue_user/.aws -v ${WORKSPACE_LOCATION}:/home/glue_user/workspace/ `
-e AWS_PROFILE=default -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 `
--name glue_spark_submit amazon/aws-glue-libs:glue_libs_4.0.0_image_01 spark-submit /home/glue_user/workspace/${SCRIPT_FILE_NAME}
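A few notes on the command, assuming a PowerShell session: $WORKSPACE_LOCATION should point to the local folder containing your script and $SCRIPT_FILE_NAME to the script itself (here, glue-pyspark.py). Mounting your .aws folder and setting AWS_PROFILE lets the containerized job reuse your AWS CLI credentials, and because ports 4040 and 18080 are published, you can watch the job live in the Spark UI at http://localhost:4040 while it runs.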