Develop scalable AWS Python jobs with Glue, Ray & the SDK for Pandas
Introduction
To my mind, one of the biggest announcements that came out of AWS re:Invent 2022, and which went largely unheralded, was the introduction of AWS Glue for Ray (in Preview). This is a new engine option for Glue version 4.0 and basically allows you to develop Python jobs that can process huge data sets via Ray & Modin. AWS Glue for Ray facilitates the distributed processing of your Python code over multi-node clusters.
Tied in with this was another announcement that awswrangler (now called AWS SDK for Pandas) also now supports Ray and Modin, enabling you to scale your Pandas workflows from a single machine to a multi-node environment, with little to no code changes.
Glue for Ray in combination with the SDK for Pandas is now a great option for Python developers working with huge data sets in the AWS ecosystem who, until now, felt pushed into using, say, Spark on EMR, or into setting up their own cluster on Fargate/ECS/Kubernetes to run something like Dask.
Using Glue interactive sessions, you can now leverage Ray datasets and/or familiar Python libraries such as Pandas, distributed by Modin, and the AWS SDK for Pandas, to make your workflows easy to write, run and scale.
For more information about Ray datasets, see Ray Datasets in the Ray documentation. I also wrote a LinkedIn article on Ray a while back that should serve as a good general introduction. You can check that out here.
For more information about Pandas, see the Pandas website, and for more information about Modin, see the Modin website.
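To give a flavour of what "little to no code changes" means in practice, here's a minimal sketch of the usual Modin switch: only the import line changes and the rest of the Pandas code stays the same. The file path is just a placeholder, and reading straight from S3 assumes s3fs is installed.
import modin.pandas as pd   # the only change from "import pandas as pd"

# Everything below is ordinary Pandas code, but the work is distributed by Modin
df = pd.read_parquet("s3://my-bucket/some/data.parquet")   # placeholder path
print(df.describe())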
I’ll be using the SDK for Pandas in the example code for this article. It’s a Python library, developed by AWS, that makes it easy to move data between Pandas data frames and various AWS services. Check out my LinkedIn article about the AWS SDK for Pandas here for an introduction to it.
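As a quick taste of the SDK for Pandas on its own, moving a data frame to and from S3 is a one-liner in each direction. A minimal sketch; the bucket path is just a placeholder.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"fare_amount": [10.0, 12.5], "tip_amount": [2.0, 3.0]})

# write the data frame to S3 as Parquet, then read it straight back
wr.s3.to_parquet(df, path="s3://my-bucket/example/fares.parquet")   # placeholder bucket
df2 = wr.s3.read_parquet("s3://my-bucket/example/fares.parquet")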
There are two main ways to interact with Ray on Glue, both via the Glue Studio console. One is the Ray Script Editor, where you write a Python job script; the other is a Glue Studio notebook, which presents a familiar Jupyter notebook experience if you prefer that. We’re going to be using the regular Script Editor.
Before we begin it’s important to repeat that Ray on Glue is in preview just now. I think it’s highly likely it will become GA in the very near future but you should certainly not use it for anything you want to put into production until that is confirmed.
Ok with that being said, let’s get started.
The data
The first thing we’ll need is some big data to process. For this, we can use one of AWS's open data sets that are made available for free on publicly available S3 locations. I’ll be using the famous NYC yellow taxi cab trip set, which is a bunch of data related to yellow taxi cab rides in New York.
You can find information on that data at:
There are a number of data fields available but the ones we are interested in are the fare_amount and tip_amount fields as we want to calculate the average tip New Yorkers give to yellow cab drivers as a percentage of the fare.
We’ll read 3 years' worth of data into a data frame, which equates to roughly 140 million records (~25 GB), then perform some computations on it. So, although it’s not a particularly huge set of data, it’s certainly more than could fit in the RAM of the average home laptop or PC. I did try to use more years of data than the 3 I ended up with, but found there were incompatibilities between some of the older data sets' column types and the more recent ones, which caused the mean calculations to fail.
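If you want to check those column-type differences yourself, the SDK for Pandas can report the schema of a Parquet file without reading the whole thing. A minimal sketch comparing one month from 2018 against one from 2019, assuming the 2018 file is still published at the same path in the public bucket:
import awswrangler as wr

# Column types for a 2018 file vs a 2019 file from the same public bucket
old_cols, _ = wr.s3.read_parquet_metadata("s3://nyc-tlc/trip data/yellow_tripdata_2018-01.parquet")
new_cols, _ = wr.s3.read_parquet_metadata("s3://nyc-tlc/trip data/yellow_tripdata_2019-01.parquet")

# Print any columns whose types differ between the two years
for col, dtype in old_cols.items():
    if new_cols.get(col) != dtype:
        print(col, dtype, "->", new_cols.get(col))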
The Glue setup
To start things off from the Glue side of things, first, navigate to the AWS Glue Studio Jobs service page from the AWS main console. It should look like this:
Choose the Ray Script Editor option and hit the Create button at the top right of the screen. You’ll then be presented with the normal Glue Job script editor shown below.
Before we write any code, go to the Job details tab and enter a suitable name for your job. Next, select a suitable IAM role. As a minimum, the IAM role should have CloudWatch and S3 access rights. In the Type input field choose Ray, then set the Number of workers input to 3 and the Retries to 0.
For the moment, you can’t change the Glue version, worker type, language, or job timeout length.
Each Z.2X Glue 4 worker provides 8 vCPUs and 64 GB of RAM, so they are well specified.
For Glue version 4.0, the Ray environment provides a number of extra packages over and above the regular Glue environment, including Ray itself, Modin and the AWS SDK for Pandas (the full list is in the Glue for Ray documentation).
The --additional-python-modules parameter is also available if you need to add a new module or change the version of an existing module.
In the Advanced properties section, you can enter a script name and S3 storage location for your job, a maximum concurrency (i.e. how many instances of the same job can run at any one time), and a couple of other options which we don’t need just now.
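If you prefer to set the job up programmatically rather than through the console, the equivalent settings can be passed to Glue's create_job API via boto3. This is only a sketch: the job name, role and script location are placeholders, and the Runtime string is an assumption you should check against the current Glue for Ray documentation.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="nyc-taxi-tips-ray",                 # placeholder job name
    Role="MyGlueRayJobRole",                  # placeholder IAM role with S3 + CloudWatch access
    GlueVersion="4.0",
    WorkerType="Z.2X",                        # the only worker type available for Ray jobs just now
    NumberOfWorkers=3,
    MaxRetries=0,
    Command={
        "Name": "glueray",                    # marks this as a Ray job
        "PythonVersion": "3.9",
        "Runtime": "Ray2.4",                  # assumed runtime label; check the docs
        "ScriptLocation": "s3://my-bucket/scripts/nyc_taxi_tips.py",   # placeholder path
    },
    DefaultArguments={
        # optional: add or pin extra Python modules for the job
        "--additional-python-modules": "scikit-learn==1.3.0",
    },
)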
The code
Writing code to take advantage of Ray using the SDK for Pandas is fairly straightforward, especially if you already have experience with the SDK for Pandas. It’s just a matter of putting an import ray at the top of your code, and everything else is pretty much the same.
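For completeness: if you run the SDK for Pandas outside of Glue (say on your own Ray cluster), you can also select the execution engine and memory format explicitly. A small sketch, assuming the SDK for Pandas 3.x API:
import awswrangler as wr

# Explicitly ask the SDK for Pandas to distribute work with Ray and use Modin data frames
wr.engine.set("ray")
wr.memory_format.set("modin")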
Our Python script is going to do just two main things:
1. Read three years' worth of NYC yellow taxi trip data from S3 into a data frame.
2. Run some simple analytics on that data frame: the average fare, the average tip, and the average tip as a percentage of the fare.
Go back to the Script tab of the Ray Script Editor screen and type in the following code.
import ray
import awswrangler as wr
import pyarrow
import pandas as pd
import boto3
years=['2019','2020','2021']
months=['01','02','03','04','05','06','07','08','09','10','11','12']
files=[]
# setup the list of files we want to read
for yr in years:
    for mnth in months:
        files.append(f"s3://nyc-tlc/trip data/yellow_tripdata_{yr}-{mnth}.parquet")
# read in the data
df = wr.s3.read_parquet(files)
# do some simple analytics
total_trips = len(df)  # total number of rows (trips) in the data frame
avg_fare = df['fare_amount'].mean()
avg_tip = df['tip_amount'].mean()
print(f"Total nr of trips between 2018-2021 was {total_trips}")?
print(f"Average fare amount between 2019-2021 was {avg_fare}")
print(f"Average tip amount between 2019-2021 was {avg_tip}")
print(f"Average tip as % of fare between 2019–2021 = {avg_tip/avg_fare*100}%")
Now hit the SAVE button on the top right-hand corner of the screen followed by the Run button.
Ray on Glue writes the output of its scripts to the CloudWatch log group /aws-glue/ray/jobs/script-log, so you can check there to ensure your script ran as expected.
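If you'd rather pull those logs programmatically than click through the CloudWatch console, here's a minimal sketch using boto3, filtering on that log group name:
import boto3

logs = boto3.client("logs")

# Fetch recent events from the Glue for Ray script log group
resp = logs.filter_log_events(
    logGroupName="/aws-glue/ray/jobs/script-log",
    limit=50,
)
for event in resp["events"]:
    print(event["message"].rstrip())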
When I ran the script, my main output records were:
Total nr of trips between 2019-2021 was 140151844
Average fare amount between 2019-2021 was 13.305165840058443
Average tip amount between 2019-2021 was 2.2043904113027573
Average tip as % of fare = 16.56792886162984%
New Yorkers are a generous bunch.
Ignoring the start-up time for Glue, the job took 42 seconds to read the data and perform the computations which isn’t too shabby considering the amount of data processed.
I upped the number of worker nodes from the default of 3 to 9 and re-ran the same script. The processing time dropped to 31 seconds, which isn't a great improvement considering the hike in processing power made available to the job, but it's probably explained by the fact that the data set just wasn't big enough to need all that extra capacity.
Ok, that’s all I have for now. If you found this article useful, please like and share and help spread the knowledge around.