Develop scalable AWS Python jobs with Glue, Ray & the SDK for Pandas
Introduction
To my mind, one of the biggest announcements that came out of AWS re:Invent 2022, and which went largely unheralded, was the introduction of AWS Glue for Ray (in Preview). This is a new engine option for Glue version 4.0 and basically allows you to develop Python jobs that can process huge data sets via Ray & Modin. AWS Glue for Ray facilitates the distributed processing of your Python code over multi-node clusters.
Tied in with this was another announcement that awswrangler (now called AWS SDK for Pandas) also now supports Ray and Modin, enabling you to scale your Pandas workflows from a single machine to a multi-node environment, with little to no code changes.
Glue for Ray in combination with the SDK for Pandas is now a great option for Python developers working with huge data sets in the AWS ecosystem who, until now, felt pushed into using, say, Spark on EMR, or into setting up their own cluster on Fargate/ECS/Kubernetes to run something like Dask.
Using Glue interactive sessions, you can now leverage Ray datasets and/or familiar Python libraries such as Pandas, distributed by Modin, and the AWS SDK for Pandas, to make your workflows easy to write, run and scale.
For more information about Ray datasets, see Ray Datasets in the Ray documentation. I also wrote a LinkedIn article on Ray a while back that should serve as a good general introduction. You can check that out here.
For more information about Pandas, see the Pandas website, and for more information about Modin, see the Modin website.
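To give a flavour of what "little to no code changes" means in practice, here's a minimal sketch of the usual Modin switch: only the import line changes and the rest of the Pandas code stays the same. The file path is just a placeholder, and reading straight from S3 assumes s3fs is installed.
import modin.pandas as pd   # the only change from "import pandas as pd"

# Everything below is ordinary Pandas code, but the work is distributed by Modin
df = pd.read_parquet("s3://my-bucket/some/data.parquet")   # placeholder path
print(df.describe())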
I’ll be using the SDK for Pandas in the example code for this article. It’s a Python library, developed by AWS, that makes it easy to move data between Pandas data frames and various AWS services. Check out my LinkedIn article about the AWS SDK for Pandas here for an introduction to it.
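As a quick taste of the SDK for Pandas on its own, moving a data frame to and from S3 is a one-liner in each direction. A minimal sketch; the bucket path is just a placeholder.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"fare_amount": [10.0, 12.5], "tip_amount": [2.0, 3.0]})

# write the data frame to S3 as Parquet, then read it straight back
wr.s3.to_parquet(df, path="s3://my-bucket/example/fares.parquet")   # placeholder bucket
df2 = wr.s3.read_parquet("s3://my-bucket/example/fares.parquet")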
There are two main ways to interact with Ray on Glue, both via the Glue Studio console. One is the Ray Script Editor, where you write a Python job script; the other is a Glue Studio notebook, which presents a familiar Jupyter notebook experience if you prefer that. We’re going to be using the regular Script Editor.
Before we begin it’s important to repeat that Ray on Glue is in preview just now. I think it’s highly likely it will become GA in the very near future but you should certainly not use it for anything you want to put into production until that is confirmed.
Ok with that being said, let’s get started.
The data
The first thing we’ll need is some big data to process. For this, we can use one of AWS's open data sets that are made available for free on publicly available S3 locations. I’ll be using the famous NYC yellow taxi cab trip set, which is a bunch of data related to yellow taxi cab rides in New York.
You can find information on that data at:
There are a number of data fields available but the ones we are interested in are the fare_amount and tip_amount fields as we want to calculate the average tip New Yorkers give to yellow cab drivers as a percentage of the fare.
We’ll read 3 years' worth of data into a data frame, which equates to roughly 140 million records (~25 GB), then perform some computations on it. So, although it’s not a particularly huge set of data, it’s certainly more than could fit in the RAM of the average home laptop or PC. I did try to use more years of data than the 3 I ended up with, but found there were incompatibilities between some of the older data sets' column types and the more recent ones, which caused the mean calculations to fail.
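If you want to check those column-type differences yourself, the SDK for Pandas can report the schema of a Parquet file without reading the whole thing. A minimal sketch comparing one month from 2018 against one from 2019, assuming the 2018 file is still published at the same path in the public bucket:
import awswrangler as wr

# Column types for a 2018 file vs a 2019 file from the same public bucket
old_cols, _ = wr.s3.read_parquet_metadata("s3://nyc-tlc/trip data/yellow_tripdata_2018-01.parquet")
new_cols, _ = wr.s3.read_parquet_metadata("s3://nyc-tlc/trip data/yellow_tripdata_2019-01.parquet")

# Print any columns whose types differ between the two years
for col, dtype in old_cols.items():
    if new_cols.get(col) != dtype:
        print(col, dtype, "->", new_cols.get(col))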
The Glue setup
To start things off from the Glue side of things, first, navigate to the AWS Glue Studio Jobs service page from the AWS main console. It should look like this:
Choose the Ray Script Editor option and hit the Create button at the top right of the screen. You’ll then be presented with the normal Glue Job script editor shown below.
Before we write any code, go to the Job details tab and enter a suitable name for your job. Next, select a suitable IAM role. As a minimum, the IAM role should have CloudWatch and S3 access rights. In the Type input field choose Ray, then set the Number of workers input to 3 and the Retries to 0.
For the moment, you can’t change the Glue version, worker type, language, or job timeout length.
Each Z.2X Glue 4 worker provides 8 vCPUs and 64 GB of RAM, so they are well specified.
For Glue version 4.0, the Ray environment provides a number of extra packages over and above the regular Glue environment, including Ray itself, Modin and the AWS SDK for Pandas (the full list is in the Glue for Ray documentation).
The --additional-python-modules parameter is also available if you need to add a new module or change the version of an existing module.
In the Advanced properties section, you can enter a script name and S3 storage location for your job, a maximum concurrency (i.e. how many instances of the same job can run at any one time), and a couple of other options which we don’t need just now.
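If you prefer to set the job up programmatically rather than through the console, the equivalent settings can be passed to Glue's create_job API via boto3. This is only a sketch: the job name, role and script location are placeholders, and the Runtime string is an assumption you should check against the current Glue for Ray documentation.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="nyc-taxi-tips-ray",                 # placeholder job name
    Role="MyGlueRayJobRole",                  # placeholder IAM role with S3 + CloudWatch access
    GlueVersion="4.0",
    WorkerType="Z.2X",                        # the only worker type available for Ray jobs just now
    NumberOfWorkers=3,
    MaxRetries=0,
    Command={
        "Name": "glueray",                    # marks this as a Ray job
        "PythonVersion": "3.9",
        "Runtime": "Ray2.4",                  # assumed runtime label; check the docs
        "ScriptLocation": "s3://my-bucket/scripts/nyc_taxi_tips.py",   # placeholder path
    },
    DefaultArguments={
        # optional: add or pin extra Python modules for the job
        "--additional-python-modules": "scikit-learn==1.3.0",
    },
)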
The code
Writing code to take advantage of Ray using the SDK for Pandas is fairly straightforward, especially if you already have experience with the SDK for Pandas. It’s just a matter of putting an import ray at the top of your code, and everything else is pretty much the same.
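For completeness: if you run the SDK for Pandas outside of Glue (say on your own Ray cluster), you can also select the execution engine and memory format explicitly. A small sketch, assuming the SDK for Pandas 3.x API:
import awswrangler as wr

# Explicitly ask the SDK for Pandas to distribute work with Ray and use Modin data frames
wr.engine.set("ray")
wr.memory_format.set("modin")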
Our Python script is going to do just two main things:
1. Read three years' worth of NYC yellow taxi trip data from S3 into a data frame.
2. Run some simple analytics on that data frame: the average fare, the average tip, and the average tip as a percentage of the fare.
Go back to the Script tab of the Ray Script Editor screen and type in the following code.
import ray
import awswrangler as wr
import pyarrow
import pandas as pd
import boto3
years=['2019','2020','2021']
months=['01','02','03','04','05','06','07','08','09','10','11','12']
files=[]
# setup the list of files we want to read
for yr in years:
    for mnth in months:
        files.append(f"s3://nyc-tlc/trip data/yellow_tripdata_{yr}-{mnth}.parquet")
# read in the data
df = wr.s3.read_parquet(files)
# do some simple analytics
total_trips = len(df)  # total number of rows (trips) in the data frame
avg_fare = df['fare_amount'].mean()
avg_tip = df['tip_amount'].mean()
print(f"Total nr of trips between 2018-2021 was {total_trips}")?
print(f"Average fare amount between 2019-2021 was {avg_fare}")
print(f"Average tip amount between 2019-2021 was {avg_tip}")
print(f"Average tip as % of fare between 2019–2021 = {avg_tip/avg_fare*100}%")
Now hit the SAVE button on the top right-hand corner of the screen followed by the Run button.
Ray on Glue writes the output of its scripts to the CloudWatch log group /aws-glue/ray/jobs/script-log, so you can check there to ensure your script ran as expected.
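If you'd rather pull those logs programmatically than click through the CloudWatch console, here's a minimal sketch using boto3, filtering on that log group name:
import boto3

logs = boto3.client("logs")

# Fetch recent events from the Glue for Ray script log group
resp = logs.filter_log_events(
    logGroupName="/aws-glue/ray/jobs/script-log",
    limit=50,
)
for event in resp["events"]:
    print(event["message"].rstrip())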
When I ran the script, my main output records were:
Total nr of trips between 2019-2021 was 140151844
Average fare amount between 2019-2021 was 13.305165840058443
Average tip amount between 2019-2021 was 2.2043904113027573
Average tip as % of fare = 16.56792886162984%
New Yorkers are a generous bunch.
Ignoring the start-up time for Glue, the job took 42 seconds to read the data and perform the computations which isn’t too shabby considering the amount of data processed.
I upped the number of worker nodes from the default of 3 to 9 and re-ran the same script. The processing time dropped to 31 seconds, which isn't a great improvement considering the hike in processing power made available to the job, but it's probably explained by the fact that the data set just wasn't big enough to need all that extra capacity.
Ok, that’s all I have for now. If you found this article useful, please like and share and help spread the knowledge around.