Harnessing AWS Glue for High-Volume JSON Processing

Sanchit Balchandani

Engineering Manager II at EPAM Systems

Published: May 26, 2023

Introduction:

In today's data-driven world, managing and interpreting large amounts of data has become increasingly vital. We often encounter scenarios where JSON files flood our AWS S3 buckets and demand efficient processing. In this blog, I will share a recent use case where I used AWS Glue, Python, and Terraform (mainly to handle infrastructure) to process a deluge of small JSON files efficiently.

Background:

Data handling at scale can be an arduous task, especially when dealing with a continuous influx of JSON files. I was asked to process approximately 1 million small JSON files per hour landing in an S3 bucket! These files needed to be transformed into larger, more manageable XML files for further analysis by another application. To address this, I used AWS Glue and other technologies to create a robust and scalable data processing pipeline.

Tech Stack:

Terraform for creating AWS resources
Glue, S3, Systems Manager, and CloudWatch services from AWS
Python for writing the Glue scripts

Overcoming the Small Files Problem with Glue:

The first hurdle I encountered was the infamous "Small Files Problem" within AWS Glue. This challenge arises when the Spark driver in AWS Glue is overwhelmed by list() calls to S3 and by tracking a very large number of tiny files, leading to memory exhaustion and, unfortunately, job failures. The first iteration of my solution ran for 6 hours and, you guessed it, failed due to a memory error.

To combat this issue, I devised a primary Glue job written in Python, leveraging PySpark and Boto3. This job merged multiple small JSON files into larger ones based on a configurable parameter, mitigating the Small Files Problem. It also optimized downstream processing by partitioning the merged files into date-based folders within the S3 bucket. I saved the files in the following format:

YYYYMMDD/HHMM

This layout let me run the partition job multiple times per day, with each run grouping the files for specific hours. As a result, the main transformation job, which runs at midnight, was more efficient because it dealt with fewer, larger files.
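To make the merge step concrete, here is a minimal sketch of its core logic in plain Python. It is not the actual Glue job: the S3 reads and writes (Boto3/PySpark) are left out, and the function names `partition_prefix`, `chunk_keys`, and `merge_documents`, as well as the choice of newline-delimited JSON for the merged output, are assumptions made for illustration.

```python
import json
from datetime import datetime
from itertools import islice
from typing import Iterable, Iterator, List


def partition_prefix(ts: datetime) -> str:
    """Return the date-based folder for a timestamp in YYYYMMDD/HHMM form."""
    return ts.strftime("%Y%m%d/%H%M")


def chunk_keys(keys: Iterable[str], files_per_merge: int) -> Iterator[List[str]]:
    """Yield lists of at most files_per_merge keys.

    files_per_merge stands in for the configurable merge parameter.
    """
    iterator = iter(keys)
    while True:
        chunk = list(islice(iterator, files_per_merge))
        if not chunk:
            return
        yield chunk


def merge_documents(raw_documents: Iterable[str]) -> str:
    """Merge small JSON documents into one newline-delimited JSON payload.

    Each document is parsed and re-serialized compactly, so malformed
    input fails here instead of in the downstream transformation job.
    """
    return "\n".join(
        json.dumps(json.loads(doc), separators=(",", ":"))
        for doc in raw_documents
    )
```

In a real job, each chunk of keys would be read from S3, passed through something like `merge_documents`, and written under the prefix returned by `partition_prefix` for the current run.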

Transforming and Optimizing Data Processing:

The secondary, or main, Glue job was designed to initialize a Spark session and fetch essential parameters like batch size and date from the AWS Systems Manager (SSM) Parameter Store. This job processed the JSON files in batches using Spark's inherent capabilities. I used xml.etree.ElementTree, a module in the Python standard library, to build the desired XML output from the JSON data. To optimize storage and write performance, the transformed data was compressed using gzip before being written back to the S3 bucket.
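The conversion and compression step can be sketched with the standard library alone. This is a hedged illustration, not the job's real code: the element names `records` and `record`, the helper names, and the mapping rules (JSON keys become tags, list items repeat the tag) are assumptions, and the sketch further assumes that every JSON key is a valid XML element name.

```python
import gzip
import json
import xml.etree.ElementTree as ET
from typing import Any, Iterable


def append_value(parent: ET.Element, tag: str, value: Any) -> None:
    """Recursively append a JSON value under parent as <tag>...</tag>."""
    if isinstance(value, list):
        # A JSON array becomes repeated sibling elements with the same tag.
        for item in value:
            append_value(parent, tag, item)
        return
    child = ET.SubElement(parent, tag)
    if isinstance(value, dict):
        for key, nested in value.items():
            append_value(child, key, nested)
    elif value is not None:
        # Scalars are stored as text; None leaves the element empty.
        child.text = str(value)


def records_to_gzipped_xml(
    raw_records: Iterable[str],
    root_tag: str = "records",
    record_tag: str = "record",
) -> bytes:
    """Convert JSON documents into one gzip-compressed XML document."""
    root = ET.Element(root_tag)
    for raw in raw_records:
        append_value(root, record_tag, json.loads(raw))
    xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    return gzip.compress(xml_bytes)
```

The returned bytes could then be uploaded as a single `.xml.gz` object, so each batch produces one compressed file rather than many small ones.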

[Figure: high-level architecture of the pipeline]

The Fruitful Outcome:

The result? A super-efficient pipeline that processed approximately 70k files in less than 2.5 minutes! This required 10 Data Processing Units (DPUs) in AWS Glue, and the solution can scale by adding DPUs to meet future requirements. At that rate, the current setup can process 1 million JSON files in around 35 minutes.
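The 35-minute figure follows from the measured batch if throughput scales linearly with file count, which is an assumption rather than something measured here. A quick check of the arithmetic:

```python
# Measured batch from the article: about 70k files in under 2.5 minutes on 10 DPUs.
files_per_run = 70_000
minutes_per_run = 2.5
target_files = 1_000_000

# Assumes throughput stays constant as the number of files grows.
files_per_minute = files_per_run / minutes_per_run
minutes_for_target = target_files / files_per_minute

print(f"{files_per_minute:.0f} files/minute")
print(f"{minutes_for_target:.1f} minutes for {target_files:,} files")
```

Since 2.5 minutes is an upper bound on the measured run, the result of roughly 35.7 minutes is an upper-bound estimate as well.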

Key Takeaway: The Power of Synergistic Tools and Strategies

The experience of using AWS Glue, Python, and Terraform to handle large amounts of JSON data highlights the significance of using the right combination of tools and strategies to address big data challenges. As data keeps growing rapidly, AWS Glue offers a reliable platform for building data processing pipelines, Python's flexibility allows for intricate data transformations, and Terraform manages the infrastructure as code. By combining these tools and strategies, you can effectively tackle the complexities of big data processing.

P.S. - Thanks to generative AI for the edits ;) and making my English look better. The use case is real though :D

Jitendra Shimpi

Data Engineer | Expert in AWS Glue, Lambda, PySpark, SQL & Snowflake | Building Scalable ETL Pipelines & Optimizing Big Data Workflows

1 year ago

It's really helpful. Thank you for sharing

Luke Tislow

Solving problems

1 year ago

Great post!

Sathish K V

Software Architect(Cloud/Apps and Data) with Project Management - PMP | 2x AWS Certified | 3x Java Oracle & DB Certified | 2x Unix certified

1 year ago

Well documented Sanchit Balchandani , also share user guide details solutions steps.

Mohit Suhane

Solution Architect at EPAM Systems

1 year ago

Love this, great work Sanchit Balchandani

Abhishek S.

Changemaker | Catalyst | Promoter | Enabler

1 year ago

Gajula Naresh Srikanth Kyasa : You might like reading this
