Python Template: Incrementally Read S3 Objects from SQS Queue as Spark DataFrame | Hands on Labs

Modern data lakehouse architectures demand efficient, scalable, and cost-effective ways to process and manage large data volumes. This guide introduces a Python-based solution that enables incremental ingestion of S3 objects into your data lakehouse using Spark, with support for table formats like Hudi, Iceberg, and Delta Lake.


Overview

This solution offers an end-to-end pipeline to process new files in S3 incrementally. It leverages AWS S3 event notifications, SQS for buffering, and Spark for processing the data as DataFrames. The processed data can be written to various table formats, including Hudi, Iceberg, or Delta Lake.

Video Guide

Key Features

  • Cost Efficiency: Reduces overhead by avoiding frequent S3 list() operations.
  • Scalability: Handles large data volumes seamlessly using AWS services.
  • Error Handling: Supports Dead Letter Queues (DLQs) for failed events.
  • Flexibility: Compatible with AWS Lambda, Glue, EMR on EC2/EKS, or EMR Serverless.


Step-by-Step Implementation

Step 1: Setup S3 Events and SQS Queue

Begin by configuring S3 to send event notifications to an SQS queue upon new file uploads. Use the provided Python script to automate the setup:
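The repository ships a ready-made setup script; as a rough sketch of what that wiring involves (using boto3, with the bucket and queue names below as placeholders), it comes down to three calls:

import json
import boto3

BUCKET = "my-landing-bucket"         # placeholder bucket name
QUEUE_NAME = "s3-new-object-events"  # placeholder queue name

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# 1. Create the SQS queue that will buffer the S3 event notifications.
queue_url = sqs.create_queue(QueueName=QUEUE_NAME)["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# 2. Allow S3 to send messages to the queue.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
        "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{BUCKET}"}},
    }],
}
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)})

# 3. Send an event to the queue for every new object created in the bucket.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {"QueueArn": queue_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)
print(f"s3://{BUCKET} now sends ObjectCreated events to {queue_url}")

If you also want the dead-letter handling mentioned above, attach a RedrivePolicy attribute to the queue that points at a second queue.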




This script creates the necessary S3 event notifications and SQS queue. Once configured, any new files uploaded to the specified S3 bucket will generate messages in the queue.

Step 2: Configure the Consumer

The consumer polls the SQS queue for new messages, retrieves the S3 URIs of the uploaded files, and processes them using Spark. Configure the consumer using the JSON file:
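The exact field names come from the template's configuration file; a representative example, in which everything beyond queue_url, poll_interval, and batch_size is an illustrative assumption, might look like:

{
  "queue_url": "https://sqs.us-east-1.amazonaws.com/123456789012/s3-new-object-events",
  "poll_interval": 30,
  "batch_size": 10,
  "file_format": "json",
  "table_format": "hudi",
  "target_path": "s3://my-lakehouse/bronze/orders/"
}

A sensible reading is poll_interval as seconds between polls and batch_size as messages fetched per poll; note that SQS caps a single receive call at 10 messages.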


Adjust parameters like queue_url, poll_interval, and batch_size to suit your requirements.

Step 3: Run the Consumer

Once configured, start the consumer to process incoming SQS events and load S3 files into Spark as DataFrames:
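The exact launch command depends on the runtime (Lambda, Glue, EMR, or a local spark-submit). For a local test, something along these lines would work, where the script name and --config flag are assumptions and the Hudi bundle is only needed when writing Hudi tables:

spark-submit \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  consumer.py --config config.json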

Key Functions Explained

  1. The main function orchestrates the ETL workflow (see the sketch after this list):

  • Initializes the SQS poller to fetch messages.
  • Creates a Spark session for data processing.
  • Continuously polls for new messages, processes them using the process_batch function, and deletes processed messages.
  • Includes retry logic and periodic waits for efficient resource utilization.
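As a minimal sketch of that loop (the configuration keys match the example above, process_batch is sketched after the next list, and the signatures and helper names are assumptions rather than the template's exact code):

import time
import boto3
from pyspark.sql import SparkSession


def main(config: dict) -> None:
    # One Spark session reused across batches; table-format jars are supplied at submit time.
    spark = SparkSession.builder.appName("sqs-s3-incremental-ingest").getOrCreate()
    sqs = boto3.client("sqs")

    while True:
        # Long-poll the queue for up to batch_size messages (SQS caps a single receive at 10).
        resp = sqs.receive_message(
            QueueUrl=config["queue_url"],
            MaxNumberOfMessages=min(config.get("batch_size", 10), 10),
            WaitTimeSeconds=20,
        )
        messages = resp.get("Messages", [])
        if not messages:
            # Nothing new yet; back off for the configured interval before polling again.
            time.sleep(config.get("poll_interval", 30))
            continue

        try:
            process_batch(spark, messages, config)
        except Exception as exc:
            # Leave the messages on the queue; SQS makes them visible again after the
            # visibility timeout, which is the retry path (and the DLQ path after max receives).
            print(f"Batch failed and will be retried: {exc}")
            continue

        # Delete only the messages that were processed successfully.
        sqs.delete_message_batch(
            QueueUrl=config["queue_url"],
            Entries=[
                {"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
                for m in messages
            ],
        )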

  2. The process_batch function processes a batch of SQS messages (see the sketch after this list):

  • Extracts S3 file paths from messages.
  • Loads files into Spark as DataFrames.
  • Writes processed data into the target table format (Hudi, Iceberg, or Delta Lake).
  • Logs metrics and handles errors gracefully.
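A matching sketch, again with assumed names, and with a Hudi write shown purely as one example target (hudi_options would carry the required table name and record-key settings):

import json
from urllib.parse import unquote_plus


def process_batch(spark, messages, config: dict) -> None:
    # 1. Pull the S3 object URIs out of the event notifications carried by each SQS message.
    paths = []
    for msg in messages:
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):  # s3:TestEvent messages carry no Records
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            paths.append(f"s3://{bucket}/{key}")  # use s3a:// outside EMR/Glue

    if not paths:
        return

    # 2. Load the new files as one DataFrame; the source file format comes from the config.
    df = spark.read.format(config.get("file_format", "json")).load(paths)

    # 3. Append into the target table. Hudi is shown here; Iceberg and Delta writers look similar.
    (
        df.write.format("hudi")
        .options(**config.get("hudi_options", {}))
        .mode("append")
        .save(config["target_path"])
    )

    # 4. Emit simple metrics so failures and throughput are visible in the logs.
    print(f"Ingested {df.count()} rows from {len(paths)} new S3 objects")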

Benefits of the Approach

  1. Cost Optimization: Avoids expensive S3 list() calls by using event-driven processing.
  2. Scalability: Supports massive data ingestion workflows with minimal overhead.
  3. Error Management: Ensures data integrity through DLQs for failed events.
  4. Flexibility: Compatible with multiple AWS services and open-source table formats.


Conclusion

This Python template simplifies the process of ingesting data incrementally from S3 into your Lakehouse architecture. By combining S3 event notifications, SQS for message buffering, and Spark for data processing, you can create a robust, efficient, and scalable pipeline that supports modern table formats like Iceberg, Hudi, and Delta Lake.

For more details, explore the GitHub repository.
