LakeBoost: Maximizing Efficiency in Data Lake (Hudi) Glue ETL Jobs with a Templated Approach and Serverless Architecture (with Source Code)

Author:

Soumil Nitin Shah

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable and high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing ingestion frameworks and building microservices and scalable architectures on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries by creating partitions and using the right file formats and compression. I have also developed and worked on streaming applications that ingest real-time stream data via Kinesis and Firehose into Elasticsearch.

April Love Ituhat (Software Engineer, Python)

I have a bachelor's degree in Computer Engineering and have spent the last three years working on research and development tasks involving diverse domains such as AWS, machine learning, robot simulations, and IoT. I've been a part of the JobTarget data team since November 2021, and I usually work with Python and AWS. It's exciting for me to see the applications come to fruition.

Divyansh Patel

I'm a highly skilled and motivated professional with a Master's degree in Computer Science and extensive experience in Data Engineering and AWS Cloud Engineering. I'm currently working with the renowned industry expert Soumil Shah and thrive on tackling complex problems and delivering innovative solutions. My passion for problem-solving and commitment to excellence enables me to make a positive impact on any project or team I work with. I look forward to connecting and collaborating with like-minded professionals.

Songda Lei

I'm a software engineer with three years of experience in front-end web and app development using React, Flutter, and Figma. I also have a strong foundation in back-end development with Node.js, Spring Boot, Gin, Golang, and MySQL. Additionally, I have experience with AWS serverless technologies such as CDK, Lambda, DynamoDB, SQS, SNS, EventBridge, and Glue.

Demo Video


Introduction

In today's data-driven world, organizations are collecting and analyzing large amounts of data to gain insights and make informed decisions. As a result, the need for efficient and scalable big data solutions has increased significantly. Data lakes have emerged as a popular solution for storing and managing large amounts of structured and unstructured data.

However, managing data lakes can be challenging, especially when it comes to ingesting and processing large volumes of data. Extract, Transform, Load (ETL) jobs are a critical component of data lake management. ETL jobs are used to extract data from various sources, transform it to fit the data lake schema, and load it into the data lake.

Apache Hudi is an open-source data management framework that provides features such as incremental data processing and data change management for data lakes. AWS Glue is a fully managed ETL service that can run ETL jobs on data stored in AWS services such as Amazon S3, Amazon DynamoDB, and Amazon RDS. By using Apache Hudi and AWS Glue together, organizations can create a powerful data lake solution. In this paper, we'll discuss how we've maximized efficiency in our data lake Glue ETL jobs with a templated approach and serverless architecture.

Project Architecture:


Our data lake solution consists of the following components:

Data Storage: We use Amazon S3 to store data in our data lake. Amazon S3 is a highly scalable and durable object storage service that provides high availability and fault tolerance.

Apache Hudi: We use Apache Hudi to manage incremental data processing and data change management in our data lake. Apache Hudi provides features such as DeltaStreamer and Compaction that make data processing and management more efficient.

AWS Glue: We use AWS Glue to run ETL jobs on data stored in our data lake. AWS Glue is a fully managed ETL service that can automatically discover and catalog data, generate ETL code, and execute ETL jobs.

SQL Transformer: We have also included a SQL-based transformer in our data lake solution that allows for easy data transformation by passing SQL queries as input payloads.

Lambda Function: We use a Lambda function to trigger Glue ETL jobs based on metadata read from a DynamoDB table. The Lambda function is triggered on a CRON schedule and reads the metadata to determine the appropriate parameters for the Glue ETL job.

DynamoDB: We use DynamoDB to store metadata for our Glue ETL jobs. The metadata includes job-specific parameters such as input path, output path, and configurations for each job.

API-Based Microservice: We use an API-based microservice hosted on ECS to allow developers to interact with Swagger UI and set up new jobs for tables easily.

Explanation:


Figure: Shows Sample tables in Raw zone in S3

Let's say you have raw tables that you want to ingest into a Hudi transactional data lake. Writing a separate ETL job for each table does not scale: with 100 tables you would need 100 ETL jobs, plus the infrastructure code to deploy and manage all of them. That's where this framework comes in handy.


Figure: Shows Swagger UI for Creating Ingestion Jobs

Payload to Create Jobs


Figure: Shows Sample Payload

The user can set up these tables without writing any ETL code by creating an ingestion job in seconds. They can specify the job creator, schedule the job through an API, and define the source, target, and transformation as a SQL query. In the payload, the user can specify the job's active status, creator, schedule, job name, Lambda ARN, table name, and Glue payload. The Glue payload includes options such as enabling the cleaner and Hive sync, partition fields, source and target S3 paths, and the SQL transformer query used for transformation.
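
As a rough illustration, such a payload could look like the Python dictionary below. The field names are assumptions based on the description above, not the exact schema expected by the API.

# Hypothetical create-job payload; field names are illustrative only.
sample_payload = {
    "is_active": True,
    "created_by": "data.engineer@example.com",
    "schedule": "cron(0 6 * * ? *)",   # CRON expression for the EventBridge rule
    "job_name": "ingest_orders_raw_to_hudi",
    "lambda_arn": "arn:aws:lambda:us-east-1:111111111111:function:lakeboost-trigger",
    "table_name": "orders",
    "glue_payload": {
        "enable_cleaner": True,
        "enable_hive_sync": True,
        "partition_fields": "order_date",
        "source_s3_path": "s3://my-raw-zone/orders/",
        "target_s3_path": "s3://my-hudi-lake/orders/",
        "sql_transformer_query": "SELECT * FROM temp WHERE order_status IS NOT NULL"
    }
}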


Figure: Shows EventBridge Rule Created for the Payload Shown Above


After the request is made through the API, an EventBridge rule is created that triggers the Lambda function based on the user's CRON expression. The metadata about the job is stored in DynamoDB, as shown in the figure. The EventBridge rule passes a primary key and sort key to the Lambda function when the schedule fires. The Lambda function queries DynamoDB to fetch all the job parameters and then starts the Glue job. Users can also choose to trigger a job manually if needed.
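
A minimal sketch of such a trigger Lambda is shown below. It assumes the EventBridge rule passes the DynamoDB primary key and sort key in the event payload; the table and Glue job names are placeholders, not the exact resources used in the project.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
glue = boto3.client("glue")

# Placeholder resource names; substitute your own.
JOBS_TABLE = "lakeboost_job_metadata"
GLUE_JOB_NAME = "lakeboost_hudi_template_job"


def lambda_handler(event, context):
    # The EventBridge rule passes the primary key and sort key of the job entry.
    pk = event["pk"]
    sk = event["sk"]

    # Look up the job's parameters (source path, target path, Hudi options, ...).
    item = dynamodb.Table(JOBS_TABLE).get_item(Key={"pk": pk, "sk": sk})["Item"]

    # Glue job arguments must be strings, so the payload is serialized to JSON.
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={"--GLUE_PAYLOAD": json.dumps(item["glue_payload"], default=str)},
    )
    return {"statusCode": 200, "jobRunId": response["JobRunId"]}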

Deep Dive: Glue Template

Sample Payload to Set Up Jobs


Template Code

Define imports

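The import block appears as a screenshot in the original article; a typical set of imports for a Glue Hudi job along these lines would look roughly like this (a sketch, not the exact template code).

import sys
import json

# getResolvedOptions reads job arguments such as the serialized Glue payload.
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession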

Define Spark session

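The exact session setup is in the article's screenshot; continuing from the imports above, a common way to configure Spark for Hudi on Glue is sketched below (the Kryo serializer setting is one Hudi generally requires).

# Build a Spark session with settings Hudi typically needs on Glue,
# then wrap it in a GlueContext and Job.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .getOrCreate()
)
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)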

Method to Upsert into Hudi


This Python function performs an upsert on a DataFrame and writes the result to a Hudi table. It takes several parameters: the name of the Glue database, the name of the Hudi table, the field in the DataFrame to use as the record key, the Hudi table type (e.g., COPY_ON_WRITE or MERGE_ON_READ), the DataFrame to upsert, and several boolean flags that control the behavior of the operation (for example, whether to enable partitioning or the data cleaner).

The function first sets up a dictionary of settings that will be used when writing the dataframe to the Hudi table. This includes basic settings such as the name of the Hudi table and the record key field, as well as more advanced settings such as the compression codec to use and whether or not to sync with Hive.

The function then checks if any SQL transformations need to be applied to the dataframe before upserting it into the Hudi table. If so, it creates a temporary view of the dataframe and applies the specified SQL query to transform the data.

Finally, the function writes the dataframe to the target Hudi table using the specified settings and Hudi write method.
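
The screenshots above contain the full implementation. As a condensed sketch of the same idea, with assumed parameter names and only a representative subset of Hudi options, the function could look like this (it reuses the Spark session defined earlier).

def upsert_hudi_table(glue_database, table_name, record_id, precomb_key,
                      table_type, spark_df, enable_partition, enable_cleaner,
                      enable_hive_sync, partition_field, target_path,
                      sql_transformer_query=None, method="upsert"):
    """Apply an optional SQL transform, then upsert the DataFrame into a Hudi table."""
    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,   # COPY_ON_WRITE or MERGE_ON_READ
        "hoodie.datasource.write.operation": method,
        "hoodie.datasource.write.recordkey.field": record_id,
        "hoodie.datasource.write.precombine.field": precomb_key,
        "hoodie.parquet.compression.codec": "gzip",
    }
    if enable_partition:
        hudi_options["hoodie.datasource.write.partitionpath.field"] = partition_field
    if enable_cleaner:
        hudi_options["hoodie.clean.automatic"] = "true"
    if enable_hive_sync:
        hudi_options.update({
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.database": glue_database,
            "hoodie.datasource.hive_sync.table": table_name,
            "hoodie.datasource.hive_sync.mode": "hms",
            "hoodie.datasource.hive_sync.use_jdbc": "false",
        })

    # Optional SQL transformation: expose the DataFrame as a temp view and
    # apply the user-supplied query before writing.
    if sql_transformer_query:
        spark_df.createOrReplaceTempView("temp")
        spark_df = spark.sql(sql_transformer_query)

    spark_df.write.format("hudi").options(**hudi_options).mode("append").save(target_path)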


This code defines a function called read_data_s3, which reads data from an AWS S3 bucket and returns it as a Spark DataFrame. The function takes three arguments: path, format, and table_name.

The path argument is a string representing the S3 path where the data is stored. The format argument is a string representing the file format of the data (e.g., parquet). The table_name argument is a string representing the name of the Glue table.

The function first checks whether the file format is either "parquet" or "json". If it is, it creates a dynamic frame from the S3 location using AWS Glue. The dynamic frame is then converted to a Spark DataFrame, and the first few rows are printed.

The code is designed to be easily extended to any file format that AWS Glue can read; this particular example only shows how to read JSON and Parquet files.
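
Based on that description, a minimal sketch of read_data_s3 could look like the following; the actual implementation in the template may differ. It reuses the glueContext created earlier.

def read_data_s3(path, format, table_name):
    """Read raw data from S3 through Glue and return it as a Spark DataFrame."""
    if format in ("parquet", "json"):
        dyf = glueContext.create_dynamic_frame.from_options(
            connection_type="s3",
            connection_options={"paths": [path], "recurse": True},
            format=format,
            transformation_ctx=f"read_{table_name}",
        )
        spark_df = dyf.toDF()
        spark_df.show(5)   # print the first few rows for a quick sanity check
        return spark_df
    raise ValueError(f"Unsupported format for table {table_name}: {format}")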

Source Code


Demo Video


Conclusion:

We've discussed how we maximized efficiency in our data lake Glue ETL jobs with a templated approach and serverless architecture. The templated approach reduces the amount of infrastructure code required to manage our data lake. Triggering Glue ETL jobs from a Lambda function based on metadata read from a DynamoDB table minimizes the manual intervention required to manage those jobs. And the serverless architecture minimizes the operational overhead of managing infrastructure and reduces costs. Overall, this approach has helped us streamline the process of ingesting new data into our data lake and manage our ETL jobs more efficiently.
