LakeBoost: Maximizing Efficiency in Data Lake (Hudi) Glue ETL Jobs with a Templated Approach and Serverless Architecture (with Source Code)

Author:

Soumil Nitin Shah

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable and high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing ingestion frameworks and building microservices and scalable architectures on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries by creating partitions and using the right file formats and compression. I have also developed and worked on streaming applications that ingest real-time stream data via Kinesis and Firehose into Elasticsearch.

April Love Ituhat (Software Engineer, Python)

I have a bachelor's degree in Computer Engineering and have spent the last three years working on research and development tasks involving diverse domains such as AWS, machine learning, robot simulations, and IoT. I've been a part of the JobTarget data team since November 2021, and I usually work with Python and AWS. It's exciting for me to see the applications come to fruition.

Divyansh Patel

I'm a highly skilled and motivated professional with a Master's degree in Computer Science and extensive experience in Data Engineering and AWS Cloud Engineering. I'm currently working with the renowned industry expert Soumil Shah and thrive on tackling complex problems and delivering innovative solutions. My passion for problem-solving and commitment to excellence enables me to make a positive impact on any project or team I work with. I look forward to connecting and collaborating with like-minded professionals.

Songda Lei

I'm a software engineer with three years of experience in front-end web and app development using React, Flutter, and Figma. I also have a strong foundation in back-end development with Node.js, Spring Boot, Gin, Golang, and MySQL. Additionally, I have experience with AWS serverless technologies such as CDK, Lambda, DynamoDB, SQS, SNS, EventBridge, and Glue.

Demo Video


Introduction

In today's data-driven world, organizations are collecting and analyzing large amounts of data to gain insights and make informed decisions. As a result, the need for efficient and scalable big data solutions has increased significantly. Data lakes have emerged as a popular solution for storing and managing large amounts of structured and unstructured data.

However, managing data lakes can be challenging, especially when it comes to ingesting and processing large volumes of data. Extract, Transform, Load (ETL) jobs are a critical component of data lake management. ETL jobs are used to extract data from various sources, transform it to fit the data lake schema, and load it into the data lake.

Apache Hudi is an open-source data management framework that provides features such as incremental data processing and data change management for data lakes. AWS Glue is a fully managed ETL service that can run ETL jobs on data stored in AWS services such as Amazon S3, Amazon DynamoDB, and Amazon RDS. By using Apache Hudi and AWS Glue together, organizations can create a powerful data lake solution. In this paper, we'll discuss how we've maximized efficiency in our data lake Glue ETL jobs with a templated approach and serverless architecture.

Project Architecture:


Our data lake solution consists of the following components:

Data Storage: We use Amazon S3 to store data in our data lake. Amazon S3 is a highly scalable and durable object storage service that provides high availability and fault tolerance.

Apache Hudi: We use Apache Hudi to manage incremental data processing and data change management in our data lake. Apache Hudi provides features such as DeltaStreamer and Compaction that make data processing and management more efficient.

AWS Glue: We use AWS Glue to run ETL jobs on data stored in our data lake. AWS Glue is a fully managed ETL service that can automatically discover and catalog data, generate ETL code, and execute ETL jobs.

SQL Transformer: We have also included a SQL-based transformer in our data lake solution that allows for easy data transformation by passing SQL queries as input payloads.

Lambda Function: We use a Lambda function to trigger Glue ETL jobs based on metadata read from a DynamoDB table. The Lambda function is triggered on a CRON schedule and reads the metadata to determine the appropriate parameters for the Glue ETL job.

DynamoDB: We use DynamoDB to store metadata for our Glue ETL jobs. The metadata includes job-specific parameters such as input path, output path, and configurations for each job.

API-Based Microservice: We use an API-based microservice hosted on ECS to allow developers to interact with Swagger UI and set up new jobs for tables easily.

Explanation:


Figure: Shows Sample tables in Raw zone in S3

Let's say you have raw tables that you want to ingest into a Hudi transactional data lake. Writing a separate ETL job for each table does not scale: with 100 tables you would need 100 ETL jobs, plus the infrastructure code to deploy and manage all of them. That's where this framework comes in handy.


Figure: Shows Swagger UI for Creating Ingestion Jobs

Payload to Create Jobs


Figure: Shows Sample Payload

The user can set up these tables without writing any ETL code by creating an ingestion job in seconds. They can specify the job creator, schedule the job through an API, and define the source, target, and transformation as a SQL query. In the payload, the user can specify the job's active status, creator, schedule, job name, Lambda ARN, table name, and Glue payload. The Glue payload includes options such as enabling the cleaner and Hive sync, partition fields, source and target S3 paths, and the SQL transformer query used for transformation.
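
As a rough illustration, such a payload could look like the Python dictionary below. The field names are assumptions based on the description above, not the exact schema expected by the API.

# Hypothetical create-job payload; field names are illustrative only.
sample_payload = {
    "is_active": True,
    "created_by": "data.engineer@example.com",
    "schedule": "cron(0 6 * * ? *)",   # CRON expression for the EventBridge rule
    "job_name": "ingest_orders_raw_to_hudi",
    "lambda_arn": "arn:aws:lambda:us-east-1:111111111111:function:lakeboost-trigger",
    "table_name": "orders",
    "glue_payload": {
        "enable_cleaner": True,
        "enable_hive_sync": True,
        "partition_fields": "order_date",
        "source_s3_path": "s3://my-raw-zone/orders/",
        "target_s3_path": "s3://my-hudi-lake/orders/",
        "sql_transformer_query": "SELECT * FROM temp WHERE order_status IS NOT NULL"
    }
}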


Figure: Shows EventBridge Rule Created for the Payload Shown Above


After the request is made through the API, an EventBridge rule is created that triggers the Lambda function based on the user's CRON expression. The metadata about the job is stored in DynamoDB, as shown in the figure. The EventBridge rule passes a primary key and sort key to the Lambda function when the schedule fires. The Lambda function queries DynamoDB to fetch all the job parameters and then starts the Glue job. Users can also choose to trigger a job manually if needed.
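
A minimal sketch of such a trigger Lambda is shown below. It assumes the EventBridge rule passes the DynamoDB primary key and sort key in the event payload; the table and Glue job names are placeholders, not the exact resources used in the project.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
glue = boto3.client("glue")

# Placeholder resource names; substitute your own.
JOBS_TABLE = "lakeboost_job_metadata"
GLUE_JOB_NAME = "lakeboost_hudi_template_job"


def lambda_handler(event, context):
    # The EventBridge rule passes the primary key and sort key of the job entry.
    pk = event["pk"]
    sk = event["sk"]

    # Look up the job's parameters (source path, target path, Hudi options, ...).
    item = dynamodb.Table(JOBS_TABLE).get_item(Key={"pk": pk, "sk": sk})["Item"]

    # Glue job arguments must be strings, so the payload is serialized to JSON.
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={"--GLUE_PAYLOAD": json.dumps(item["glue_payload"], default=str)},
    )
    return {"statusCode": 200, "jobRunId": response["JobRunId"]}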

Deep Dive: Glue Template

Sample Payload to Set Up Jobs


Template Code

Define imports

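The import block appears as a screenshot in the original article; a typical set of imports for a Glue Hudi job along these lines would look roughly like this (a sketch, not the exact template code).

import sys
import json

# getResolvedOptions reads job arguments such as the serialized Glue payload.
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession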

Define Spark session

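The exact session setup is in the article's screenshot; continuing from the imports above, a common way to configure Spark for Hudi on Glue is sketched below (the Kryo serializer setting is one Hudi generally requires).

# Build a Spark session with settings Hudi typically needs on Glue,
# then wrap it in a GlueContext and Job.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .getOrCreate()
)
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)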

Method to Upsert into Hudi


This Python function performs an upsert on a DataFrame and writes the result to a Hudi table. It takes several parameters: the name of the Glue database, the name of the Hudi table, the field in the DataFrame to use as the record key, the Hudi table type (e.g., COPY_ON_WRITE or MERGE_ON_READ), the DataFrame to upsert, and several boolean flags that control the behavior of the operation (for example, whether to enable partitioning or the data cleaner).

The function first sets up a dictionary of settings that will be used when writing the dataframe to the Hudi table. This includes basic settings such as the name of the Hudi table and the record key field, as well as more advanced settings such as the compression codec to use and whether or not to sync with Hive.

The function then checks if any SQL transformations need to be applied to the dataframe before upserting it into the Hudi table. If so, it creates a temporary view of the dataframe and applies the specified SQL query to transform the data.

Finally, the function writes the dataframe to the target Hudi table using the specified settings and Hudi write method.
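
The screenshots above contain the full implementation. As a condensed sketch of the same idea, with assumed parameter names and only a representative subset of Hudi options, the function could look like this (it reuses the Spark session defined earlier).

def upsert_hudi_table(glue_database, table_name, record_id, precomb_key,
                      table_type, spark_df, enable_partition, enable_cleaner,
                      enable_hive_sync, partition_field, target_path,
                      sql_transformer_query=None, method="upsert"):
    """Apply an optional SQL transform, then upsert the DataFrame into a Hudi table."""
    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,   # COPY_ON_WRITE or MERGE_ON_READ
        "hoodie.datasource.write.operation": method,
        "hoodie.datasource.write.recordkey.field": record_id,
        "hoodie.datasource.write.precombine.field": precomb_key,
        "hoodie.parquet.compression.codec": "gzip",
    }
    if enable_partition:
        hudi_options["hoodie.datasource.write.partitionpath.field"] = partition_field
    if enable_cleaner:
        hudi_options["hoodie.clean.automatic"] = "true"
    if enable_hive_sync:
        hudi_options.update({
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.database": glue_database,
            "hoodie.datasource.hive_sync.table": table_name,
            "hoodie.datasource.hive_sync.mode": "hms",
            "hoodie.datasource.hive_sync.use_jdbc": "false",
        })

    # Optional SQL transformation: expose the DataFrame as a temp view and
    # apply the user-supplied query before writing.
    if sql_transformer_query:
        spark_df.createOrReplaceTempView("temp")
        spark_df = spark.sql(sql_transformer_query)

    spark_df.write.format("hudi").options(**hudi_options).mode("append").save(target_path)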


This code defines a function called read_data_s3, which reads data from an AWS S3 bucket and returns it as a Spark DataFrame. The function takes three arguments: path, format, and table_name.

The path argument is a string representing the S3 path where the data is stored. The format argument is a string representing the file format of the data (e.g., parquet). The table_name argument is a string representing the name of the Glue table.

The function first checks whether the file format is either "parquet" or "json". If it is, it creates a dynamic frame from the S3 location using AWS Glue. The dynamic frame is then converted to a Spark DataFrame, and the first few rows are printed.

The code is designed to be easily extended to any file format that AWS Glue can read; this particular example only shows how to read JSON and Parquet files.
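
Based on that description, a minimal sketch of read_data_s3 could look like the following; the actual implementation in the template may differ. It reuses the glueContext created earlier.

def read_data_s3(path, format, table_name):
    """Read raw data from S3 through Glue and return it as a Spark DataFrame."""
    if format in ("parquet", "json"):
        dyf = glueContext.create_dynamic_frame.from_options(
            connection_type="s3",
            connection_options={"paths": [path], "recurse": True},
            format=format,
            transformation_ctx=f"read_{table_name}",
        )
        spark_df = dyf.toDF()
        spark_df.show(5)   # print the first few rows for a quick sanity check
        return spark_df
    raise ValueError(f"Unsupported format for table {table_name}: {format}")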

Source Code


Demo Video


Conclusion:

We've discussed how we maximized efficiency in our data lake Glue ETL jobs with a templated approach and serverless architecture. The templated approach reduces the amount of infrastructure code required to manage our data lake. Triggering Glue ETL jobs from a Lambda function based on metadata read from a DynamoDB table minimizes the manual intervention required to manage those jobs. And the serverless architecture minimizes the operational overhead of managing infrastructure and reduces costs. Overall, this approach has helped us streamline the process of ingesting new data into our data lake and manage our ETL jobs more efficiently.
