Build a Glue (Spark) Streaming Pipeline for Clickstreams, Power a Data Lake with Apache Hudi, and Query in Real Time with Athena

Author

Soumil N Shah

Software Developer

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I run a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a software engineer at JobTarget, where I spend most of my time developing an ingestion framework and creating microservices and scalable architectures on AWS. I have worked with massive amounts of data, including building data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on a streaming application that ingests real-time streams via Kinesis and Firehose into Elasticsearch.

Project overview

Organizations often gather enormous amounts of data and continue to produce ever-increasing amounts, ranging from terabytes to petabytes and occasionally even exabytes. Such data is typically produced by various systems and needs to be gathered in one place for analysis and insight creation. Streaming data is data that is generated continuously by thousands of data sources, which typically produce data records at high velocity and in high volume. In this article, I'll demonstrate how to create a streaming ingestion pipeline: ingest and process streaming data using AWS Glue, write it into Apache Hudi tables, and query the resulting data lake with tools like Athena.

Keywords: data, streaming, Glue, Apache Hudi

Introduction

There is no single definition of “big data,” but in general it refers to data sets that are too large or complex for traditional data processing and analysis tools. Big data often includes data that is unstructured, meaning it does not fit neatly into traditional databases. It can come from sources such as social media, sensors, and transactional systems, and its volume ranges from terabytes to petabytes and occasionally even exabytes.

Data that is continuously created from many sources, such as a website's log files or the page analytics on an e-commerce site, is known as stream data. The size and significance of this data increase at the same rate as system traffic. Amazon Kinesis Data Streams provides a scalable, cost-effective, and durable streaming data solution: it can collect terabytes of data every second from tens of thousands of sources, including internet clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.

A data lakehouse is a data management architecture that uses a lake as the central data repository. The architecture described here includes three main components: the data lake, the data warehouse, and the data mart. The data lake is a central repository for all data, both structured and unstructured. The data warehouse is a centralized repository for structured data. A data mart is a subset of the data warehouse focused on a specific business line or team.

What is Apache Hudi, and why should we use it for data lakes?

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. The framework manages business requirements like the data lifecycle more efficiently and improves data quality. Hudi enables you to manage data at the record level in Amazon S3 data lakes, which simplifies Change Data Capture (CDC) and streaming data ingestion and helps handle data privacy use cases requiring record-level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and the AWS Glue Data Catalog give you near-real-time access to updated data using familiar tools (Amazon).
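
To make record-level upserts concrete, here is a minimal sketch of writing a Spark DataFrame to a Hudi table on S3. The table name, key fields, and S3 path are hypothetical placeholders, and the snippet assumes the Hudi libraries are on the Spark classpath (for example via the Apache Hudi Connector for AWS Glue).

```python
from pyspark.sql import SparkSession

# Minimal sketch: upsert a Spark DataFrame into an Apache Hudi table on S3.
# All table names, field names, and paths below are placeholders.
spark = SparkSession.builder.appName("hudi-write-sketch").getOrCreate()

df = spark.createDataFrame(
    [("e1", "u1", "/home", "2022-11-20T10:00:00Z", "2022-11-20")],
    ["event_id", "user_id", "page", "event_time", "event_date"],
)

hudi_options = {
    "hoodie.table.name": "clickstream_events",                 # placeholder table name
    "hoodie.datasource.write.recordkey.field": "event_id",     # unique key per record
    "hoodie.datasource.write.precombine.field": "event_time",  # latest value wins on upsert
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",             # record-level updates, not append-only
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-datalake-bucket/hudi/clickstream_events/"))  # placeholder path
```

The precombine field is what makes streaming upserts deterministic: when two incoming records share the same record key, Hudi keeps the one with the larger precombine value.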

Architecture:


Users can publish data into Kinesis using various methods, such as through microservices, Amazon EventBridge, or direct PutRecord calls using the boto3 SDK. We will configure the Kinesis data stream with auto scaling and set the data retention period to 24 hours. Once the data is inserted into Kinesis Data Streams, we will process the streaming data using Glue, and after processing each batch we will upsert the records into Apache Hudi tables in our data lake. A sketch of the stream setup follows.
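
As a sketch, the stream setup described above might look like the following with boto3. I interpret "auto scaling" here as on-demand capacity mode; the stream name and region are placeholders. Note that newly created streams already default to the 24-hour retention this pipeline relies on.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

# Create the stream in on-demand capacity mode so shard capacity scales
# automatically with traffic. New streams default to a 24-hour retention
# period, which is exactly the retention this pipeline needs.
kinesis.create_stream(
    StreamName="clickstream-demo",  # hypothetical stream name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Block until the stream is ACTIVE before producers start writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream-demo")
```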

We will use a Python script to generate fake data and insert it into Kinesis Data Streams. We then created an AWS Glue streaming ETL job to take this streaming data, write the ingested and transformed data to Amazon S3 using the Apache Hudi Connector for AWS Glue, and generate a table in the AWS Glue Data Catalog. After the data has been ingested, Hudi partitions the dataset into a directory structure under a base path pointing to a location in Amazon S3. A sketch of the data generator follows.
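
Below is a hedged sketch of such a generator. The record schema, stream name, and use of the Faker library are my illustrative assumptions; any fake-data approach works as long as the records carry the key, precombine, and partition fields the Hudi table expects.

```python
import json
import random
import uuid
from datetime import datetime, timezone

import boto3
from faker import Faker  # assumption: Faker generates the fake field values

fake = Faker()
kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

def fake_click_event():
    """Build one fake clickstream record; the schema is illustrative."""
    now = datetime.now(timezone.utc)
    return {
        "event_id": str(uuid.uuid4()),                 # Hudi record key
        "user_id": fake.uuid4(),
        "page": fake.uri_path(),
        "event_time": now.isoformat(),                 # Hudi precombine field
        "event_date": now.strftime("%Y-%m-%d"),        # Hudi partition field
        "device": random.choice(["mobile", "desktop", "tablet"]),
    }

for _ in range(1000):
    event = fake_click_event()
    # PartitionKey controls shard distribution; keying by user_id keeps a
    # given user's events ordered within a shard.
    kinesis.put_record(
        StreamName="clickstream-demo",  # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )
```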

Hands-on lab with code

Special thanks to the authors of "Build a serverless pipeline to analyze streaming data using AWS Glue, Apache Hudi, and Amazon S3," from which I was able to learn and implement my own version using a serverless framework.

Code
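
The full job script is not reproduced here; the skeleton below is a hedged sketch of the typical shape of a Glue streaming ETL job that reads from a Kinesis-backed Data Catalog table and upserts each micro-batch into a Hudi table, following the AWS blog post referenced above. The database, table, and S3 paths are placeholders, and the job assumes the Hudi connector is attached to the Glue job.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hudi write options; names and keys are hypothetical placeholders.
HUDI_OPTIONS = {
    "hoodie.table.name": "clickstream_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_time",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

def process_batch(data_frame, batch_id):
    """Upsert one micro-batch into the Hudi table on S3."""
    if data_frame.count() > 0:
        (data_frame.write.format("hudi")
            .options(**HUDI_OPTIONS)
            .mode("append")
            .save("s3://my-datalake-bucket/hudi/clickstream_events/"))

# Read the stream through a Data Catalog table that points at the Kinesis
# data stream (created beforehand; database/table names are placeholders).
source_df = glue_context.create_data_frame.from_catalog(
    database="hudi_demo",
    table_name="clickstream_demo_stream",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

# Glue micro-batches the stream and calls process_batch once per window.
glue_context.forEachBatch(
    frame=source_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-datalake-bucket/checkpoints/clickstream/",
    },
)
job.commit()
```

Once the job is running and the table is synced to the Glue Data Catalog, it can be queried from Athena with plain SQL, for example: SELECT * FROM hudi_demo.clickstream_events LIMIT 10.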


References

“Build a Serverless Pipeline to Analyze Streaming Data Using AWS Glue, Apache Hudi, and Amazon S3.” Amazon, aws.amazon.com/blogs/big-data/build-a-serverless-pipeline-to-analyze-streaming-data-using-aws-glue-apache-hudi-and-amazon-s3. Accessed 20 Nov. 2022.

“Writing to Apache Hudi Tables Using AWS Glue Custom Connector.” Amazon, aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector. Accessed 20 Nov. 2022.

“Amazon Kinesis Data Streams.” Medium, medium.com/@elifnurber/amazon-kinesis-data-streams-5e4f9a6b6176. Accessed 20 Nov. 2022.

“Data Lakehouse: Building the Next Generation of Data Lakes Using Apache Hudi.” Medium, medium.com/slalom-build/data-lakehouse-building-the-next-generation-of-data-lakes-using-apache-hudi-41550f62f5f. Accessed 20 Nov. 2022.

“Apache Hudi on Amazon EMR.” Amazon, aws.amazon.com/emr/features/hudi/#:~:text=Hudi%20enables%20you%20to%20manage,record%20level%20updates%20and%20deletes. Accessed 20 Nov. 2022.

“How NerdWallet Uses AWS and Apache Hudi to Build a Serverless, Real-time Analytics Platform.” Amazon, aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform. Accessed 20 Nov. 2022.
