Build a Glue (Spark) Streaming Pipeline for Clickstreams, Power a Data Lake with Apache Hudi, and Query in Real Time with Athena

Author

Soumil N Shah

Software Developer

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I run a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a software engineer at JobTarget, where I spend most of my time developing an ingestion framework and creating microservices and scalable architectures on AWS. I have worked with massive amounts of data, including building data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on a streaming application that ingests real-time streams via Kinesis and Firehose into Elasticsearch.

Project overview

Organizations often gather enormous amounts of data and continue to produce ever-increasing amounts, ranging from terabytes to petabytes and occasionally even exabytes. Such data is typically produced by various systems and needs to be gathered in one place for analysis and insight creation. Streaming data is data that is generated continuously by thousands of data sources, which typically produce data records at high velocity and in high volume. In this article, I'll demonstrate how to create a streaming ingestion pipeline: ingest and process streaming data using AWS Glue, write it into Apache Hudi tables, and query the resulting data lake with tools like Athena.

Keywords: data, streaming, Glue, Apache Hudi

Introduction

There is no single definition of “big data,” but in general it refers to data sets that are too large or complex for traditional data processing and analysis tools. Big data often includes data that is unstructured, meaning it does not fit neatly into traditional databases. It can come from sources such as social media, sensors, and transactional systems, and its volume ranges from terabytes to petabytes and occasionally even exabytes.

Data that is continuously created from many sources, such as a website's log files or the page analytics on an e-commerce site, is known as stream data. The size and significance of this data increase at the same rate as system traffic. Amazon Kinesis Data Streams provides a scalable, cost-effective, and durable streaming data solution: it can collect terabytes of data every second from tens of thousands of sources, including internet clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.

A data lakehouse is a data management architecture that uses a lake as the central data repository. The architecture described here includes three main components: the data lake, the data warehouse, and the data mart. The data lake is a central repository for all data, both structured and unstructured. The data warehouse is a centralized repository for structured data. A data mart is a subset of the data warehouse focused on a specific business line or team.

What is Apache Hudi, and why should we use it for data lakes?

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. The framework manages business requirements like the data lifecycle more efficiently and improves data quality. Hudi enables you to manage data at the record level in Amazon S3 data lakes, which simplifies Change Data Capture (CDC) and streaming data ingestion and helps handle data privacy use cases requiring record-level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and the AWS Glue Data Catalog give you near-real-time access to updated data using familiar tools (Amazon).
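
To make record-level upserts concrete, here is a minimal sketch of writing a Spark DataFrame to a Hudi table on S3. The table name, key fields, and S3 path are hypothetical placeholders, and the snippet assumes the Hudi libraries are on the Spark classpath (for example via the Apache Hudi Connector for AWS Glue).

```python
from pyspark.sql import SparkSession

# Minimal sketch: upsert a Spark DataFrame into an Apache Hudi table on S3.
# All table names, field names, and paths below are placeholders.
spark = SparkSession.builder.appName("hudi-write-sketch").getOrCreate()

df = spark.createDataFrame(
    [("e1", "u1", "/home", "2022-11-20T10:00:00Z", "2022-11-20")],
    ["event_id", "user_id", "page", "event_time", "event_date"],
)

hudi_options = {
    "hoodie.table.name": "clickstream_events",                 # placeholder table name
    "hoodie.datasource.write.recordkey.field": "event_id",     # unique key per record
    "hoodie.datasource.write.precombine.field": "event_time",  # latest value wins on upsert
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",             # record-level updates, not append-only
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-datalake-bucket/hudi/clickstream_events/"))  # placeholder path
```

The precombine field is what makes streaming upserts deterministic: when two incoming records share the same record key, Hudi keeps the one with the larger precombine value.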

Architecture:


Users can publish data into Kinesis using various methods, such as through microservices, Amazon EventBridge, or direct PutRecord calls using the boto3 SDK. We will configure the Kinesis data stream with auto scaling and set the data retention period to 24 hours. Once the data is inserted into Kinesis Data Streams, we will process the streaming data using Glue, and after processing each batch we will upsert the records into Apache Hudi tables in our data lake. A sketch of the stream setup follows.
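
As a sketch, the stream setup described above might look like the following with boto3. I interpret "auto scaling" here as on-demand capacity mode; the stream name and region are placeholders. Note that newly created streams already default to the 24-hour retention this pipeline relies on.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

# Create the stream in on-demand capacity mode so shard capacity scales
# automatically with traffic. New streams default to a 24-hour retention
# period, which is exactly the retention this pipeline needs.
kinesis.create_stream(
    StreamName="clickstream-demo",  # hypothetical stream name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Block until the stream is ACTIVE before producers start writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream-demo")
```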

We will use a Python script to generate fake data and insert it into Kinesis Data Streams. We then created an AWS Glue streaming ETL job to take this streaming data, write the ingested and transformed data to Amazon S3 using the Apache Hudi Connector for AWS Glue, and generate a table in the AWS Glue Data Catalog. After the data has been ingested, Hudi partitions the dataset into a directory structure under a base path pointing to a location in Amazon S3. A sketch of the data generator follows.
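
Below is a hedged sketch of such a generator. The record schema, stream name, and use of the Faker library are my illustrative assumptions; any fake-data approach works as long as the records carry the key, precombine, and partition fields the Hudi table expects.

```python
import json
import random
import uuid
from datetime import datetime, timezone

import boto3
from faker import Faker  # assumption: Faker generates the fake field values

fake = Faker()
kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

def fake_click_event():
    """Build one fake clickstream record; the schema is illustrative."""
    now = datetime.now(timezone.utc)
    return {
        "event_id": str(uuid.uuid4()),                 # Hudi record key
        "user_id": fake.uuid4(),
        "page": fake.uri_path(),
        "event_time": now.isoformat(),                 # Hudi precombine field
        "event_date": now.strftime("%Y-%m-%d"),        # Hudi partition field
        "device": random.choice(["mobile", "desktop", "tablet"]),
    }

for _ in range(1000):
    event = fake_click_event()
    # PartitionKey controls shard distribution; keying by user_id keeps a
    # given user's events ordered within a shard.
    kinesis.put_record(
        StreamName="clickstream-demo",  # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )
```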

Hands-on lab with code

Special thanks to the authors of "Build a serverless pipeline to analyze streaming data using AWS Glue, Apache Hudi, and Amazon S3," from which I was able to learn and implement my own version using a serverless framework.

Code
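
The full job script is not reproduced here; the skeleton below is a hedged sketch of the typical shape of a Glue streaming ETL job that reads from a Kinesis-backed Data Catalog table and upserts each micro-batch into a Hudi table, following the AWS blog post referenced above. The database, table, and S3 paths are placeholders, and the job assumes the Hudi connector is attached to the Glue job.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hudi write options; names and keys are hypothetical placeholders.
HUDI_OPTIONS = {
    "hoodie.table.name": "clickstream_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_time",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

def process_batch(data_frame, batch_id):
    """Upsert one micro-batch into the Hudi table on S3."""
    if data_frame.count() > 0:
        (data_frame.write.format("hudi")
            .options(**HUDI_OPTIONS)
            .mode("append")
            .save("s3://my-datalake-bucket/hudi/clickstream_events/"))

# Read the stream through a Data Catalog table that points at the Kinesis
# data stream (created beforehand; database/table names are placeholders).
source_df = glue_context.create_data_frame.from_catalog(
    database="hudi_demo",
    table_name="clickstream_demo_stream",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

# Glue micro-batches the stream and calls process_batch once per window.
glue_context.forEachBatch(
    frame=source_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-datalake-bucket/checkpoints/clickstream/",
    },
)
job.commit()
```

Once the job is running and the table is synced to the Glue Data Catalog, it can be queried from Athena with plain SQL, for example: SELECT * FROM hudi_demo.clickstream_events LIMIT 10.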


References

“Build a Serverless Pipeline to Analyze Streaming Data Using AWS Glue, Apache Hudi, and Amazon S3.” Amazon, aws.amazon.com/blogs/big-data/build-a-serverless-pipeline-to-analyze-streaming-data-using-aws-glue-apache-hudi-and-amazon-s3. Accessed 20 Nov. 2022.

“Writing to Apache Hudi Tables Using AWS Glue Custom Connector.” Amazon, aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector. Accessed 20 Nov. 2022.

“Amazon Kinesis Data Streams.” Medium, medium.com/@elifnurber/amazon-kinesis-data-streams-5e4f9a6b6176. Accessed 20 Nov. 2022.

“Data Lakehouse: Building the Next Generation of Data Lakes Using Apache Hudi.” Medium, medium.com/slalom-build/data-lakehouse-building-the-next-generation-of-data-lakes-using-apache-hudi-41550f62f5f. Accessed 20 Nov. 2022.

“Apache Hudi on Amazon EMR.” Amazon, aws.amazon.com/emr/features/hudi/#:~:text=Hudi%20enables%20you%20to%20manage,record%20level%20updates%20and%20deletes. Accessed 20 Nov. 2022.

“How NerdWallet Uses AWS and Apache Hudi to Build a Serverless, Real-time Analytics Platform.” Amazon, aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform. Accessed 20 Nov. 2022.
