Fully Automated Data Ingestion Pipeline (Ingesting 1.2 TB) into Elasticsearch Using AWS Step Functions, Lambda, and Firehose

Soumil Nitin Shah (Data Collection and Processing Team Lead)

I have a bachelor's degree in electronics engineering and a master's degree in electrical and computer engineering. I have extensive experience designing scalable, high-performance software systems in Python, and I teach data science, machine learning, Elasticsearch, and AWS on my YouTube channel. I work as the Data Collection and Processing Team Lead, and I spend most of my time on AWS designing ingestion frameworks, microservices, and scalable architectures. I have worked with large quantities of data, including building data lakes (1.2 TB) and improving data lake queries through partitioning and the appropriate file formats and compression. I have also built and worked on a streaming application that uses Kinesis to consume real-time stream data, and I have maintained a production Elasticsearch cluster with about 2 TB of data, achieving a 55x improvement in search response time. I created a highly available microservice with a multi-region Active-Active backend using Route 53, DynamoDB global tables, and API Gateway. On a daily basis, I work with AWS Lambda, SQS, EventBridge, AWS Batch, SNS, ECS, OpenSearch, DynamoDB, RDS, Route 53, and many other AWS services.

Here are some popular articles I have published:

Batch Framework (an internal data ingestion framework that processes 1 TB of data per month and runs 200+ jobs)

Link: https://www.dhirubhai.net/pulse/batch-frameworkan-internal-data-ingestion-framework-process-shah/

How we got 50X faster Speed for querying Data Lake using Athena Query & saved Thousands of dollars | Case Study

Link: https://www.dhirubhai.net/pulse/how-we-got-50x-faster-speed-querying-data-lake-using-athena-shah/

Elastic Search Performance Tuning and Optimization How We Got 80X Faster Searches a Case Study

https://www.dhirubhai.net/pulse/elastic-search-performance-tuning-optimization-how-we-soumil-shah/

Birendra Singh (Python Developer | Search Engineer | Data Specialist)

Hi, I am Birendra Singh. I completed a bachelor's degree in electronic engineering. I love and enjoy working with Elasticsearch, and I have 6 years of professional experience across the software development lifecycle, spanning multiple sectors including telecommunications, geospatial, and oil and gas, with a focus on quality and on-time delivery.

Hari Om Dubey (Consultant Software Engineer, Python developer)

I have completed a Master's in Computer Application, and I have 5 years of experience developing software applications using Python and the Django framework. I love to code in Python, and creating a solution to a problem through code excites me. I have been working at JobTarget for the past 2 months as a Software Engineer on the Data Team.

Project Summary:

We regularly receive many files, about 7,800 GZ files, each containing around 100,000 records, roughly 100 million records in total. Each file must be read and pre-processed. Because these files are large, processing them takes a long time, creating a bottleneck. This was a labor-intensive and time-consuming task: we used to have a large codebase that read each file, processed it, and bulk-uploaded it to Elasticsearch, which took 5-7 days. I didn't like how tedious this operation was, so we decided to create a fully automated pipeline that could load all of this data in a fraction of the time. Hence we chose serverless components.

Tech Stack:

- AWS Step Functions

- AWS Lambda

- AWS Kinesis Data Streams

- AWS Kinesis Data Firehose

- Amazon OpenSearch Service

- API Gateway

- AWS S3

Why Serverless?

Serverless is a cloud-native design that allows you to delegate more of your operational tasks to AWS, resulting in increased agility and creativity. Serverless computing lets you build and run applications and services without having to worry about servers: infrastructure management duties such as server or cluster provisioning, patching, operating system maintenance, and capacity provisioning are all eliminated. You can build serverless applications for almost any type of application or backend service, and AWS takes care of everything needed to run and scale them with high availability (Ashish Patel, "AWS — Serverless Services on AWS," Medium, accessed May 27, 2022).

Architecture:


Figure 1: High-level architecture block


Figure 2: AWS Step Function workflow

Explanation:

The entire workflow is triggered by a single HTTP POST endpoint. The JSON body tells the Step Function where to get the data and provides all the metadata it needs to complete the task.

When the Step Function is invoked, the first thing it does is examine the JSON body and headers to make sure the input is valid JSON. If it is not, the workflow switches to an error mode and broadcasts an event to SNS, indicating that the pipeline has failed with an appropriate error message. Once the JSON has been checked and verified, a sample of the first few files is taken to ensure that the supplied S3 path is legitimate; if it is not, the pipeline fails and an email notification is sent explaining why.
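The validation step can be sketched as a small Lambda handler. The required field names (`s3_bucket`, `s3_prefix`, `index_name`) and the SNS topic ARN below are hypothetical placeholders, not the pipeline's actual contract:

```python
import json

# Hypothetical required fields; the real pipeline's input schema is not published.
REQUIRED_FIELDS = {"s3_bucket", "s3_prefix", "index_name"}


def parse_and_validate(raw_body):
    """Return (config, None) on success or (None, error_message) on failure."""
    try:
        body = json.loads(raw_body)
    except (TypeError, json.JSONDecodeError) as exc:
        return None, f"body is not valid JSON: {exc}"
    missing = REQUIRED_FIELDS - set(body)
    if missing:
        return None, f"missing required fields: {sorted(missing)}"
    return body, None


def handler(event, context):
    """First state of the workflow: validate input or broadcast an SNS event."""
    config, error = parse_and_validate(event.get("body", ""))
    if error:
        import boto3  # imported lazily so the pure validator is testable offline
        boto3.client("sns").publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:pipeline-errors",  # placeholder
            Subject="Ingestion pipeline failed",
            Message=error,
        )
        raise ValueError(error)
    return {"valid": True, "config": config}
```

Splitting the pure validator from the SNS side effect keeps the logic unit-testable without AWS credentials.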

The next step in the workflow is to validate the hardware specification: before starting ingestion, we make sure we have enough space on the cluster, and if we do, we create the mappings on the Elasticsearch cluster. We do not hard-code the mapping in code. The mapping is stored in a JSON file that resides in AWS S3, which gives us the ability to change things very easily and keeps the pipeline generic.
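A minimal sketch of the capacity check and mapping creation, assuming a boto3 S3 client and the `elasticsearch-py` client; the helper names and the 1.5x headroom factor are illustrative assumptions, not details from the original pipeline:

```python
import json


def has_capacity(free_bytes, estimated_ingest_bytes, headroom=1.5):
    """Return True if the cluster can absorb the ingest with headroom to spare."""
    return free_bytes >= estimated_ingest_bytes * headroom


def load_mapping(s3_client, bucket, key):
    """Fetch the index mapping JSON from S3 so it is never hard-coded."""
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())


def create_index(es_client, index_name, mapping):
    """Create the target index using the mapping pulled from S3."""
    es_client.indices.create(index=index_name, body=mapping)
```

Because the mapping lives in S3, changing a field type or analyzer is a file upload, not a code deployment.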

The next step of this process is to check whether we should start from the first file. If we have already processed a file, we don't want to process it again, which is why, after every file is processed, we store its meta-information on AWS S3. This gives us the ability to query the lake by filename: if the filename already exists and its status is successful, the file is not processed again. This block ensures that the same files are never processed twice. Once we decide which files we want to process, we pass those keys to the next step, a Map state, which processes the files. This is the most important step, where files are processed in parallel.
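The skip-already-processed logic reduces to a small pure function. The metadata record shape (`filename` and `status` keys) is an assumption based on the description above, not the pipeline's actual schema:

```python
def select_unprocessed(candidate_keys, processed_records):
    """Filter out files whose stored metadata says they already succeeded.

    processed_records mirrors the per-file metadata stored on S3 (and later
    queried via Athena), e.g. {"filename": "...", "status": "success"}.
    """
    done = {r["filename"] for r in processed_records if r["status"] == "success"}
    return [k for k in candidate_keys if k not in done]
```

The surviving keys become the input array for the Map state, so each item fans out to its own parallel execution.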

The Lambda functions bundle the pandas, Datadog (DDOG), and logging libraries, which allow us to process files and monitor process logs in Datadog. We used the Serverless Framework to deploy our Lambda code, which makes things very easy.
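A sketch of the per-file worker invoked by the Map state, assuming each GZ file contains newline-delimited JSON (the actual record format is not specified in the article); the pandas/Datadog wiring and the Firehose delivery step are elided:

```python
import gzip
import io
import json


def records_from_gz(gz_bytes):
    """Decode one GZ file of newline-delimited JSON into a list of dicts."""
    with gzip.open(io.BytesIO(gz_bytes), "rt") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def handler(event, context):
    """Per-file worker: read one GZ object from S3 and decode its records.

    `bucket` and `key` are assumed field names in the Map state's item payload.
    """
    import boto3
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=event["bucket"], Key=event["key"])["Body"].read()
    records = records_from_gz(body)
    # ...pre-processing and delivery to Firehose happen here...
    return {"record_count": len(records)}
```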


Figure: Retry scenarios

When a step fails, we have retry scenarios, which make orchestration very easy: we attempt to process each file a maximum of 3 times, after which the error is caught and moved to SQS so that the metadata can be updated on AWS S3. We run a Glue Crawler, which identifies the schema in AWS S3 and allows us to query the metadata in Athena to see which files have been processed and which have failed. AWS QuickSight provides a BI dashboard showing the same information in a beautiful, at-a-glance view.
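The retry-then-catch behaviour maps directly onto a Step Functions Task state. A sketch in Amazon States Language, with hypothetical state names and a placeholder Lambda ARN:

```json
"ProcessFile": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 5,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "SendToSQS"
    }
  ],
  "Next": "RecordSuccess"
}
```

`Retry` handles the 3 attempts with exponential backoff; only after all attempts are exhausted does `Catch` route the failure to the SQS-bound state.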

Metrics:?


Figure: We fire 10-20 Lambdas in parallel to process the files.

Each Lambda reads a file from AWS S3, processes it, and dumps the records to Firehose. We had to set a reserved concurrency of 15 because of a limitation on the Kinesis Data Firehose side.


Figure: Incoming bytes to Kinesis

Challenges with Kinesis

When Direct PUT is configured as the data source, each Kinesis Data Firehose delivery stream provides the following combined quota for PutRecord and PutRecordBatch requests:

For US East (N. Virginia), US West (Oregon), and Europe (Ireland): 500,000 records/second, 2,000 requests/second, and 5 MiB/second.

The maximum size of a record sent to Kinesis Data Firehose, before base64-encoding, is 1,000 KiB.

The PutRecordBatch operation can take up to 500 records per call or 4 MiB per call, whichever is smaller. This quota cannot be changed.

The buffer size hints range from 1 MiB to 128 MiB for Amazon S3 delivery. For Amazon OpenSearch Service delivery, they range from 1 MiB to 100 MiB. For AWS Lambda processing, you can set a buffering hint between 1 MiB and 3 MiB using the BufferSizeInMBs processor parameter. The size threshold is applied to the buffer before compression. These options are treated as hints: Kinesis Data Firehose might choose different values when optimal (Amazon, "Amazon Kinesis Data Firehose Quota").

In fact, we could have processed 1,000 files per 2 minutes, but we could not publish that much data to Firehose, and hence had to set the reserved concurrency.
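Given those quotas, a sender has to chunk records into batches of at most 500 records and 4 MiB before calling PutRecordBatch, and retry the individual sub-records that fail, since the call is not atomic. A sketch assuming newline-delimited JSON records and a boto3 Firehose client:

```python
import json

MAX_RECORDS = 500                  # PutRecordBatch hard cap on records per call
MAX_BATCH_BYTES = 4 * 1024 * 1024  # PutRecordBatch hard cap of 4 MiB per call


def make_batches(records):
    """Chunk dict records into PutRecordBatch-sized batches of {"Data": bytes}."""
    batches, current, size = [], [], 0
    for rec in records:
        data = (json.dumps(rec) + "\n").encode()
        if current and (len(current) >= MAX_RECORDS or size + len(data) > MAX_BATCH_BYTES):
            batches.append(current)
            current, size = [], 0
        current.append({"Data": data})
        size += len(data)
    if current:
        batches.append(current)
    return batches


def send_to_firehose(firehose_client, stream_name, records):
    """Send all records, retrying sub-records that the service rejects."""
    for batch in make_batches(records):
        resp = firehose_client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        if resp.get("FailedPutCount", 0):
            # RequestResponses is positionally aligned with the sent batch;
            # entries with an ErrorCode were throttled or rejected.
            retry = [r for r, s in zip(batch, resp["RequestResponses"]) if "ErrorCode" in s]
            firehose_client.put_record_batch(DeliveryStreamName=stream_name, Records=retry)
```

A production version would also back off between retries; this sketch only shows the batching and partial-failure handling.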


Figure: Records delivered to OpenSearch

Before running the pipeline, each Lambda was power-tuned so that the right amount of resources could be allocated. Why is power tuning important?

Choosing the memory allocated to Lambda functions is an optimization process that balances speed (duration) and cost. While you can manually run tests on functions by selecting different memory allocations and measuring the time taken to complete, the?AWS Lambda Power Tuning?tool allows you to automate the process.

This tool uses AWS Step Functions to run multiple concurrent versions of a Lambda function at different memory allocations and measure the performance. The input function is run in your AWS account, performing live HTTP calls and SDK interaction, to measure likely performance in a live production scenario.

Read more: https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html
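For reference, the open-source AWS Lambda Power Tuning state machine takes an input along these lines; the Lambda ARN is a placeholder, and `strategy` may be `cost`, `speed`, or `balanced` depending on what you optimize for:

```json
{
  "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
  "powerValues": [128, 256, 512, 1024, 1536, 3008],
  "num": 10,
  "payload": "{}",
  "parallelInvocation": true,
  "strategy": "cost"
}
```

The state machine invokes the function `num` times at each memory size in `powerValues` and reports the cheapest (or fastest) configuration.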


Figure: Document count. The pipeline was still running while this article was being written, hence you see 54M; we expect the final count to be around 90-100M. As you can see, we work with massive amounts of data and make sure search queries are optimized and cached to deliver the best performance in the application.

Thank you, and please post your questions in the comments. If you want to learn more, I have a series on AWS Step Functions.

Learn AWS Step Functions:

Beginner | Learn AWS Step Function in a Very Easy Way | Part #1

Beginner | Learn AWS Step Function in a Very Easy Way | Hello World | Part #2

Beginner | Learn AWS Step Function in a Very Easy Way | Retry Logic | Part #3

Beginner | Learn AWS Step Function in a Very Easy Way | Catch Logic | Part #4

Beginner | Learn AWS Step Function in a Very Easy Way | Catch Custom Error Logic | Part #4A

Beginner | Learn AWS Step Function in a Very Easy Way | Choice & Branching | Part #5

Learn AWS Step Function in a Very Easy Way | Process JSON Files in Batches | Part #6

AWS Step Function | Parallel Processing JSON Documents and Push Failed Items to DLQ | #7

Async Callback Pattern using AWS Step Function + SQS Queue + Lambda in Python
