Building a Serverless ETL Application in AWS
Ashwin Ashok
Let's start by understanding what serverless and an ETL pipeline are. Serverless means we run code without provisioning or managing any servers ourselves; the cloud provider allocates the compute on demand. An ETL pipeline is a set of tasks, each performing one of the three major functions - Extract (ingesting data from a source), Transform (reshaping and preparing the data), and Load (storing the prepared data where it can be queried and visualized).
Serverless ETL pipelines are workflows in which the tasks hand data off to one another through managed services, without any server we have to run ourselves, to get the ETL job completed.
Below is the architecture followed in building a serverless ETL application in AWS. All ETL pipelines have the three parts shown in the diagram.
PART 1: Data Ingestion
This is the part where data is extracted from external sources and ingested into AWS.
1. Store data in AWS S3 and use AWS Athena to query it.
Data in AWS is stored in S3 buckets, and the Athena query editor is used to explore the data within S3.
S3 buckets are created to store data and are accessed as needed by AWS services throughout the ETL pipeline.
Athena is AWS's serverless query service, used to work with the data and analyse what is in the tables.
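As a quick illustration, here is a minimal sketch of querying an S3-backed table from Python with boto3's Athena client. The database, table, and results-bucket names are placeholders for whatever you create in your own account.

```python
import time
import boto3

athena = boto3.client("athena")

# Placeholder names: swap in your own Glue database, table, and results bucket.
DATABASE = "serverless_etl_db"
OUTPUT = "s3://my-athena-query-results/"
QUERY = "SELECT * FROM taxi_availability_raw LIMIT 10"

def run_query(sql: str) -> list:
    """Submit a query to Athena and block until the results are ready."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]

    # Poll until Athena finishes (fine for small exploratory queries).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

print(run_query(QUERY))
```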
2. Use Lambda functions to ingest data to S3.
An AWS Lambda function (a Python script) ingests the data extracted from the source into S3.
Lambda is a serverless way of running Python code: scripts run in AWS without us having to provision a virtual machine (compute power).
Lambda functions are created in AWS to run the data ingestion scripts that pull data from external sources and push it into S3. We can then use Athena to check that the data is stored in the tables the way we need.
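Below is a minimal sketch of such an ingestion Lambda. The API endpoint and bucket name are placeholders; the real handler points at whatever source you ingest.

```python
import json
import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Placeholders: replace with your source API and landing bucket.
SOURCE_URL = "https://api.example.com/data"
BUCKET = "my-raw-data-bucket"

def lambda_handler(event, context):
    """Pull a JSON payload from the external source and drop it into S3."""
    with urllib.request.urlopen(SOURCE_URL) as response:
        payload = json.loads(response.read())

    # Timestamped key so repeated runs never overwrite each other.
    key = f"raw/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload))

    return {"statusCode": 200, "body": f"wrote s3://{BUCKET}/{key}"}
```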
3. Automate execution of Lambda functions with EventBridge triggers
Use a trigger on the Lambda function to ingest data into S3 automatically.
Triggers let us invoke or schedule Lambda functions without manual intervention.
EventBridge is the AWS service used to trigger Lambda functions on a recurring schedule. An EventBridge rule is created and attached to the Lambda function we want to invoke, and we can enable or disable it based on our need.
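In the console this is a couple of clicks; as a sketch, the same schedule can also be created with boto3. The rule name, function name, and ARN below are placeholders.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholders: your Lambda's name/ARN and whatever schedule you need.
FUNCTION_NAME = "taxi-ingestion-lambda"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:taxi-ingestion-lambda"
RULE_NAME = "ingest-every-5-minutes"

# 1. Create (or update) a scheduled rule.
rule_arn = events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)["RuleArn"]

# 2. Point the rule at the Lambda function.
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "ingest-lambda", "Arn": FUNCTION_ARN}])

# 3. Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="allow-eventbridge-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)
```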
PART 2: Data Transformation (the major part of an ETL pipeline)
1. Batch the ingested data using Amazon Data Firehose (previously known as Kinesis Data Firehose)
The ingested data can be batched, which is very helpful later when analysing a particular portion of the data we ingested.
Firehose is a serverless delivery service that buffers the data captured from the source into batches and stores them in S3. It is highly scalable and can handle practically any volume we send it.
A Firehose delivery stream is created and configured according to how we want our data batched. Attaching it to the Lambda function that extracts data from the source ingests the data into Firehose, which in turn places it in batches into S3.
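Here is a hedged sketch of writing records to a delivery stream from the ingestion Lambda, assuming a placeholder stream name; Firehose then handles the buffering and the batched writes to S3.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Placeholder: the delivery stream configured to land batches in S3.
STREAM_NAME = "taxi-availability-stream"

def push_records(records: list[dict]) -> None:
    """Send each record to Firehose; it buffers and writes batches to S3."""
    for record in records:
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            # Newline-delimited JSON keeps the batched files easy to crawl.
            Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
        )
```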
2. Build a table from the batched data using a Glue Crawler
A crawler scans all the data batched by Firehose and automatically builds a single table containing data from all the batches. The crawler is part of a larger service called AWS Glue.
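A minimal sketch of defining and starting such a crawler with boto3 is shown below, using placeholder names for the IAM role, database, and the S3 prefix Firehose writes to.

```python
import boto3

glue = boto3.client("glue")

# Placeholders: your IAM role, Glue database, and Firehose output prefix.
CRAWLER_NAME = "taxi-batches-crawler"
ROLE_ARN = "arn:aws:iam::123456789012:role/glue-crawler-role"
DATABASE = "serverless_etl_db"
S3_PATH = "s3://my-raw-data-bucket/firehose-batches/"

# Define the crawler once...
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=ROLE_ARN,
    DatabaseName=DATABASE,
    Targets={"S3Targets": [{"Path": S3_PATH}]},
)

# ...then run it whenever new batches have landed.
glue.start_crawler(Name=CRAWLER_NAME)
```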
3. Create jobs in AWS Glue, used in the Glue workflow to prepare the production data for visualization
Parquet is a column-store data format. Unlike a row-store format, which groups the data in a row as a single object, Parquet groups the data in a column as a single object. Storing column-grouped data is good because compression algorithms work best on homogeneous data, and queries that touch only a few columns scan less data and cost less. Parquet tables can also be partitioned, similar to how Firehose lays out its batches; a partitioned table lets a query read only the partitions it needs, reducing query cost and increasing query speed.
Each Glue job performs a single unit of work: a portion of the overall transformation required to turn the raw data into the desired dataset. In this design, each job submits a SQL transformation to AWS Athena, waits for the query results, and makes them available in S3 in the format required by the next job (see the sketch after the note below).
Important: an AWS Glue job runs on DPUs (Data Processing Units), each a unit of compute and memory, and we are charged per DPU-hour for every Glue job we run.
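As an illustration of one such job, here is a hedged sketch of a script that asks Athena to rebuild a partitioned Parquet table with a CTAS statement. The database, table, column, and bucket names are placeholders, and each real job in the workflow runs its own SQL.

```python
import time
import boto3

athena = boto3.client("athena")

# Placeholders: your Glue database and Athena results bucket.
DATABASE = "serverless_etl_db"
OUTPUT = "s3://my-athena-query-results/"

# CTAS: Athena writes the query result to S3 as partitioned Parquet and
# registers the new table in the Glue catalog in one step. The earlier
# Delete job ensures this table does not already exist.
CREATE_PARQUET_SQL = """
CREATE TABLE taxi_availability_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-prod-data-bucket/taxi_availability_parquet/',
    partitioned_by = ARRAY['ingest_date']
) AS
SELECT taxi_count, ingest_timestamp, ingest_date
FROM taxi_availability_raw
"""

def run_athena(sql: str) -> None:
    """Submit SQL to Athena and wait for it to finish."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

run_athena(CREATE_PARQUET_SQL)
```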
4. Build the Glue workflow to put together all the jobs created in the previous step and get the production data ready
A Glue workflow orchestrates the Glue jobs, executing them in sequence to transform raw data into production data.
The general workflow chains the crawler and four Glue jobs (a sketch of wiring them together follows this list).
a. Invoke the Glue crawler to crawl the data ingested by AWS Firehose and create the table for the next step of the workflow
b. Trigger the Delete Glue job (Glue Job 1) to remove any old parquet table in S3 bucket
c. Trigger the Create Glue job (Glue Job 2) to create fresh parquet format partitioned table and place it in S3
d. Trigger the Data Quality Check Glue job (Glue Job 3) to check the quality of data in the parquet table created by the Create Glue job.
e. Trigger the Prod Glue job (Glue Job 4) to store a final desirable version of the parquet table used for visualization.
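Below is a sketch of wiring the workflow together with boto3, assuming the crawler and the four jobs already exist under placeholder names; each conditional trigger fires when the previous step succeeds.

```python
import boto3

glue = boto3.client("glue")

# Placeholders: the crawler and job names created earlier.
WORKFLOW = "taxi-etl-workflow"
CRAWLER = "taxi-batches-crawler"
JOBS = ["delete_old_parquet", "create_parquet", "data_quality_check", "publish_prod"]

glue.create_workflow(Name=WORKFLOW, Description="Raw batches -> prod Parquet table")

# On-demand trigger starts the crawler when the workflow is run.
glue.create_trigger(
    Name="start-crawler",
    WorkflowName=WORKFLOW,
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": CRAWLER}],
)

# The first job fires after the crawler succeeds.
glue.create_trigger(
    Name=f"after-crawler-run-{JOBS[0]}",
    WorkflowName=WORKFLOW,
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={"Conditions": [
        {"LogicalOperator": "EQUALS", "CrawlerName": CRAWLER, "CrawlState": "SUCCEEDED"}
    ]},
    Actions=[{"JobName": JOBS[0]}],
)

# Each remaining job fires after the previous job succeeds.
for previous, current in zip(JOBS, JOBS[1:]):
    glue.create_trigger(
        Name=f"after-{previous}-run-{current}",
        WorkflowName=WORKFLOW,
        Type="CONDITIONAL",
        StartOnCreation=True,
        Predicate={"Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": previous, "State": "SUCCEEDED"}
        ]},
        Actions=[{"JobName": current}],
    )
```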
PART 3: Data Visualization
Grafana is an open-source visualization tool that integrates easily with AWS. By configuring Grafana's Athena data source and connecting it to our account, we can visualize the data in the tables backed by S3.
The use case I built in this project is checking the availability of taxis in Singapore during business hours, 8:00 A.M. to 8:00 P.M. (Singapore local time).
External Source - Singapore Gov. Real-time API - Taxi Availability - https://beta.data.gov.sg/datasets/d_e25662f1a062dd046453926aa284ba64/view
Data Analysed - The taxis available (in hundreds) in Singapore on May 30th, 2024 between 8:00 A.M. and 8:00 P.M.
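As an example of the kind of query behind the Grafana panel, here is a hedged sketch that buckets available taxis per hour inside the business-hours window. The table and column names are hypothetical placeholders for the schema the pipeline actually produces.

```python
# Hypothetical panel query: table and column names are placeholders. The same
# SQL can be pasted into the Grafana Athena data source, or run through the
# run_athena helper sketched earlier.
TAXIS_PER_HOUR_SQL = """
SELECT date_trunc('hour', ingest_timestamp) AS hour_bucket,
       avg(taxi_count)                      AS avg_taxis_available
FROM taxi_availability_parquet
WHERE ingest_date = DATE '2024-05-30'
  AND hour(ingest_timestamp) BETWEEN 8 AND 19   -- 8:00 A.M. to 8:00 P.M.
GROUP BY 1
ORDER BY 1
"""
```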
Resource - Maven course by David Freitag - https://maven.com/david-freitag/first-serverless-de-project