Event processing of data streams: optimizing SQS processing and efficient end-user querying
Outline:
There are several systems in place that produce daily events. We produce more than 20,000 events each day on average (560,000 events per month), which must be processed, analyzed, and displayed quickly. Additionally, we have downstream applications that need to subscribe to this data to receive events almost immediately. Since we wanted a reliable solution, we chose the popular fan-out architecture using SNS (Simple Notification Service), SQS (Simple Queue Service), and AWS Lambda. By using the strategies listed below, we reduced spend on SQS, Lambda, and the data lake, saving almost 40% on cost. We are processing 180 GB of data each week and more than 1.2M objects every month.
Authors:
Soumil Nitin Shah
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as the Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing the ingestion framework and creating microservices and scalable architecture on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries by creating partitions and using the right file formats and compression. I have also developed and worked on a streaming application for ingesting real-time stream data via Kinesis and Firehose into Elasticsearch.
What are real-time data streams?
Real-time data is data that is available as soon as it is created and acquired. Rather than being stored, data is forwarded to users as soon as it is collected and is immediately available, without any lag, which is crucial for supporting live, in-the-moment decision-making and powering downstream consumers.
Architecture
Figure 1: Architecture diagram
Producers publish events to a FIFO SNS topic as shown in the diagram. The messages are then broadcast to consumers that subscribe to this topic and are ready to receive events. Each event lands in an Amazon Simple Queue Service (SQS) queue, which acts as back pressure. Lambdas consume the messages in batches and, after transforming the data, insert it into the data lake. A Glue crawler identifies and stores the schemas in the Glue Data Catalog. ETL jobs run, process and partition the data, and write it into the transform folder, where analysts can query the curated data using AWS Athena and build reports and dashboards using AWS QuickSight.
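As a rough illustration of the producer side, the sketch below publishes a single event to a FIFO SNS topic with boto3. The topic ARN, message group, and payload are hypothetical placeholders, not our production code.

```python
# Minimal sketch: publishing an event to a FIFO SNS topic with boto3.
# The topic ARN, group ID, and payload below are hypothetical placeholders.
import json
import uuid

import boto3

sns = boto3.client("sns", region_name="us-east-1")

def publish_event(event: dict, source_system: str) -> None:
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:events-topic.fifo",  # hypothetical ARN
        Message=json.dumps(event),
        MessageGroupId=source_system,             # preserves ordering per producer
        MessageDeduplicationId=str(uuid.uuid4()),  # or enable content-based deduplication
    )

publish_event({"event_type": "job_posted", "job_id": "123"}, source_system="producer-a")
```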
Once messages land in SQS, we follow industry-standard best practices to make sure the queue has the right configuration and settings. We opted for long polling as it helps us save on costs, and we use a message retention period of 4 days for our applications. (Reducing Amazon SQS Costs) explains the advantages of long polling over short polling.
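As an assumption of how such settings can be applied with boto3, the sketch below enables long polling (20-second receive wait) and the 4-day retention period on a queue; the queue URL is a hypothetical placeholder.

```python
# Minimal sketch of the queue settings described above, applied with boto3.
# The queue URL is a hypothetical placeholder.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/events-queue.fifo",
    Attributes={
        "ReceiveMessageWaitTimeSeconds": "20",             # long polling (maximum wait is 20 seconds)
        "MessageRetentionPeriod": str(4 * 24 * 60 * 60),   # 4-day retention, in seconds
    },
)
```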
Figure 2: Total SQS messages received in 5-minute intervals
AWS Lambda Optimization
We have applied some industry-leading best practices on AWS Lambda. When creating serverless functions, developers often either under-allocate or over-allocate resources. We wanted to find the sweet spot between cost and performance, so we used a tool called AWS Lambda Power Tuning (Profiling Functions with AWS Lambda Power Tuning), which discusses best practices and the benefits of power tuning.
Figure 3: AWS Step Functions state machine used to power tune Lambda functions
Figure 4: AWS Lambda Power Tuning results
This basic exercise let us determine how many resources we needed to allocate, and completing it saved the firm thousands of dollars. We can also leverage AWS Compute Optimizer to automate a cost-performance analysis for all the Lambda functions in an AWS account. This service evaluates functions that have run at least 50 times over the previous 14 days and provides automatic recommendations for memory allocation. You can opt in from the console to use this free recommendation engine. We further saved up to 34% on Lambda invocation cost by choosing the AWS Graviton2 processor.
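As a minimal sketch, the snippet below applies a memory size suggested by a power tuning run using boto3. The function name and memory value are hypothetical, and the arm64 (Graviton2) architecture itself is selected when the function is created or its code is deployed, not through this call.

```python
# Minimal sketch: applying the memory size suggested by Lambda Power Tuning.
# The function name and memory value are hypothetical placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

lambda_client.update_function_configuration(
    FunctionName="events-consumer",  # hypothetical function name
    MemorySize=512,                  # value suggested by the power tuning run
)
```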
Figure: Total Lambda invocations over 1 week at 6-hour intervals
There are two main types of concurrency:
Reserved concurrency – Reserved concurrency guarantees the maximum number of concurrent instances for the function. When a function has reserved concurrency, no other function can use that concurrency. There is no charge for configuring reserved concurrency for a function.
Provisioned concurrency – Provisioned concurrency initializes a requested number of execution environments so that they are prepared to respond immediately to your function's invocations. Note that configuring provisioned concurrency incurs charges to your AWS account.
We soon started seeing AWS Lambda being throttled during burst traffic loads (Lambda function scaling), so we used reserved concurrency. We reserved 500 concurrent executions to prevent our function from being throttled. If a Lambda invocation fails, we retry it up to 3 times and then move the message into a dead-letter queue so that we can later redrive it back to the source queue.
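A minimal sketch of both settings follows, assuming hypothetical queue and function names: 500 reserved concurrent executions on the consumer Lambda, and an SQS redrive policy that moves a message to the dead-letter queue after 3 failed receives.

```python
# Minimal sketch of the settings described above. All names and ARNs are
# hypothetical placeholders.
import json

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# Guarantee (and cap) concurrency for the consumer function.
lambda_client.put_function_concurrency(
    FunctionName="events-consumer",
    ReservedConcurrentExecutions=500,
)

# After 3 receives without a successful delete, SQS moves the message to the DLQ.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/events-queue.fifo",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:events-dlq.fifo",
            "maxReceiveCount": "3",
        })
    },
)
```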
AWS S3 Best Practices:
AWS Lambdas are responsible for reading the data in batches from the SQS queue and writing it into the data lake. We have partitioned the data lake by date and enabled server-side encryption. We have enabled bucket replication, so data is automatically copied to a bucket in a different region to protect against incidents where a bucket is wiped out. We have also enabled versioning on the buckets and added object locks to make sure no one deletes objects accidentally. We set up lifecycle policies that further reduce storage cost by eventually moving the data to Glacier.
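The sketch below illustrates two of these settings with boto3: versioning plus a lifecycle rule that transitions objects to Glacier. The bucket name, prefix, and 90-day transition age are hypothetical placeholders rather than our actual policy.

```python
# Minimal sketch of bucket versioning and a Glacier lifecycle transition.
# Bucket name, prefix, and transition age are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="datalake-raw-events",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_lifecycle_configuration(
    Bucket="datalake-raw-events",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-events",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```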
Cost Savings on Athena: Best Practices
One should always partition data to reduce the amount of data scanned; in (Top 10 Performance Tuning Tips for Amazon Athena), the author mentions that date-based partitioning is one of the best choices. We did a lot of reading on which file format to use, as there are various formats such as Parquet and Avro. Apache Parquet and Apache ORC are popular columnar data stores; they store data efficiently by employing column-wise compression.
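As one hedged example of combining both ideas, the snippet below writes a dataset as date-partitioned Parquet with Snappy compression using pandas and pyarrow. The paths and column names are hypothetical, and the pyarrow and s3fs packages are assumed to be installed.

```python
# Minimal sketch: writing date-partitioned Parquet with Snappy compression.
# Paths and column names are hypothetical placeholders; requires pyarrow and s3fs.
import pandas as pd

# Read a batch of raw JSON events (one JSON object per line).
df = pd.read_json("s3://datalake-raw-events/raw/2022-08-09/events.json", lines=True)
df["event_date"] = pd.to_datetime(df["created_at"]).dt.date.astype(str)

# Write a partitioned Parquet dataset that Athena can prune by event_date.
df.to_parquet(
    "s3://datalake-transform-events/events/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["event_date"],
)
```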
Figure: 20X faster query speed by using Parquet files with Snappy compression
Figure: Data volume reduced by 4X by using the appropriate file format
The graph shows that we were able to reduce the data size from 13 GB to 3 GB, a significant reduction.
Figure: Cost optimization
As we have learned, Parquet is an open-source file format built for flat columnar storage. Parquet handles complex data in large volumes well and is known both for performant data compression and for its ability to handle a wide variety of encoding types. Parquet uses Google's record-shredding and assembly algorithm, which can address complex data structures within data storage. Parquet benefits include fast queries that can fetch specific column values without reading full row data, highly efficient column-wise compression, and high compatibility with OLAP.
While CSV is simple and the most widely used data format, Parquet has several distinct advantages. Parquet is column-oriented and CSV is row-oriented; row-oriented formats are optimized for OLTP workloads, while column-oriented formats are better suited for analytical workloads. Column-oriented query engines such as AWS Redshift Spectrum bill by the amount of data scanned per query. Parquet has helped its users reduce storage requirements by at least one-third on large datasets; in addition, it greatly improves scan and deserialization time, and hence the overall cost.
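The short sketch below illustrates why column orientation matters for analytics: only the columns a query needs are read from the Parquet dataset instead of scanning full rows. The path and column names are hypothetical placeholders.

```python
# Minimal sketch: column pruning on a Parquet dataset with pandas.
# Only the requested columns are read from storage, not full rows.
import pandas as pd

daily_counts = (
    pd.read_parquet(
        "s3://datalake-transform-events/events/",   # hypothetical curated dataset
        columns=["event_date", "event_type"],       # only these columns are loaded
    )
    .groupby(["event_date", "event_type"])
    .size()
)
print(daily_counts.head())
```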
Figure: 184 GB of data processed every 14 days with this architecture
Figure: 1.27M objects processed every month
References
Hazelcast. "What Is Real-Time Stream Processing?" Accessed August 9, 2022. https://hazelcast.com/glossary/real-time-stream-processing/.
Amazon. "Reducing Amazon SQS Costs." Accessed August 9, 2022. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/reducing-costs.html.
Amazon. "Profiling Functions with AWS Lambda Power Tuning." Accessed August 9, 2022. https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html.
Amazon. "Amazon Athena." Accessed April 11, 2022. https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc.
Hocanin, Mert. "Top 10 Performance Tuning Tips for Amazon Athena." Amazon. Accessed April 11, 2022. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/.
Logvinskiy, Valentin. "Which Columnar File Format to Select for Athena / BigQuery / Synapse Analytics." LinkedIn. Accessed April 11, 2022. https://www.dhirubhai.net/pulse/which-columnar-file-format-select-athena-bigquery-valentin-logvinskiy/.
Snowflake. "What Is Parquet?" Accessed April 11, 2022. https://www.snowflake.com/guides/what-parquet#:~:text=Parquet%20is%20an%20open%20source,wide%20variety%20of%20encoding%20types.