Event processing of data streams: optimizing SQS processing and efficient end-user querying
Outline:
There are several systems in place that produce daily events. We produce more than 20,000 events each day on average (560,000 events per month), which must be processed, analyzed, and displayed quickly. Additionally, we have downstream applications that need to subscribe to this data to receive events almost immediately. Since we wanted a reliable solution, we chose the popular fan-out architecture using SNS (Simple Notification Service), SQS (Simple Queue Service), and AWS Lambda. By using the strategies listed below, we reduced spend on SQS, Lambda, and the data lake, saving almost 40% on cost. We are processing 180 GB of data each week and more than 1.2M objects every month.
Authors:
Soumil Nitin Shah
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as the Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing the ingestion framework and creating microservices and scalable architecture on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries by creating partitions and using the right file formats and compression. I have also developed and worked on a streaming application for ingesting real-time stream data via Kinesis and Firehose into Elasticsearch.
What are real-time data streams?
Real-time data is data that is available as soon as it is created and acquired. Rather than being stored, data is forwarded to users as soon as it is collected and is immediately available, without any lag, which is crucial for supporting live, in-the-moment decision-making and powering downstream consumers.
Architecture
Figure 1: Architecture diagram
Producers publish events to a FIFO SNS topic as shown in the diagram. The messages are then broadcast to consumers that subscribe to this topic and are ready to receive events. Each event lands in an Amazon Simple Queue Service (SQS) queue, which acts as back pressure. Lambdas consume the messages in batches and, after transforming the data, insert it into the data lake. A Glue crawler identifies and stores the schemas in the Glue Data Catalog. ETL jobs run, process and partition the data, and write it into the transform folder, where analysts can query the curated data using AWS Athena and build reports and dashboards using AWS QuickSight.
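As a rough illustration of the producer side, the sketch below publishes a single event to a FIFO SNS topic with boto3. The topic ARN, message group, and payload are hypothetical placeholders, not our production code.

```python
# Minimal sketch: publishing an event to a FIFO SNS topic with boto3.
# The topic ARN, group ID, and payload below are hypothetical placeholders.
import json
import uuid

import boto3

sns = boto3.client("sns", region_name="us-east-1")

def publish_event(event: dict, source_system: str) -> None:
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:events-topic.fifo",  # hypothetical ARN
        Message=json.dumps(event),
        MessageGroupId=source_system,             # preserves ordering per producer
        MessageDeduplicationId=str(uuid.uuid4()),  # or enable content-based deduplication
    )

publish_event({"event_type": "job_posted", "job_id": "123"}, source_system="producer-a")
```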
Once messages land in SQS, we follow industry-standard best practices to make sure the queue has the right configuration and settings. We opted for long polling as it helps us save on costs, and we use a message retention period of 4 days for our applications. (Reducing Amazon SQS Costs) explains the advantages of long polling over short polling.
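As an assumption of how such settings can be applied with boto3, the sketch below enables long polling (20-second receive wait) and the 4-day retention period on a queue; the queue URL is a hypothetical placeholder.

```python
# Minimal sketch of the queue settings described above, applied with boto3.
# The queue URL is a hypothetical placeholder.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/events-queue.fifo",
    Attributes={
        "ReceiveMessageWaitTimeSeconds": "20",             # long polling (maximum wait is 20 seconds)
        "MessageRetentionPeriod": str(4 * 24 * 60 * 60),   # 4-day retention, in seconds
    },
)
```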
Figure 2: Total SQS messages received in 5-minute intervals
AWS Lambda Optimization
We have applied some industry-leading best practices on AWS Lambda. When creating serverless functions, developers often either under-allocate or over-allocate resources. We wanted to find the sweet spot between cost and performance, so we used a tool called AWS Lambda Power Tuning (Profiling Functions with AWS Lambda Power Tuning), which discusses best practices and the benefits of power tuning.
Figure 3: AWS Step Functions state machine used to power tune Lambda functions
Figure 4: AWS Lambda Power Tuning results
This basic exercise let us determine how many resources we needed to allocate, and completing it saved the firm thousands of dollars. We can also leverage AWS Compute Optimizer to automate a cost-performance analysis for all the Lambda functions in an AWS account. This service evaluates functions that have run at least 50 times over the previous 14 days and provides automatic recommendations for memory allocation. You can opt in from the console to use this free recommendation engine. We further saved up to 34% on Lambda invocation cost by choosing the AWS Graviton2 processor.
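As a minimal sketch, the snippet below applies a memory size suggested by a power tuning run using boto3. The function name and memory value are hypothetical, and the arm64 (Graviton2) architecture itself is selected when the function is created or its code is deployed, not through this call.

```python
# Minimal sketch: applying the memory size suggested by Lambda Power Tuning.
# The function name and memory value are hypothetical placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

lambda_client.update_function_configuration(
    FunctionName="events-consumer",  # hypothetical function name
    MemorySize=512,                  # value suggested by the power tuning run
)
```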
Figure: Total Lambda invocations over 1 week at 6-hour intervals
There are two main types of concurrency:
Reserved concurrency – Reserved concurrency guarantees the maximum number of concurrent instances for the function. When a function has reserved concurrency, no other function can use that concurrency. There is no charge for configuring reserved concurrency for a function.
Provisioned concurrency – Provisioned concurrency initializes a requested number of execution environments so that they are prepared to respond immediately to your function's invocations. Note that configuring provisioned concurrency incurs charges to your AWS account.
We soon started seeing AWS Lambda being throttled during burst traffic loads (Lambda function scaling), so we used reserved concurrency. We reserved 500 concurrent executions to prevent our function from being throttled. If a Lambda invocation fails, we retry it up to 3 times and then move the message into a dead-letter queue so that we can later redrive it back to the source queue.
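A minimal sketch of both settings follows, assuming hypothetical queue and function names: 500 reserved concurrent executions on the consumer Lambda, and an SQS redrive policy that moves a message to the dead-letter queue after 3 failed receives.

```python
# Minimal sketch of the settings described above. All names and ARNs are
# hypothetical placeholders.
import json

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# Guarantee (and cap) concurrency for the consumer function.
lambda_client.put_function_concurrency(
    FunctionName="events-consumer",
    ReservedConcurrentExecutions=500,
)

# After 3 receives without a successful delete, SQS moves the message to the DLQ.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/events-queue.fifo",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:events-dlq.fifo",
            "maxReceiveCount": "3",
        })
    },
)
```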
AWS S3 Best Practices:
AWS Lambdas are responsible for reading the data in batches from the SQS queue and writing it into the data lake. We have partitioned the data lake by date and enabled server-side encryption. We have enabled bucket replication, so data is automatically copied to a bucket in a different region to protect against incidents where a bucket is wiped out. We have also enabled versioning on the buckets and added object locks to make sure no one deletes objects accidentally. We set up lifecycle policies that further reduce storage cost by eventually moving the data to Glacier.
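The sketch below illustrates two of these settings with boto3: versioning plus a lifecycle rule that transitions objects to Glacier. The bucket name, prefix, and 90-day transition age are hypothetical placeholders rather than our actual policy.

```python
# Minimal sketch of bucket versioning and a Glacier lifecycle transition.
# Bucket name, prefix, and transition age are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="datalake-raw-events",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_lifecycle_configuration(
    Bucket="datalake-raw-events",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-events",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```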
Cost Savings on Athena: Best Practices
One should always partition data to reduce the amount of data scanned; in (Top 10 Performance Tuning Tips for Amazon Athena), the author mentions that date-based partitioning is one of the best choices. We did a lot of reading on which file format to use, as there are various formats such as Parquet and Avro. Apache Parquet and Apache ORC are popular columnar data stores; they store data efficiently by employing column-wise compression.
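As one hedged example of combining both ideas, the snippet below writes a dataset as date-partitioned Parquet with Snappy compression using pandas and pyarrow. The paths and column names are hypothetical, and the pyarrow and s3fs packages are assumed to be installed.

```python
# Minimal sketch: writing date-partitioned Parquet with Snappy compression.
# Paths and column names are hypothetical placeholders; requires pyarrow and s3fs.
import pandas as pd

# Read a batch of raw JSON events (one JSON object per line).
df = pd.read_json("s3://datalake-raw-events/raw/2022-08-09/events.json", lines=True)
df["event_date"] = pd.to_datetime(df["created_at"]).dt.date.astype(str)

# Write a partitioned Parquet dataset that Athena can prune by event_date.
df.to_parquet(
    "s3://datalake-transform-events/events/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["event_date"],
)
```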
Figure: 20X faster query speed by using Parquet files with Snappy compression
Figure: Data volume reduced by 4X by using the appropriate file format
The graph shows that we were able to reduce the data size from 13 GB to 3 GB, a significant reduction.
Figure: Cost optimization
As we have learned, Parquet is an open-source file format built for flat columnar storage. Parquet handles complex data in large volumes well and is known both for performant data compression and for its ability to handle a wide variety of encoding types. Parquet uses Google's record-shredding and assembly algorithm, which can address complex data structures within data storage. Parquet benefits include fast queries that can fetch specific column values without reading full row data, highly efficient column-wise compression, and high compatibility with OLAP.
While CSV is simple and the most widely used data format, Parquet has several distinct advantages. Parquet is column-oriented and CSV is row-oriented; row-oriented formats are optimized for OLTP workloads, while column-oriented formats are better suited for analytical workloads. Column-oriented query engines such as AWS Redshift Spectrum bill by the amount of data scanned per query. Parquet has helped its users reduce storage requirements by at least one-third on large datasets; in addition, it greatly improves scan and deserialization time, and hence the overall cost.
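The short sketch below illustrates why column orientation matters for analytics: only the columns a query needs are read from the Parquet dataset instead of scanning full rows. The path and column names are hypothetical placeholders.

```python
# Minimal sketch: column pruning on a Parquet dataset with pandas.
# Only the requested columns are read from storage, not full rows.
import pandas as pd

daily_counts = (
    pd.read_parquet(
        "s3://datalake-transform-events/events/",   # hypothetical curated dataset
        columns=["event_date", "event_type"],       # only these columns are loaded
    )
    .groupby(["event_date", "event_type"])
    .size()
)
print(daily_counts.head())
```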
Figure: 184 GB of data processed every 14 days with this architecture
Figure: 1.27M objects processed every month
References
Hazelcast. "What Is Real-Time Stream Processing?" Accessed August 9, 2022. https://hazelcast.com/glossary/real-time-stream-processing/.
Amazon. "Reducing Amazon SQS Costs." Accessed August 9, 2022. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/reducing-costs.html.
Amazon. "Profiling Functions with AWS Lambda Power Tuning." Accessed August 9, 2022. https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html.
Amazon. "Amazon Athena." Accessed April 11, 2022. https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc.
Hocanin, Mert. "Top 10 Performance Tuning Tips for Amazon Athena." Amazon. Accessed April 11, 2022. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/.
Logvinskiy, Valentin. "Which Columnar File Format to Select for Athena / BigQuery / Synapse Analytics." LinkedIn. Accessed April 11, 2022. https://www.dhirubhai.net/pulse/which-columnar-file-format-select-athena-bigquery-valentin-logvinskiy/.
Snowflake. "What Is Parquet?" Accessed April 11, 2022. https://www.snowflake.com/guides/what-parquet#:~:text=Parquet%20is%20an%20open%20source,wide%20variety%20of%20encoding%20types.