Robust Architecture to Populate Data from MongoDB in Real Time Using Mongo Streams, EventBridge, SQS Queues, and Lambdas (Processing 20k Events Per Day)

Soumil Nitin Shah (Data collection and Processing Team lead)

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a data collection and processing specialist at JobTarget, where I spend most of my time developing the ingestion framework and building microservices and scalable architecture on AWS.

Hari Om Dubey (Consultant Software Engineer, Python developer)

I have completed a Master's in Computer Application and I have 5 years of experience developing software applications using Python and the Django framework. I love to code in Python, and creating a solution to a problem through code excites me. I have been working at JobTarget for the past 2 months as a Software Engineer on the Data Team.

April Love Ituhat (Software Engineer, Python)

I have a bachelor's degree in Computer Engineering and have spent the last three years working on research and development tasks involving diverse domains such as AWS, machine learning, robot simulations, and IoT. I've been a part of the JobTarget data team since November 2021, and I usually work with Python and AWS. It's exciting for me to see the applications come to fruition.

Himadri Chobisa (Jr Data Engineer)

I recently graduated from UConn with a master’s in Business Analytics and Project Management with a Business Data Science concentration. I joined JobTarget as an intern in April last year and have been with the data team since then. I am a data enthusiast with experience working in SQL, Python, PowerBi, Tableau, and Machine Learning. In my free time, I enjoy dancing and cooking.

Birendra Singh (Python Developer | Search Engineer | Data Specialist)

Hi, I am Birendra Singh. I completed a Bachelor's in Electronic Engineering, and I love working with Elasticsearch. I have 6 years of professional experience across the software development lifecycle, spanning multiple sectors including telecommunications, geospatial, and oil and gas, with a focus on quality and on-time delivery.

Special Thanks to our mentors:

Paul Allan (Senior Director of Data and Business Analytics at JobTarget)

Donald Sipe (Solutions Architect at JobTarget)

Project Overview:

We had data scattered across various places such as MongoDB, SQL Server, and DynamoDB, and it was not easy to access all of it in one place. We set out to solve this by bringing all the data into one place (a data lake) so the business can analyze it easily. In this article, I will discuss how we used Mongo Streams to populate the lake, creating date partitions to speed up query execution and make the data easily accessible to the business.

Fun Facts :

We deal with large datasets, and to ensure that data is available to the business we relocate and migrate it. The team has created an internal batch framework that processes roughly 1 TB of data; for more information, see the article below.

https://www.dhirubhai.net/pulse/batch-frameworkan-internal-data-ingestion-framework-process-shah/


Figure 1: Shows just how much data we process in a matter of months

Architecture:

Figure: Architecture diagram

Explanation:

What are MongoDB Change Streams?

Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will. [1]
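
To make this concrete, below is a minimal sketch (not our production code) of subscribing to a change stream with PyMongo. The connection string, database, and collection names are placeholders, and the pipeline simply filters for inserts.

# Minimal change-stream consumer sketch using PyMongo.
# Change streams require a replica set or Atlas cluster (they rely on the oplog).
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.example.net")  # placeholder URI
collection = client["jobs_db"]["jobs"]  # hypothetical database/collection names

# Watch only inserts; change streams accept an aggregation pipeline as a filter.
pipeline = [{"$match": {"operationType": "insert"}}]

with collection.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        # 'fullDocument' holds the newly inserted document.
        print(change["operationType"], change["fullDocument"]["_id"])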


When a new item is inserted into MongoDB, a change stream picks it up in near real time and publishes the event to the event bus. We have rules that publish only certain messages to SQS; any document that does not match the filtering rule is discarded.

We opted for EventBridge because it is a serverless event bus that makes it easier to build event-driven applications at scale using events generated from your applications, integrated Software-as-a-Service (SaaS) applications, and AWS services. EventBridge delivers a stream of real-time data from event sources such as Zendesk or Shopify to targets like AWS Lambda and other SaaS applications. You can set up routing rules to determine where to send your data, building application architectures that react in real time to your data sources, with event publishers and consumers completely decoupled. [2]
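
As an illustration of the routing rules mentioned above, here is a hedged sketch of creating an EventBridge rule with an event pattern and attaching an SQS queue as its target using boto3. The bus name, pattern fields, and ARNs are placeholders, not our actual configuration.

# Sketch: route only matching MongoDB change events to an SQS queue.
import json
import boto3

events = boto3.client("events")

# Hypothetical event pattern: keep only insert events from a specific collection.
event_pattern = {
    "detail": {
        "operationType": ["insert"],
        "ns": {"coll": ["jobs"]},
    }
}

events.put_rule(
    Name="mongo-inserts-to-sqs",            # placeholder rule name
    EventBusName="mongo-change-events",     # placeholder event bus
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# Note: the SQS queue policy must also allow events.amazonaws.com to send messages.
events.put_targets(
    Rule="mongo-inserts-to-sqs",
    EventBusName="mongo-change-events",
    Targets=[{"Id": "sqs-target", "Arn": "arn:aws:sqs:us-east-1:123456789012:mongo-events"}],
)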

Once the events land in SQS, we follow some industry-standard best practices to make sure we have the right configurations and settings. We opted for a long polling mechanism, as it helps us save on costs, and we use a message retention period of 4 days for our applications. (Reducing Amazon SQS costs) [3] explains the advantages of long polling over short polling.
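
The sketch below shows, under an assumed queue name, how such a queue could be configured with boto3: a 20-second receive wait time enables long polling, and the retention period is expressed as 4 days in seconds.

# Sketch: create an SQS queue with long polling and a 4-day retention period.
import boto3

sqs = boto3.client("sqs")

response = sqs.create_queue(
    QueueName="mongo-change-events",  # placeholder queue name
    Attributes={
        "ReceiveMessageWaitTimeSeconds": "20",             # long polling (maximum is 20 seconds)
        "MessageRetentionPeriod": str(4 * 24 * 60 * 60),   # 4 days
    },
)
print(response["QueueUrl"])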

SQS Metrics:


Figure: SQS received messages in one day

As you can see, we process a large number of messages every day.

We have also tried to apply some industry-leading best practices to AWS Lambda. When creating serverless functions, developers often either under-allocate or over-allocate resources. We wanted the sweet spot between cost and performance, so we used a tool called AWS Lambda Power Tuning (Profiling functions with AWS Lambda Power Tuning) [4], where the author discusses the best practices and benefits of power tuning.
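
AWS Lambda Power Tuning is deployed as a Step Functions state machine. As a rough sketch (the state machine ARN, function ARN, and input values below are placeholders, and the input format should be checked against the tool's documentation), an execution can be started like this:

# Sketch: start an AWS Lambda Power Tuning execution via Step Functions.
import json
import boto3

sfn = boto3.client("stepfunctions")

tuning_input = {
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:ingest-handler",  # placeholder
    "powerValues": [128, 256, 512, 1024, 1536, 3008],  # memory sizes to test
    "num": 50,                  # invocations per memory configuration
    "payload": {},              # sample event for the function under test
    "parallelInvocation": False,
}

sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine",  # placeholder
    input=json.dumps(tuning_input),
)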


Figure: Shows the Step Functions workflow for power tuning Lambdas at different memory sizes


Figure: Shows the results of power tuning

By doing this basic exercise, we were able to determine how many resources we needed to allocate, and we saved the firm thousands of dollars.

I have a video where I show how to tune your AWS Lambda functions.

We can also leverage AWS Compute Optimizer to automate a cost-performance analysis for all the Lambda functions in an AWS account. This service evaluates functions that have run at least 50 times over the previous 14 days and provides automatic recommendations for memory allocation. You can opt in from the Compute Optimizer console to use this free recommendation engine. [4]

We soon started seeing our Lambda functions being throttled during burst traffic loads. (Lambda function scaling) explains the need to provision and reserve concurrency. We reserved a concurrency of 100 to prevent our functions from being throttled.
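
As a hedged sketch of what that setting looks like (the function name is a placeholder), reserved concurrency can be applied with a single boto3 call:

# Sketch: reserve 100 concurrent executions for a Lambda function.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_concurrency(
    FunctionName="ingest-handler",       # placeholder function name
    ReservedConcurrentExecutions=100,    # matches the limit described above
)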

Metrics on Lambdas:


The figure shows the number of Lambda invocations over one week.

If you want to learn more about AWS Lambda concurrency, the article at [5] describes concurrency in detail, and I also have a video on Lambda concurrency which can be found here.

There are two main types of concurrency:

  • Reserved concurrency – Reserved concurrency guarantees the maximum number of concurrent instances for the function. When a function has reserved concurrency, no other function can use that concurrency. There is no charge for configuring reserved concurrency for a function.
  • Provisioned concurrency – Provisioned concurrency initializes a requested number of execution environments so that they are prepared to respond immediately to your function's invocations. Note that configuring provisioned concurrency incurs charges to your AWS account. (A minimal sketch follows this list.)
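
For completeness, here is a hedged sketch of configuring provisioned concurrency; we rely on reserved concurrency in this pipeline, so the function name, alias, and value below are purely illustrative.

# Sketch: configure provisioned concurrency on a published alias of a function.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_provisioned_concurrency_config(
    FunctionName="ingest-handler",        # placeholder function name
    Qualifier="live",                     # provisioned concurrency requires an alias or version
    ProvisionedConcurrentExecutions=10,   # illustrative value, not our setting
)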

AWS S3:

Our Lambda functions are responsible for reading the data in batches from the SQS queue and writing it into the data lake. We have partitioned the data lake by date and enabled server-side encryption. We have enabled bucket replication so that data is automatically copied to a bucket in a different region, protecting us if a bucket is ever wiped out. We have also enabled versioning on the buckets and added object locks to make sure no one deletes objects accidentally.
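
To illustrate the date-partitioned layout, here is a minimal, hypothetical Lambda handler (the bucket name, prefix, and key scheme are assumptions, not our actual code) that consumes an SQS batch event and writes each record to S3 under a year/month/day prefix:

# Sketch: SQS-triggered Lambda that writes records to a date-partitioned S3 prefix.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # placeholder bucket name


def handler(event, context):
    now = datetime.now(timezone.utc)
    prefix = f"mongo/jobs/year={now.year}/month={now.month:02d}/day={now.day:02d}"

    for record in event["Records"]:         # SQS batch delivered by the event source mapping
        body = json.loads(record["body"])   # the change event forwarded by EventBridge
        key = f"{prefix}/{record['messageId']}.json"
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=json.dumps(body).encode("utf-8"),
            ServerSideEncryption="AES256",  # server-side encryption, as described above
        )
    return {"written": len(event["Records"])}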

AWS Glue Crawlers

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months. We run Glue crawlers on the AWS S3 buckets, which populate the metadata catalog (the Glue Data Catalog), and we use a tool called Athena, which is built on Presto, to query the data lake.
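
Crawlers can run on a schedule or be kicked off programmatically. As a small sketch (the crawler name is a placeholder), a run can be triggered with boto3:

# Sketch: trigger a Glue crawler run over the data-lake prefix.
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="data-lake-jobs-crawler")  # placeholder crawler name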

What is Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

Athena is out-of-the-box integrated with the AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas, populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.
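
Because the lake is partitioned by date, queries that filter on the partition columns scan far less data. Here is a hedged sketch of submitting such a query with boto3; the database, table, column names, and results bucket are assumptions.

# Sketch: submit a partition-pruned Athena query.
import boto3

athena = boto3.client("athena")

query = """
    SELECT count(*) AS events
    FROM jobs_events            -- hypothetical table created by the Glue crawler
    WHERE year = '2022' AND month = '03' AND day = '15'   -- partition pruning
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "data_lake"},                          # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
)
print("Query execution id:", execution["QueryExecutionId"])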

We always want to follow best practices, so we started reading more about what people in the industry recommend (Improve Performance with Amazon Athena's Latest Updates - AWS Online Tech Talks), where the author covers several best practices.

I would like to share some screenshots of what the author explains in the video.

Figures: Best-practice slides from the talk (Reference: AWS Online Tech Talks)

Lastly, we wanted to follow good tagging practices to make sure we tag our resources properly. Karl explains some good practices in his blog. [6]

We also wanted to dive deeper into best practices for AWS SQS (AWS re:Invent 2019: Scalable serverless event-driven applications using Amazon SQS & Lambda (API304))

Figure: Best practices for serverless applications

Reference: (AWS re:Invent 2019: Scalable serverless event-driven applications using Amazon SQS & Lambda (API304))

Bringing data from other sources: we opted for a hybrid solution to scale on demand

Figure: Hybrid architecture for ingesting data from other sources

We bring data from other sources into the data lake using the internal batch framework, which can run 1,000+ jobs and can easily be scaled up or down based on need. The framework publishes the data to an SQS queue, and we fire Lambdas that transform the data and write it into the data lake.
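
On the publishing side, here is a hedged sketch (the queue URL and payload shape are placeholders) of how a batch job could push records onto the queue in groups of ten:

# Sketch: a batch job publishing records to the ingestion queue in groups of 10.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingestion-queue"  # placeholder


def publish(records):
    # SQS allows at most 10 messages per SendMessageBatch call.
    for i in range(0, len(records), 10):
        entries = [
            {"Id": str(idx), "MessageBody": json.dumps(rec)}
            for idx, rec in enumerate(records[i:i + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)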

If you want to read more about the internal batch framework: https://www.dhirubhai.net/pulse/batch-frameworkan-internal-data-ingestion-framework-process-shah/

Amazon SQS supports dead-letter queues (DLQ), which other queues (source queues) can target for messages that can't be processed (consumed) successfully. Dead-letter queues are useful for debugging your application or messaging system because they let you isolate unconsumed messages to determine why their processing doesn't succeed. For information about creating a queue and configuring a dead-letter queue for it using the Amazon SQS console, see Configuring a dead-letter queue (console). Once you have debugged the consumer application, or the consumer application is available to consume the message, you can use the dead-letter queue redrive capability to move the messages back to the source queue with just a click of a button on the Amazon SQS console. [7]

How do dead-letter queues work?

Sometimes, messages can't be processed because of a variety of possible issues, such as erroneous conditions within the producer or consumer application or an unexpected state change that causes an issue with your application code. For example, if a user places a web order with a particular product ID, but the product ID is deleted, the web store's code fails and displays an error, and the message with the order request is sent to a dead-letter queue.

Occasionally, producers and consumers might fail to interpret aspects of the protocol that they use to communicate, causing message corruption or loss. Also, the consumer's hardware errors might corrupt the message payload.(AWS)

The redrive policy specifies the source queue, the dead-letter queue, and the conditions under which Amazon SQS moves messages from the former to the latter if the consumer of the source queue fails to process a message a specified number of times. The maxReceiveCount is the number of times a consumer tries receiving a message from a queue without deleting it before being moved to the dead-letter queue. Setting the maxReceiveCount to a low value such as 1 would result in any failure to receive a message causing the message to be moved to the dead-letter queue. Such failures include network errors and client dependency errors. To ensure that your system is resilient against errors, set the maxReceiveCount high enough to allow for sufficient retries. (AWS)
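
As a hedged illustration (the queue URL, DLQ ARN, and maxReceiveCount value are placeholders), a redrive policy is attached to the source queue as a JSON attribute:

# Sketch: attach a dead-letter queue to a source queue with maxReceiveCount = 5.
import json
import boto3

sqs = boto3.client("sqs")

redrive_policy = {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:ingestion-dlq",  # placeholder DLQ
    "maxReceiveCount": "5",  # retries before a message is moved to the DLQ
}

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/ingestion-queue",  # placeholder
    Attributes={"RedrivePolicy": json.dumps(redrive_policy)},
)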

The redrive allow policy specifies which source queues can access the dead-letter queue. This policy applies to a potential dead-letter queue. You can choose whether to allow all source queues, allow specific source queues, or deny all source queues. The default is to allow all source queues to use the dead-letter queue. If you choose to allow specific queues (using the queue option), you can specify up to 10 source queues using the source queue Amazon Resource Name (ARN). If you specify denyAll, the queue cannot be used as a dead-letter queue. (AWS)



References:

[1] https://docs.mongodb.com/manual/changeStreams/

[2] https://aws.amazon.com/eventbridge/

[3] https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/reducing-costs.html

[4] https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html

[5] https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html

[6] https://www.cloudforecast.io/blog/aws-tagging-best-practices/

[7] https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html


