Some Thoughts on Monitoring SQS
The following all applies to a company that has not yet adopted microservices, or, if it has, has not yet split those microservices into separate tenancies (i.e., all the systems live in the same AWS region and account).
Case 1: Asynchronous process, not time sensitive, with reserved concurrency
An example of this use case is sending metrics to Mixpanel. You typically don't want to bog down your client-side or even server-side applications with the latency of sending a metric to Mixpanel synchronously. If you do it on the server side, it adds a little latency to every request. If you do it on the client side, you have the latency issue too, but what's even worse is that you risk losing metrics whenever your users refresh the page or quit before the metric is sent. So what you typically do is have a very fast endpoint on your backend that acknowledges receipt of the request and fires it off to SNS (it might even do that asynchronously). An SQS queue is subscribed to the SNS topic, and a Lambda consumer processes the queue. Queues of this nature should have reserved concurrency limits specified so that there's no possibility of them choking your critical synchronous workloads by eating up your account's Lambda concurrent execution limit.
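As a rough sketch, here's what that SNS → SQS → Lambda wiring with reserved concurrency could look like in AWS CDK (TypeScript). The construct names, handler path, and the concurrency cap of 10 are illustrative assumptions, not prescriptions:

```ts
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as subs from 'aws-cdk-lib/aws-sns-subscriptions';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

export class MetricsPipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // The fast backend endpoint publishes to this topic and returns immediately.
    const topic = new sns.Topic(this, 'MetricsTopic');

    // Buffer between SNS and the consumer; this workload is not time sensitive.
    const queue = new sqs.Queue(this, 'MetricsQueue', {
      visibilityTimeout: Duration.seconds(60),
    });
    topic.addSubscription(new subs.SqsSubscription(queue));

    // Consumer with reserved concurrency, so a backlog here can never starve
    // critical synchronous Lambdas of concurrent executions.
    const consumer = new lambda.Function(this, 'MetricsConsumer', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/metrics-consumer'), // hypothetical path
      reservedConcurrentExecutions: 10, // illustrative cap
    });
    consumer.addEventSource(new SqsEventSource(queue, { batchSize: 10 }));
  }
}
```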
Metrics to monitor:
Case 2: Asynchronous process, time sensitive, without reserved concurrency
An example use case here might be something like a long-running process initiated by clients, where the client fires off a request to do a thing, the thing usually takes 10-45 seconds, the backend does the thing asynchronously, and the client polls the backend for status while the thing is running. Think of something like provisioning a new resource, manual test runs in CI, etc.
In this case, since your process is time sensitive, you can't really put a reserved concurrency limit on it: capping concurrency means messages sit in the queue during a burst, which is exactly what you can't afford here. Because of this, the first two monitors mentioned above are much more limited in usefulness (you should still have them, because they'll let you know when you've hit your per-region, per-account Lambda concurrent execution limit). So you'll still have all the same monitors as above, but the main one you want to focus on is ApproximateAgeOfOldestMessage. If this spikes beyond a few seconds, something is very wrong, and you're not doing the thing your customers want you to do for them.
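A minimal sketch of that alarm in CDK (TypeScript), assuming the queue is defined in the same app; the 30-second threshold and evaluation periods are illustrative and should be tuned to your own latency expectations:

```ts
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Alarm when the oldest message in a time-sensitive queue has been waiting too long.
function addQueueLatencyAlarm(scope: Construct, queue: sqs.Queue): cloudwatch.Alarm {
  return new cloudwatch.Alarm(scope, 'OldestMessageAgeAlarm', {
    metric: queue.metricApproximateAgeOfOldestMessage({
      period: Duration.minutes(1),
      statistic: 'Maximum',
    }),
    threshold: 30, // seconds; illustrative, pick whatever "a few seconds" means for your SLO
    evaluationPeriods: 2,
    comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  });
}
```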
Case 3: The queue is consumed by a non-elastic consumer
In some cases, queue consumers share so much code with your app server that it makes sense to put them in the same code package as your backend. In these cases, they'll probably not be deployed to Lambda (that would be a cold-start nightmare) but instead to something like ECS or EC2. While it's possible to set up autoscaling in these scenarios based on CPU and memory utilization, many companies don't do it. If this is the case for some of your queues, you still want all the monitors above except for Lambda throttles, and you want to replace that one with a monitor on consumer capacity utilization. If your setup allows it, the ideal monitor here is the percentage of active queue consumer threads relative to total capacity. If not, you can use P90 memory utilization of your consumer cluster as a proxy.
To expand on the above, say your consumer cluster is 16 containers, each running SQS consumers with 64 threads of capacity. Your total capacity for consuming messages is therefore 16 * 64 = 1,024 simultaneous executions. If you can alarm when actual utilization reaches 75% of those 1,024 threads, that's ideal (see the sketch below). If you can't, then alarm on a more general memory metric, such as JMX heap utilization if you're running a Java backend. As a last resort you can alarm on the memory utilization of the container/VM itself, but in many setups this alarm would rarely, if ever, fire. That's because people often don't take the time to tune their language runtime's resource allocation to max out the capacity of their server. You might have a Node.js server with 4 GB of RAM while your version of Node has a default max heap size of 1 GB, so an alarm on the VM's memory utilization is only going to fire if someone hits you with a mining virus or something.
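If you do have visibility into the consumer's internals, a rough sketch of publishing that thread-utilization number as a custom CloudWatch metric might look like this (TypeScript with AWS SDK v3). The namespace, dimension, and the getActiveConsumerThreads() helper are all hypothetical placeholders for whatever your worker actually exposes:

```ts
import {
  CloudWatchClient,
  PutMetricDataCommand,
  StandardUnit,
} from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({});

const TOTAL_THREADS = 64; // per-container consumer thread capacity (assumption)

// Hypothetical hook into your worker pool; replace with however your consumer
// reports the number of threads currently processing a message.
declare function getActiveConsumerThreads(): number;

// Publish utilization once a minute; alarm on this metric at ~75% across the cluster.
async function publishConsumerUtilization(): Promise<void> {
  const active = getActiveConsumerThreads();
  const utilization = (active / TOTAL_THREADS) * 100;

  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: 'MyApp/QueueConsumers', // hypothetical namespace
    MetricData: [{
      MetricName: 'ConsumerThreadUtilization',
      Dimensions: [{ Name: 'Queue', Value: 'long-running-jobs' }], // hypothetical
      Unit: StandardUnit.Percent,
      Value: utilization,
    }],
  }));
}

setInterval(() => {
  publishConsumerUtilization().catch((err) => console.error('metric publish failed', err));
}, 60_000);
```

Aggregating the metric with the Average statistic across containers gives you cluster-wide utilization, which is the number you'd actually alarm on at 75%.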