Some Thoughts on Monitoring SQS
The following all applies to a company that has not yet adopted microservices, or, if it has, has not yet split those microservices into separate tenancies (i.e., all the systems live in the same AWS region and account).
Case 1: Asynchronous process, not time sensitive, with reserved concurrency
An example of this use case is sending metrics to Mixpanel. You typically don't want to bog down your client-side or even server-side applications with the latency of sending a metric to Mixpanel synchronously. If you do it on the server side, it adds a little latency to every request. If you do it on the client side, you have the latency issue too, but what's even worse is that you risk losing metrics whenever your users refresh the page or quit before the metric is sent. So what you typically do is have a very fast endpoint on your backend that acknowledges receipt of the request and fires it off to SNS (it might even do that asynchronously). An SQS queue is subscribed to the SNS topic, and a Lambda consumer processes the queue. Queues of this nature should have reserved concurrency limits specified so that there's no possibility of them choking your critical synchronous workloads by eating up your account's Lambda concurrent execution limit.
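As a rough sketch, here's what that SNS → SQS → Lambda wiring with reserved concurrency could look like in AWS CDK (TypeScript). The construct names, handler path, and the concurrency cap of 10 are illustrative assumptions, not prescriptions:

```ts
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as subs from 'aws-cdk-lib/aws-sns-subscriptions';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

export class MetricsPipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // The fast backend endpoint publishes to this topic and returns immediately.
    const topic = new sns.Topic(this, 'MetricsTopic');

    // Buffer between SNS and the consumer; this workload is not time sensitive.
    const queue = new sqs.Queue(this, 'MetricsQueue', {
      visibilityTimeout: Duration.seconds(60),
    });
    topic.addSubscription(new subs.SqsSubscription(queue));

    // Consumer with reserved concurrency, so a backlog here can never starve
    // critical synchronous Lambdas of concurrent executions.
    const consumer = new lambda.Function(this, 'MetricsConsumer', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/metrics-consumer'), // hypothetical path
      reservedConcurrentExecutions: 10, // illustrative cap
    });
    consumer.addEventSource(new SqsEventSource(queue, { batchSize: 10 }));
  }
}
```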
Metrics to monitor:
Case 2: Asynchronous process, time sensitive, without reserved concurrency
An example use case here might be something like a long-running process initiated by clients, where the client fires off a request to do a thing, the thing usually takes 10-45 seconds, the backend does the thing asynchronously, and the client polls the backend for status while the thing is running. Think of something like provisioning a new resource, manual test runs in CI, etc.
In this case, since your process is time sensitive, you can't really put a reserved concurrency limit on it: capping concurrency means messages sit in the queue during a burst, which is exactly what you can't afford here. Because of this, the first two monitors mentioned above are much more limited in usefulness (you should still have them, because they'll let you know when you've hit your per-region, per-account Lambda concurrent execution limit). So you'll still have all the same monitors as above, but the main one you want to focus on is ApproximateAgeOfOldestMessage. If this spikes beyond a few seconds, something is very wrong, and you're not doing the thing your customers want you to do for them.
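A minimal sketch of that alarm in CDK (TypeScript), assuming the queue is defined in the same app; the 30-second threshold and evaluation periods are illustrative and should be tuned to your own latency expectations:

```ts
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Alarm when the oldest message in a time-sensitive queue has been waiting too long.
function addQueueLatencyAlarm(scope: Construct, queue: sqs.Queue): cloudwatch.Alarm {
  return new cloudwatch.Alarm(scope, 'OldestMessageAgeAlarm', {
    metric: queue.metricApproximateAgeOfOldestMessage({
      period: Duration.minutes(1),
      statistic: 'Maximum',
    }),
    threshold: 30, // seconds; illustrative, pick whatever "a few seconds" means for your SLO
    evaluationPeriods: 2,
    comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  });
}
```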
Case 3: The queue is consumed by a non-elastic consumer
In some cases, queue consumers share so much code with your app server that it makes sense to put them in the same code package as your backend. In these cases, they'll probably not be deployed to Lambda (that would be a cold-start nightmare) but instead to something like ECS or EC2. While it's possible to set up autoscaling in these scenarios based on CPU and memory utilization, many companies don't do it. If this is the case for some of your queues, you still want all the monitors above except for Lambda throttles, and you want to replace that one with a monitor on consumer capacity utilization. If your setup allows it, the ideal monitor here is the percentage of active queue consumer threads relative to total capacity. If not, you can use P90 memory utilization of your consumer cluster as a proxy.
To expand on the above, say your consumer cluster is 16 containers, each running SQS consumers with 64 threads of capacity. Your total capacity for consuming messages is therefore 16 * 64 = 1,024 simultaneous executions. If you can alarm when actual utilization reaches 75% of those 1,024 threads, that's ideal (see the sketch below). If you can't, then alarm on a more general memory metric, such as JMX heap utilization if you're running a Java backend. As a last resort you can alarm on the memory utilization of the container/VM itself, but in many setups this alarm would rarely, if ever, fire. That's because people often don't take the time to tune their language runtime's resource allocation to max out the capacity of their server. You might have a Node.js server with 4 GB of RAM while your version of Node has a default max heap size of 1 GB, so an alarm on the VM's memory utilization is only going to fire if someone hits you with a mining virus or something.
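If you do have visibility into the consumer's internals, a rough sketch of publishing that thread-utilization number as a custom CloudWatch metric might look like this (TypeScript with AWS SDK v3). The namespace, dimension, and the getActiveConsumerThreads() helper are all hypothetical placeholders for whatever your worker actually exposes:

```ts
import {
  CloudWatchClient,
  PutMetricDataCommand,
  StandardUnit,
} from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({});

const TOTAL_THREADS = 64; // per-container consumer thread capacity (assumption)

// Hypothetical hook into your worker pool; replace with however your consumer
// reports the number of threads currently processing a message.
declare function getActiveConsumerThreads(): number;

// Publish utilization once a minute; alarm on this metric at ~75% across the cluster.
async function publishConsumerUtilization(): Promise<void> {
  const active = getActiveConsumerThreads();
  const utilization = (active / TOTAL_THREADS) * 100;

  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: 'MyApp/QueueConsumers', // hypothetical namespace
    MetricData: [{
      MetricName: 'ConsumerThreadUtilization',
      Dimensions: [{ Name: 'Queue', Value: 'long-running-jobs' }], // hypothetical
      Unit: StandardUnit.Percent,
      Value: utilization,
    }],
  }));
}

setInterval(() => {
  publishConsumerUtilization().catch((err) => console.error('metric publish failed', err));
}, 60_000);
```

Aggregating the metric with the Average statistic across containers gives you cluster-wide utilization, which is the number you'd actually alarm on at 75%.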