When to use a Queue
Brett Flegg
Drawing boxes and arrows on whiteboards and writing documents nobody reads
I conduct many systems design interviews, and I have recently noticed that candidates seem to have an unnatural infatuation with queues¹. Not that I have a problem with queues. I love queues. They have all sorts of uses in systems design, but they also create many challenges. So, if you are going to include a queue in your design, you had better be ready to explain what problems it solves and how you intend to mitigate the challenges it introduces.
Here is a super quick primer on queues for my non-technical Dad (who diligently reads each of these articles and reports back on the slightest typo). In computer science, a queue is a first-in-first-out data structure, just like its namesake – the British queue (aka a "line" for Americans). Items can be added to the queue's end and removed from the front - no cutting allowed. Implementing a reliable, fault-tolerant, performant, and scalable distributed queue is a big undertaking. Fortunately, there are many great off-the-shelf options, from simple storage queues like Azure Storage Queues to complex message broker services like Apache ActiveMQ and Google Pub/Sub. These implementations often include lots of additional bells and whistles (dequeue leases, support for multiple subscribers, etc.). Still, at the most basic level, they allow one or more producers to add items to the queue (enqueue) and one or more consumers to remove items from the queue (dequeue).
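If you prefer code to British queuing analogies, here is a tiny Python sketch of the enqueue/dequeue behaviour described above. An in-memory deque stands in for a real distributed queue, and the "customer" strings are just for illustration:

```python
from collections import deque

# A tiny illustration of first-in-first-out behaviour; real services would use
# a distributed queue (Azure Storage Queues, Pub/Sub, etc.), not an in-memory deque.
q = deque()
q.append("first customer")    # enqueue: join the back of the line
q.append("second customer")
print(q.popleft())            # dequeue from the front -> "first customer"
print(q.popleft())            # -> "second customer"
```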
Modern microservice architectures make extensive use of queues (more on this later), but even software that is probably older than most of my readers used queues. Almost three decades ago, Systems Management Server 1.0 used a file-based inbox model to serialize writes to its database. In this initial incarnation, clients would write files to an open file share on Client Access Points, which would then be batch-copied to a central site server that would process each of the files and update the database² — a queue with multiple producers and a single consumer.
An exhaustive list of the uses of queues in modern microservice architectures is beyond the scope of this article, but let's talk about a few key use cases.
Distribute & balance work - When you order a coffee at Starbucks, the cashier (front-end) will take your order and place a cup with your name on the counter (the queue) for the next available barista (back-end) to service. Some drinks can be made quickly (drip coffee), and some take much longer (double ristretto venti half soy nonfat decaf organic chocolate brownie iced vanilla double-shot Frappuccino), but this complexity is abstracted away from each cog in the system. The cashiers simply need to take orders as quickly as possible, and each barista simply needs to make the next drink in line. This system has some cool features: work is automatically balanced across the baristas, and neither side needs to know how many of the other there are.
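Here is a rough Python sketch of that worker-pool pattern – several "baristas" pulling from one shared queue of orders. The drink names and timings are invented, and a real system would use a distributed queue rather than threads in a single process:

```python
import queue
import threading
import time

# Several worker threads ("baristas") pull from one shared queue of orders.
orders = queue.Queue()
DRINK_TIMES = {"drip coffee": 0.1, "frappuccino": 0.5}   # made-up prep times

def barista(name: str) -> None:
    while True:
        order = orders.get()
        if order is None:                      # sentinel: time to go home
            orders.task_done()
            break
        time.sleep(DRINK_TIMES.get(order, 0.2))  # "make" the drink
        print(f"{name} finished a {order}")
        orders.task_done()

workers = [threading.Thread(target=barista, args=(f"barista-{i}",)) for i in range(3)]
for w in workers:
    w.start()
for order in ["drip coffee", "frappuccino", "drip coffee", "frappuccino"]:
    orders.put(order)                          # the cashier just enqueues and moves on
for _ in workers:
    orders.put(None)                           # one shutdown sentinel per worker
for w in workers:
    w.join()
```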
Security Isolation - My first bank used a message queue in the form of pneumatic tubes to isolate the tellers (front-end) from the vault (back-end). When you walked in, tellers would meet with you in the front of the bank, write down your deposit or withdrawal information on a slip of paper, place it in a canister, and send it whizzing to the vault in the back of the building. In the vault, another worker was waiting to service the request and would send back the deposit slip or cash through a return tube. The system ensured that even if the front-ends were compromised (i.e., a hold-up), the back-end would remain secure³. We still use this pattern today in software design - allowing front-end servers to issue read requests directly but forcing write operations through a message queue to limit the blast radius of attacks.
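In code, the same idea looks something like the sketch below: the front-end may read state directly, but every mutation travels through the queue to a worker running in the trusted back-end. The account data, handler names, and validation rule are all invented for illustration:

```python
import json
import queue

# "Reads go direct, writes go through the queue" – a compromised front-end can
# only ask for changes; the trusted back-end decides whether to apply them.
accounts = {"alice": 100}
write_queue = queue.Queue()

def handle_read(account: str) -> int:
    # Front-end reads directly; nothing here can mutate state.
    return accounts[account]

def handle_withdrawal(account: str, amount: int) -> None:
    # Writes are only expressed as messages for the back-end to process later.
    write_queue.put(json.dumps({"op": "withdraw", "account": account, "amount": amount}))

def back_end_worker() -> None:
    # Runs in the "vault": validates each request before applying it.
    while not write_queue.empty():
        msg = json.loads(write_queue.get())
        if msg["op"] == "withdraw" and accounts[msg["account"]] >= msg["amount"]:
            accounts[msg["account"]] -= msg["amount"]

handle_withdrawal("alice", 30)
back_end_worker()
print(handle_read("alice"))   # 70
```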
Protection from service outages and failures - As traditional monolithic services get split into more and more microservices, each of which can (and will) fail, queues provide a critical buffer to mitigate temporary service outages. Without queues, a failure in one component can quickly cascade and bring down the whole service.
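A small sketch of that buffering behaviour, assuming a stand-in flaky_backend function: when the back-end is down, messages simply stay in (or return to) the queue instead of being dropped, and the backlog is drained once it recovers:

```python
import queue

# Failed messages are re-enqueued rather than lost, so a temporary back-end
# outage does not cascade back to the front-end. `flaky_backend` is a stand-in.
work = queue.Queue()

def flaky_backend(msg: str, healthy: bool) -> None:
    if not healthy:
        raise ConnectionError("back-end unavailable")
    print(f"processed {msg}")

def consume(healthy: bool) -> None:
    pending = work.qsize()
    for _ in range(pending):
        msg = work.get()
        try:
            flaky_backend(msg, healthy)
        except ConnectionError:
            work.put(msg)          # leave the work in the queue for later

for m in ["order-1", "order-2"]:
    work.put(m)
consume(healthy=False)             # outage: nothing is lost, the queue still holds both
consume(healthy=True)              # back-end recovers and drains the backlog
```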
Reduce coupling - Finally, message queues like Google Pub/Sub are excellent tools to reduce coupling between components. They enable producers to signal that an event has occurred without worrying about who is consuming it. The solution to the classic 'Design Twitter' systems design problem leverages this: it uses a message queue to notify interested parties (including services that build indexes, send emails, etc.) that a new tweet has been created. Subscribers can be added or removed at any time without having to rewrite the core service.
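A toy publish/subscribe sketch of that fan-out, with invented topic and handler names (a real design would sit on a managed broker such as Google Pub/Sub rather than an in-memory dictionary):

```python
from collections import defaultdict
from typing import Callable

# The producer publishes events without knowing (or caring) who is listening;
# subscribers can be added or removed without touching the core service.
subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, event: dict) -> None:
    for handler in subscribers[topic]:
        handler(event)

subscribe("tweet.created", lambda e: print(f"indexing tweet {e['id']}"))
subscribe("tweet.created", lambda e: print(f"emailing followers about {e['id']}"))
publish("tweet.created", {"id": 42, "text": "hello world"})
```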
So, if queues have all these advantages, why don't we use them everywhere? As Knuth said, "Premature optimization is the root of all evil (or at least most of it) in programming." The British like to stand in queues, but even they have their limits – you won't see a formal line at Tesco to grab a box of cereal off the shelf. Creating and managing queues has a cost and can add a lot of overhead for some operations (if you have ever had to zig-zag through the stanchion and rope guides at an empty airport, you have experienced this firsthand). Queues add complexity for both the service owner and the caller.
In most cases, the use of a queue in a service is not directly exposed to the caller – it tends to be hidden behind an asynchronous API or long-running operation. And while languages and frameworks have gotten pretty good at abstracting this, it is still a lot easier to get the actual response than a ticket we need to poll. There are a lot more hidden complexities on the producer side of things.
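To make that concrete, here is a hedged sketch of the "ticket you need to poll" shape a queue-backed API often takes: submit returns an operation id straight away, and the caller polls for the result. The function names (submit, get_status) and the in-memory operation table are purely illustrative:

```python
import time
import uuid

# Submit returns a ticket immediately; the real work happens later, and the
# caller has to poll for the outcome.
_operations: dict[str, dict] = {}

def submit(payload: str) -> str:
    op_id = str(uuid.uuid4())
    _operations[op_id] = {"status": "queued", "payload": payload, "result": None}
    return op_id                       # caller gets a ticket, not the answer

def _worker_pass() -> None:            # in reality a separate consumer process
    for op in _operations.values():
        if op["status"] == "queued":
            op["result"] = op["payload"].upper()
            op["status"] = "done"

def get_status(op_id: str) -> dict:
    return _operations[op_id]

ticket = submit("hello")
while get_status(ticket)["status"] != "done":
    _worker_pass()                     # stand-in for waiting on the real back-end
    time.sleep(0.01)
print(get_status(ticket)["result"])    # HELLO
```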
Measuring Quality of Service (QoS) - Let's go back to the Starbucks example. If all we measure is how long it takes the cashier to ring up the transaction or the time it takes for a barista to make a drink, we have lost sight of how our customers experience the system: the total time it takes them to get their drink. Unfortunately, though, all too often, service owners ignore this fact. When you introduce queues into your system's design, you must also invest in a system to track and report on the end-to-end performance. Doing this well is not simple – and made even more complex when it is not considered upfront and needs to be retrofitted to an existing design.
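One simple way to keep sight of the end-to-end number is to stamp each message when it is enqueued and report the full wait-plus-processing time when it completes. A minimal sketch, with invented field names:

```python
import time
import queue

# Stamp messages at enqueue time so the consumer can report queue wait and
# total end-to-end latency, not just its own processing time.
orders = queue.Queue()

def enqueue(order: str) -> None:
    orders.put({"order": order, "enqueued_at": time.monotonic()})

def process() -> None:
    msg = orders.get()
    started = time.monotonic()
    time.sleep(0.05)                               # stand-in for making the drink
    finished = time.monotonic()
    print(f"{msg['order']}: waited {started - msg['enqueued_at']:.3f}s, "
          f"total {finished - msg['enqueued_at']:.3f}s end to end")

enqueue("latte")
time.sleep(0.1)                                    # time spent sitting in the queue
process()
```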
Context is lost - Managing context across queues goes hand-in-hand with measuring QoS - but there is more to it. Your Starbucks cashier writes your name and your order on your cup – but sometimes this breaks down (anyone named James or Mary probably knows what I am talking about). As systems design engineers, we must be careful to preserve identity, security context, and troubleshooting information. Service frameworks usually handle these sorts of things automatically when we make direct calls. When we use queues, we typically need to manage them ourselves – which can be quite tricky.
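A common mitigation is to wrap every payload in a small envelope that carries the caller's identity and a correlation id across the queue boundary, so consumer-side logs can be tied back to the original request. A sketch, with illustrative field names:

```python
import uuid

# Wrap each payload in an envelope carrying identity and a correlation id so
# the consumer can log and authorize with the original caller's context.
def make_envelope(payload: dict, user: str, correlation_id: str | None = None) -> dict:
    return {
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "user": user,                 # identity / security context to re-check later
        "payload": payload,
    }

def consume(envelope: dict) -> None:
    # Every log line includes the correlation id, so a stuck order can be
    # traced end to end across the queue.
    print(f"[{envelope['correlation_id']}] processing for {envelope['user']}: {envelope['payload']}")

consume(make_envelope({"drink": "latte"}, user="james"))
```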
Queue backlogs - One of the worst live-site incidents I was involved with was due to an improperly bounded queue. As discussed above, queues allow us to smooth out peak load – but they aren't a silver bullet for scale issues. A large, sustained load beyond what the back-end can handle will result in a queue growing and growing. When designing a system with queues, we need to put a bound on the queue size, ensure our front-ends gracefully handle 'queue full' messages, and test our back-ends to know how long it will take to service a full queue. If you don't do this, make sure you are good at apologizing – like me, you may be sent on an apology tour and be asked to explain to each of your customers why the system went down.
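Here is what the "bound the queue and push back" part can look like in miniature – the front-end gets an immediate queue-full signal it can turn into a polite "please retry later" instead of letting the backlog grow forever:

```python
import queue

# A bounded queue gives the front-end an explicit "queue full" signal instead
# of silently accepting work the back-end can never catch up on.
orders = queue.Queue(maxsize=2)

def try_enqueue(order: str) -> bool:
    try:
        orders.put_nowait(order)
        return True
    except queue.Full:
        # Push back gracefully (e.g., HTTP 429 / "please retry later").
        return False

for o in ["order-1", "order-2", "order-3"]:
    print(o, "accepted" if try_enqueue(o) else "rejected: queue full")
```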
Messages get lost and/or duplicated - This subject probably requires its own article. All too often, I find engineers believe message queues are infallible. They point to the documentation that states that order is guaranteed and that messages are never lost. But even if this were true in principle - it rarely is in practice. While the queue itself may not mess up, you can bet your bonus that something else will go wrong (either a coding bug in the producer/consumer, a disaster recovery drill, or an overly tired SRE manually deleting messages from the queue at 2 am in a desperate attempt to get the service back online). These practical realities shouldn't be ignored – and we need to build functionality into our system to detect and repair inconsistencies.
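One standard defence is an idempotent consumer: remember which message ids have already been applied so a duplicate delivery is a no-op. A toy sketch (a real system would persist the processed-id set durably, not keep it in memory):

```python
# Track already-applied message ids so a redelivered (duplicate) message
# does not double-charge anyone.
processed_ids: set[str] = set()
balance = 100

def apply(message: dict) -> None:
    global balance
    if message["id"] in processed_ids:
        return                        # duplicate delivery: safely ignored
    balance -= message["amount"]
    processed_ids.add(message["id"])

msg = {"id": "withdrawal-7", "amount": 30}
apply(msg)
apply(msg)                            # the same message delivered twice
print(balance)                        # 70, not 40
```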
So, the next time you think through the design of a component you are developing for work (or a systems design interview), make sure you think through the pros and cons. If it still makes sense to use a queue, take the time to think through the challenges it introduces.
Be Happy!
Like this post? Please consider sharing, checking out my other articles, and subscribing to my weekly Flegg’s Follies newsletter for more articles on software engineering and careers in tech.
Footnotes:
Please note that the opinions stated here are my own, not those of my company.