When to use a Queue

I conduct many systems design interviews, and I have recently noticed that candidates seem to have an unnatural infatuation with queues [1]. Not that I have a problem with queues. I love queues. They have all sorts of uses in systems design, but they also create many challenges. So, if you are going to include a queue in your design, you had better be ready to explain what problems it solves and talk about how you intend to mitigate the challenges it introduces.

Here is a super quick primer on queues for my non-technical Dad (who diligently reads each of these articles and reports back on the slightest typo). In computer science, a queue is a first-in-first-out data structure, just like its namesake – the British queue (aka a "line" for Americans). Items can be added to the queue's end and removed from the front – no cutting allowed. Implementing a reliable, fault-tolerant, performant, and scalable distributed queue is a big undertaking. Fortunately, there are many great off-the-shelf options, from simple storage queues like Azure Storage Queues to complex message broker services like Apache ActiveMQ and Google Pub/Sub. These implementations often include lots of additional bells and whistles (dequeue leases, support for multiple subscribers, etc.). Still, at the most basic level, they allow one or more producers to add items to the queue (enqueue) and one or more consumers to remove items from the queue (dequeue).
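For readers who prefer code to coffee shops, here is a minimal sketch of that basic contract using Python's standard-library deque. It is an in-memory toy, not a distributed queue, and the customer names are made up for illustration.

```python
from collections import deque

# A minimal in-memory FIFO queue. Real distributed queues add durability,
# leases, and multiple subscribers on top of this basic contract.
line = deque()

# Producers enqueue at the back of the line.
for customer in ["Alice", "Bob", "Carol"]:
    line.append(customer)

# Consumers dequeue from the front: first in, first out, no cutting.
served = []
while line:
    served.append(line.popleft())

print(served)
```

The customers come out in exactly the order they went in, which is the whole point.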

Modern microservice architectures make extensive use of queues (more on this later), but even software that is probably older than most of my readers used queues. Almost three decades ago, Systems Management Server 1.0 used a file-based inbox model to serialize writes to its database. In this initial incarnation, clients would write files to an open file share on Client Access Points, which would then be batch copied to a central site server that would process each of the files and update the database [2] – a queue with multiple producers and a single consumer.

An exhaustive list of the uses of queues in modern microservice architectures is beyond the scope of this article, but let's talk about a few key use cases.

Distribute & balance work - When you order a coffee at Starbucks, the cashier (front-end) will take your order and place a cup with your name on the counter (the queue) for the next available barista (back-end) to service. Some drinks can be made quickly (drip coffee), and some take much longer (double ristretto venti half soy nonfat decaf organic chocolate brownie iced vanilla double-shot Frappuccino), but this complexity is abstracted away from each cog in the system. The cashiers simply need to take orders as quickly as possible, and each barista simply needs to make the next drink in line. This system has some cool features:

  • The model enables the system to 'smooth out' the peak load by quickly and efficiently handling the intake of new requests (thus preventing the line from getting too long)
  • We can dynamically scale our front-end and back-end as needed (e.g., if the line goes out the door, we can add more cashiers – if people are ordering lots of fancy drinks, we can add more workers).
  • Workers can 'look ahead' at the queue to find simple optimizations (e.g., steam milk for multiple orders at once).
  • Workers can 'specialize' in specific tasks (e.g., a worker especially good at making Frappuccinos can selectively service these orders).
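The cashier-and-barista arrangement can be sketched with Python's standard queue and threading modules. This is a single-process illustration under my own assumptions: run_shift and barista are hypothetical names, and a real system would put a durable broker between separate services.

```python
import queue
import threading

def run_shift(drinks, barista_count=2):
    """Cashier enqueues orders; a pool of baristas pulls the next cup when free."""
    orders = queue.Queue()       # the counter where cups wait
    completed = []               # (barista, drink) pairs, in completion order
    lock = threading.Lock()

    def barista(name):
        while True:
            drink = orders.get()
            if drink is None:    # sentinel: the shift is over for this barista
                return
            with lock:
                completed.append((name, drink))

    workers = [
        threading.Thread(target=barista, args=(f"barista-{i}",))
        for i in range(barista_count)
    ]
    for worker in workers:
        worker.start()
    for drink in drinks:         # the cashier only takes orders, never makes drinks
        orders.put(drink)
    for _ in workers:            # one sentinel per barista, queued after all orders
        orders.put(None)
    for worker in workers:
        worker.join()
    return completed

made = run_shift(["drip", "latte", "frappuccino", "mocha"])
print(sorted(drink for _, drink in made))
```

Scaling the back-end is a matter of changing barista_count; the cashier code does not change at all.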

Security Isolation - My first bank used a message queue in the form of pneumatic tubes to isolate the tellers (front-end) from the vault (back-end). When you walked in, tellers would meet with you in the front of the bank and write down your deposit or withdrawal information on a slip of paper, place it in a canister and send it whizzing to the vault in the back of the building. In the vault, another worker was waiting to service the request and would send back the deposit slip or cash through a return tube. The system ensured that even if the front-ends were compromised (i.e., a hold-up), the back-end would remain secure [3]. We still use this pattern today in software design - allowing front-end servers to issue read requests directly but forcing write operations through a message queue to limit the blast radius of attacks.

Protection for service outages and failures - As traditional monolithic services get split into more and more microservices, each of which can (and will) fail, queues provide a critical buffer to mitigate temporary service outages. Without queues, a failure in one component can quickly cascade and bring down the whole service.
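A rough sketch of that buffering behaviour, assuming a hypothetical FlakyService that is unavailable for its first few calls: messages stay at the front of the queue until the downstream service accepts them. A production consumer would also back off between attempts, which is omitted here.

```python
from collections import deque

class FlakyService:
    """Hypothetical downstream service that is down for its first few calls."""

    def __init__(self, failures_before_recovery):
        self.remaining_failures = failures_before_recovery
        self.handled = []

    def handle(self, message):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("service unavailable")
        self.handled.append(message)

def drain(buffer, service, max_attempts=10):
    """Deliver buffered messages in order; a message is removed only on success."""
    attempts = 0
    while buffer and attempts < max_attempts:
        attempts += 1
        try:
            service.handle(buffer[0])   # peek at the front; do not remove yet
        except ConnectionError:
            continue                    # outage: the message stays queued
        buffer.popleft()                # success: now it is safe to remove
    return attempts

buffer = deque(["order-1", "order-2", "order-3"])
service = FlakyService(failures_before_recovery=2)
attempts = drain(buffer, service)
print(attempts, service.handled)
```

The producer never noticed the outage; the backlog simply waited until the service came back.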

Reduce coupling - Finally, message queues like Google Pub/Sub are excellent tools to reduce coupling between components. They enable producers to signal that an event has occurred without worrying about consumers. The solution to the classic 'Design Twitter' systems design problem leverages this. It uses a message queue to notify interested parties (including services that build indexes, send emails, etc.) that a new tweet has been created. Subscribers can be added/removed anytime without having to rewrite the core service.
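To make the decoupling concrete, here is a toy publish/subscribe sketch. EventBus, the "tweet.created" topic, and both subscribers are names I made up; this is not the Google Pub/Sub API, and unlike a real broker it delivers synchronously inside one process.

```python
from collections import defaultdict

class EventBus:
    """Toy in-process publish/subscribe; a real broker delivers asynchronously."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The publisher does not know or care who is listening.
        for handler in list(self._subscribers.get(topic, [])):
            handler(event)

bus = EventBus()
search_index = []   # stands in for an indexing service
outbox = []         # stands in for a notification service

bus.subscribe("tweet.created", search_index.append)
bus.subscribe(
    "tweet.created",
    lambda event: outbox.append(f"notify followers of {event['author']}"),
)

# The core service announces the event and moves on.
bus.publish("tweet.created", {"author": "brett", "text": "I love queues"})
print(search_index, outbox)
```

Adding a third subscriber is one more subscribe call; the publishing code stays untouched.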

So, if queues have all these advantages, why don't we use them everywhere? As Knuth said, "Premature optimization is the root of all evil (or at least most of it) in programming." The British like to stand in queues, but even they have their limits – you won't see a formal line at Tesco to grab a box of cereal off the shelf. Creating and managing queues has a cost and can add a lot of overhead for some operations (if you have ever had to zig-zag through the stanchion and rope guides at an empty airport, you have experienced this firsthand). Queues have complexity for both the service owner and the caller.

In most cases, the use of a queue in a service is not directly exposed to the caller – it tends to be hidden behind an asynchronous API or long-running operation. And while languages and frameworks have gotten pretty good about abstracting this, it is still a lot easier to get the actual response than a ticket we need to poll. There are a lot more hidden complexities on the service owner's side of things.

Measuring Quality of Service (QoS) - Let's go back to the Starbucks example. If all we measure is how long it takes the cashier to ring up the transaction or the time it takes for a barista to make a drink, we have lost sight of how our customers experience the system: the total time it takes them to get their drink. Unfortunately, though, all too often, service owners ignore this fact. When you introduce queues into your system's design, you must also invest in a system to track and report on the end-to-end performance. Doing this well is not simple – and made even more complex when it is not considered upfront and needs to be retrofitted to an existing design.
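One way to keep the customer's view in sight is to stamp each message when it is accepted and report queue wait, service time, and end-to-end time together. The sketch below assumes a single process and uses hypothetical names (Order, place_order, serve_next); across machines you would need synchronized wall-clock timestamps or a tracing system instead of a local clock.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Order:
    drink: str
    enqueued_at: float   # stamped by the cashier when the order is accepted

def place_order(orders, drink, clock=time.monotonic):
    orders.append(Order(drink, enqueued_at=clock()))

def serve_next(orders, make_drink, clock=time.monotonic):
    """Make the next drink and report the timings the customer actually feels."""
    order = orders.popleft()
    started_at = clock()
    make_drink(order.drink)
    finished_at = clock()
    return {
        "queue_wait": started_at - order.enqueued_at,
        "service_time": finished_at - started_at,
        "end_to_end": finished_at - order.enqueued_at,
    }

# A fake clock keeps the example deterministic: ordered at t=0,
# barista starts at t=5, drink is ready at t=7.
ticks = iter([0.0, 5.0, 7.0])

def fake_clock():
    return next(ticks)

orders = deque()
place_order(orders, "latte", clock=fake_clock)
timings = serve_next(orders, make_drink=lambda drink: None, clock=fake_clock)
print(timings)
```

Measured in isolation, the barista took two units of time; the customer waited seven.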

Context is lost - Managing context across queues goes hand-in-hand with measuring QoS - but there is more to it. Your Starbucks cashier writes your name and your order on your cup – but sometimes this breaks down (anyone named James or Mary probably knows what I am talking about). As systems design engineers, we must be cautious about preserving identity, security context, and troubleshooting information. Service frameworks usually handle these sorts of things automatically when we make direct calls. When we use queues, we typically need to manage them ourselves – which can be quite tricky.
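A common way to carry context by hand is to wrap every payload in an envelope. The sketch below is only illustrative: wrap, unwrap, and the field names are my own assumptions rather than any framework's format, and a real consumer should verify a claimed identity instead of trusting it.

```python
import json
import uuid

REQUIRED_FIELDS = {"correlation_id", "user_id", "payload"}

def wrap(payload, *, user_id, correlation_id=None):
    """Producer side: attach the context a direct call would carry implicitly."""
    return json.dumps({
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "user_id": user_id,   # illustrative only; verify identity in real systems
        "payload": payload,
    })

def unwrap(raw):
    """Consumer side: refuse messages that arrive without their context."""
    envelope = json.loads(raw)
    missing = REQUIRED_FIELDS - envelope.keys()
    if missing:
        raise ValueError(f"message lost context fields: {sorted(missing)}")
    return envelope

raw = wrap({"drink": "latte"}, user_id="mary", correlation_id="req-123")
envelope = unwrap(raw)
print(envelope["correlation_id"], envelope["user_id"], envelope["payload"])
```

Logging the correlation ID on both sides of the queue is what lets you follow one request through the whole system when something goes wrong.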

Queue backlogs - One of the worst live-site incidents I was involved with was due to an improperly bounded queue. As discussed above, queues allow us to smooth out peak load – but they aren't a silver bullet for scale issues. A large, sustained load beyond what the back-end can handle will result in a queue growing and growing. When designing a system with queues, we need to put a bound on the queue size, ensure our front-ends gracefully handle 'queue full' messages, and test our back-ends to know how long it will take to service a full queue. If you don't do this, make sure you are good at apologizing – like me, you may be sent on an apology tour and be asked to explain to each of your customers why the system went down.
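Here is a minimal sketch of a bounded queue whose front-end sheds load explicitly, using the standard-library queue.Queue with maxsize. MAX_BACKLOG, accept, and seconds_to_drain are hypothetical names, and the numbers are placeholders to be replaced with measured values.

```python
import queue

MAX_BACKLOG = 3  # assumed bound; derive the real one from measured throughput

backlog = queue.Queue(maxsize=MAX_BACKLOG)

def accept(request):
    """Front-end: enqueue if there is room, otherwise say so immediately."""
    try:
        backlog.put_nowait(request)
        return "accepted"
    except queue.Full:
        return "rejected: queue full, retry later"

def seconds_to_drain(depth, requests_per_second):
    """Back-of-the-envelope time for the back-end to clear a backlog."""
    return depth / requests_per_second

results = [accept(f"req-{i}") for i in range(5)]
print(results)
print(seconds_to_drain(backlog.qsize(), requests_per_second=1.5))
```

The last two requests are turned away at the door instead of joining a line that the back-end can never clear.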

Messages get lost and/or duplicated - This subject probably requires its own article. All too often, I find engineers believe message queues are infallible. They point to the documentation that states that order is guaranteed and that messages are never lost. But even if this were true in principle – it rarely is in practice. While the queue itself may not mess up, you can bet your bonus that something else will go wrong (either a coding bug in the producer/consumer, a disaster recovery drill, or an overly tired SRE manually deleting messages from the queue at 2 am in a desperate attempt to get the service back online). These practical realities shouldn't be ignored – and we need to build functionality into our system to detect and repair inconsistencies.
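Two small defences can be sketched in a few lines: a consumer that ignores messages it has already applied, and a reconciliation check that compares what was sent with what was processed. The message shape (an "id" and an "amount") and all names here are assumptions for illustration; in production the set of processed IDs would live in durable storage.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by a producer-assigned ID."""

    def __init__(self):
        self.processed_ids = set()   # in-memory here; durable in a real service
        self.balance = 0

    def handle(self, message):
        if message["id"] in self.processed_ids:
            return False             # duplicate delivery: safely ignored
        self.balance += message["amount"]
        self.processed_ids.add(message["id"])
        return True

def find_missing(sent_ids, processed_ids):
    """Reconciliation: which messages were sent but never processed?"""
    return sorted(set(sent_ids) - set(processed_ids))

sent_ids = [1, 2, 3, 4]
# Simulated delivery: message 2 arrives twice and message 3 never arrives.
delivered = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 10},
    {"id": 2, "amount": 10},
    {"id": 4, "amount": 10},
]

consumer = IdempotentConsumer()
applied = [consumer.handle(message) for message in delivered]
missing = find_missing(sent_ids, consumer.processed_ids)
print(applied, consumer.balance, missing)
```

The duplicate does not double-count, and the reconciliation pass flags message 3 so it can be replayed or investigated.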

So, the next time you think through the design of a component you are developing for work (or a systems design interview), make sure you think through the pros and cons. If it still makes sense to use a queue, take the time to think through the challenges it introduces.

Be Happy!

Like this post? Please consider sharing, checking out my other articles, and subscribing to my weekly Flegg’s Follies newsletter for more articles on software engineering and careers in tech.

Footnotes:

  1. I haven't seen it, but I suspect some systems design interview book or "interview guru" YouTuber is telling everyone to include queues in their Google Systems Designs to prove they understand how to make things scale. If you come across it – please include it in the comments (although doing so is probably a big warning sign to your boss that you have been reviewing a lot of interview prep material).
  2. When Azure started rolling out internally at Microsoft as the way of writing services, I was worried I would have to relearn how to build large-scale services. I distinctly remember feeling excited when I realized web roles, worker roles, and storage queues were just fancy new names for components I had been using for a decade.
  3. In retrospect, I am not sure how well the pneumatic tube system worked to dissuade bank robbers. Is a worker in the back of the building going to sit idly by while a bank robber threatens a cashier in front? That said – I have found evidence it worked in at least one case.

Please note that the opinions stated here are my own, not those of my company.

Diman Todorov

Senior Software Engineer @ Microsoft | PhD, Computational Statistics

2 years ago

Brett, I am curious, where do you stand on the whole "eventual consistency" thing? On one end of the spectrum we have the people who say nobody cares if some of your friends see the picture of a cat 12h late. On the other hand, I recently learned that if your digital audio system drops a packet, the $100k speakers that are playing audio to an audience of 50k people will not only permanently become not-speakers but might give your audience hearing damage.
