Making gamers happier, one nudge at a time
A Nudge Visualization

Making gamers happier, one nudge at a time

Queueing Theory: A Brief introduction to Nudge

This article was original published in my personal blog.


I've always been intrigued by queueing theory. We are intuitively familiar with queueing in our day-to-day life, like waiting in line at the bank or in the car waiting for our turn to pass a four-way stop.

I had the opportunity to focus on Queueing for Luna, Amazon's Cloud Gaming Platform. If you've ever played online games, you're familiar with waiting in a queue to find a match or for the game to launch.

Waiting can be frustrating and leads to disengaged gamers. That's why reducing the time it takes to get to the fun stuff is essential.

@Benji-Sales Overwatch 2 release: In Queue: 50,000 Players ahead of you

I ran into this video from SIGMETRICS 2021 about Nudge while exploring “Time in Queue” optimizations.

Nudge is a job scheduling policy that is an optimization on top of First-Come-First-Serve (FCFS) or First-In-First-Out (FIFO). I was immediately intrigued.

FIFO/FCFS is a widely used scheduling policy in the industry due to its performance and simplicity. It is available in most clouds, like AWS SQS, Azure Queues , and RabbitMQ.

FIFO does not require knowledge about the size or complexity of processing the items in the queue. It has minimal overhead while ensuring that no job is left waiting indefinitely, making it a good choice for ensuring fairness among all jobs.

Nudge

Here’s how Nudge works:

When a small job arrives in the queue, the scheduler checks for the job immediately ahead of it. If that job is bigger, we swap their position in the queue and prioritize the smaller job. There’s a limit. A bigger job can only be “jumped” once.

This is a Nudge Scheduling Queue animation I built, it has only two types of jobs, small and big. Each job shows its “arrival timestamp” (in mm:ss.ms), making it easier to confirm when a job is older than another job. Finally, jobs that arrive closely together tend to have a similar color. That way, when a job “jumps” over another, it looks out of place:

Here is the interactive version of the animation, and the source code.

Comparing Nudge and FIFO

I wanted to see how Nudge compares to FIFO in terms of reducing wait times for most users while still being fair.

I made a simulation that models a more complex gaming scenario to find out. The goal is to see if Nudge’s benefits in getting gamers to play the game faster (the fun part) are significant enough to justify a complete implementation.

The simulation has four games TunicHollow KnightLost Ark, and World of Warcraft. It models traffic with imaginary values where each “Game Activity” represents someone that plays the game on some imaginary platform, so they download, install and launch the game.

Nudge uses “job size” to decide when to nudge. I used the size on disk for each game as a proxy metric for the time it’ll take to finish each one of the steps until we have the game running.

In queueing theory, the size usually represents the time it takes to dequeue and successfully process a job.

Rules of the Game

In the simulation, small games are 10GB, and big games are 50GB or more. There are four games:

I used the concept of “Game Runners” that download , install and launch the game (ignoring launch time to keep it simple). A typical run with a thousand gamers, would get a distribution similar to this:

Game Distribution
   "Tunic": 556,
   "World of Warcraft": 186,
   "Lost Ark": 168,
   "Hollow Knight": 90:

This means gamers played Tunic about 6x more than Hollow Knight, and Lost Ark and World of Warcraft were approximately 2x more popular than Hollow Knight.

The “Game Activity” keeps track of the activity’s metadata:

  • When the user “clicks play” to launch the game.
  • The time it took to find a runner for the game.
  • The time it took to download and install the game.
  • If the activity was nudged or caused a nudge.

Each activity launches simultaneously on a FIFO scheduler and the Nudge scheduler. Both use the same list of games and the same configuration values.

Initial Results

I ran the simulation multiple times with 15,000 gamers using the same distribution as above. Here are the summary results of one of the runs:

No alt text provided for this image

It didn't show a big difference for all games’ average, best or worst time in queue. Even when the Nudge scheduler "nudged" 2566 times.

The average time in queue for Tunic (the smallest and most popular game) is 1.751s (Nudge) and 1.752 (FIFO). World of Warcraft (biggest game) is tied in both at 30.554s.

Different runs showed slightly different results but nothing widely different. Two things that I was expecting to see:

  • Much longer times for WoW since it’s technically de-prioritizing it.
  • Slightly shorter time for Tunic on the worst time in queue.

Plotting Game Activity

I wanted to visualize the Game Activity metadata to understand what was happening, and things got interesting again:

No alt text provided for this image
Time to play in Seconds — Nudge (blue) vs FIFO (green)

The chart above reveals that Nudge (blue line) generally results in shorter wait times (or ‘time to play’) than FIFO ( green line).

No alt text provided for this image
Time to play in Seconds per game — Nudge
No alt text provided for this image
Time to play in Seconds per game — FIFO

The scatter plots above compare the performance of Nudge and FIFO per game. Each game is represented by a different color, and the circles’ width corresponds to the game’s size. It shows the relationship between each player’s arrival time and their time to play.

Nudge effectively reduces wait times for smaller gameswith little to no impact on larger games. In fact, Nudge has a better ‘tail’ (lower wait times) for all our gamers.

How exciting!

Time to play distributions

This was the best part for me, it clearly shows what I was expecting to see in the summary results:

No alt text provided for this image
Distribution of time to play, broken down by game — Nudge

This paints a better picture of how Nudge works and gives us a better understanding of how it affects our scheduler’s “long tail” behavior.

Understanding Nudge

Let’s break it down only with Tunic (in red) and World of Warcraft (in green):

No alt text provided for this image
Frequency Distribution for FIFO

No alt text provided for this image
Frequency Distribution for Nudge

The bars represent the frequency of gamers’ wait times. As we run out of resources, the wait time increases, thus the wait time increases over time.


No alt text provided for this image
The Nudge effect in "Time to play"

For Tunic, Nudge significantly increases the number of players who waited less than 200ms to play (1).

In contrast, for World of Warcraft, the number of players waiting less than 100ms is lower when using Nudge (2), and decreases further between the 100–200ms range. It's pushing the bulk of waiting time to the 200–400ms range but reducing the total number of players who needed more than 800ms (3).

The data shows that the distribution tails (representing the gamers with the worst wait times, like the one on the initial tweet) are thinner with Nudge (4).

Nudge generally performs better, reducing the worst wait times for all players.

Outro

I’m confident that Nudge generally improves waiting times for this use case and is especially effective at reducing “tail latency” (the longest waiting times). From that perspective, it makes sense to experiment further.

However, implementing Nudge at scale might be a different feat altogether. Existing queue services like SQS do not provide a way to “bring your own scheduler”. Implementing Nudge would require adding an abstraction layer on top of the queueing system or replacing it entirely.

I think there are two abstractions that are not terribly complicated:

  • Implementing a database-backed queue like this one.
  • A simplified version of a database-backed queue, using a table to track the state of the jobs. Query the table to implement “peeking”, then implement Nudging by “tomb-stoning” the nudged entry on the database so it’s a no-op, and insert it again with a new id after the item it just arrived to represent the state after nudging.

Then there’s RabbitMQ. It has support for plugins, and there are existing plugins that implement prioritization, building a nudge plugin might be another solution.

The next steps for me are exploring these abstractions.


Stay tuned, and thanks for reading! Now, coffee time. Enjoy!


Learning More ??

  • Dive deeper into Nudge with the video introduction, the full paper, and proof that even with weaker conditions Nudge is still better than FIFO for light-tailed job distributions (like the one shown above).
  • Explore how distributed systems rely on queueing to handle high loads and spikes in traffic on Amazon Builder’s Library.
  • Learn about the Queue-Based Load Leveling pattern on Microsoft’s Cloud Design Patterns.
  • Read about load balancing, queueing and the power of two random choices on Nginx’s blog.




Payam Azadi

Engineering Leader | Inclusivity Leader

2 年

This is pretty interesting Eduardo. I’ve thought about what is apparently nudge, but not sure in what scenarios. I think the common one is, there’s one person being checked out at the grocery store, one person in queue with 100 items, and another person behind them with one item. If the 100-item-person is aware and thoughtful and kind, they would let the 1-item-person go ahead. But I didn’t completely understand the simulation you did or the results, probably because I don’t understand the problem statement. Why do gaming queues (eg for Overwatch) get so big? Is it that there isn’t enough gameplay server capacity to host the games? Is it that the matchmaking engine (eg to match players of similar skill) is not able to respond fast enough to the load? Something else? I do love the concept, as I alluded, of smaller jobs bumping the queue up to a limit so that the system can get done what it can when it can. The obvious risk (without constraining number of bumps) is the big jobs never get done because there are enough small jobs to occupy all capacity.

回复

要查看或添加评论,请登录

Eduardo Romero的更多文章

社区洞察

其他会员也浏览了