How Cool is NVMe? - Part 3 - No Waiting in Queues

OK, so I think I’ve covered the throughput benefits well enough, but as I said in my first post, the most noticeable benefit customers see when moving from mechanical disk to solid state media is a 10x – 100x improvement in latency. Again, when we look at the hype surrounding NVMe, we hear a lot about the large improvements in latency that come from the protocol itself. The following graph from Intel demonstrates the scale of that benefit.

Image from communities.intel.com.

I think we can all agree that mechanical disks are slow, but where it gets interesting is when we compare the controller and software latency of SAS (the light green and purple bits on the graph) vs NVMe (the tiny little dark green bit on the bottom row). Based on the graph above, SAS carries about 25 microseconds of protocol latency vs what looks to be about five microseconds for NVMe. That works out to roughly a 5x improvement, which is awesome and almost justifies all the hype that surrounds NVMe.

Unfortunately, when you then throw in an extra 50 microseconds of drive/media latency, that gives you roughly an 80 microsecond vs 60 microsecond comparison .. which works out to about a 30% improvement .. useful, but not nearly as compelling as 5x. Then on top of that, once you’ve run this through the software stack of a modern all flash array, where you’re usually seeing around 200 – 700 microseconds of latency, a 20 microsecond improvement really only translates to around 5% – 10% better performance, which is useful, but hardly revolutionary.
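If you want to check that arithmetic yourself, here’s a minimal, purely illustrative sketch using the rough figures quoted above (the 400 microsecond array-stack number is just the midpoint of the 200 – 700 range, an assumption for illustration):

```python
# Rough, illustrative numbers taken from the discussion above (microseconds).
PROTOCOL_US = {"SAS": 25.0, "NVMe": 5.0}
MEDIA_US = 50.0          # drive/media latency quoted above
ARRAY_STACK_US = 400.0   # assumed midpoint of the 200-700 us array software range

def reduction(sas_us: float, nvme_us: float) -> float:
    """Percentage latency reduction going from SAS to NVMe."""
    return (sas_us - nvme_us) / sas_us * 100

# Protocol only: 25 us vs 5 us (the 5x headline number)
print(f"protocol only      : {reduction(PROTOCOL_US['SAS'], PROTOCOL_US['NVMe']):.0f}% lower latency")

# Protocol + media: roughly 75 us vs 55 us
sas = PROTOCOL_US["SAS"] + MEDIA_US
nvme = PROTOCOL_US["NVMe"] + MEDIA_US
print(f"plus media         : {reduction(sas, nvme):.0f}% lower latency")

# Protocol + media + all-flash-array software stack
print(f"plus array software: {reduction(sas + ARRAY_STACK_US, nvme + ARRAY_STACK_US):.0f}% lower latency")
```

With those assumed numbers the protocol-only comparison looks spectacular, but the advantage shrinks to a few percent once the rest of the stack is included, which is the whole point of the paragraph above.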

Queues, Queue Depths, Little's Law and Beer

OK, so latency is better but not mind-blowing. How about the amazing increase in queue depth? That’s got to make a BIG difference, right?

It can, BUT before I go on, it’s worth explaining why queues and queue depths are potentially important. Outside of storage performance experts, most people I speak to have no real appreciation for queue depths and queuing theory, so allow me to indulge in just a little theory and math. There is this thing called “Little’s Law” which says that your throughput (which for small block random accesses to media is measured in input and output operations per second, aka IOPS on a storage array) can be determined by how many things you do at the same time, divided by how long it takes to do those things.

In other words, I can increase my throughput by doing lots of things in parallel, or by reducing the time it takes to do those things, or I could do both. Mathematically it is expressed using the following equation

Throughput = Lq / Wq

(requests in flight, Lq, divided by the time each request spends in the system, Wq .. this is just Little’s Law rearranged)

It might seem obvious, but proving it was really, really clever. If you’re still not convinced, or don’t really get the implications, here is a concrete example (with a quick code sketch after the list):

  • If I have one “queue” and each request in the queue takes 1 millisecond (a thousandth of a second) to process, then I can process exactly 1000 requests a second
  • If I have two queues and each request still takes one millisecond then I can process 2000 requests a second
  • If I have two queues and each request takes half a millisecond then I can process 4000 requests a second
  • If I have four queues and each request takes half a millisecond I can process 8000 requests a second
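
Here’s that same arithmetic as a tiny, purely illustrative Python sketch:

```python
def iops(outstanding_requests: int, latency_seconds: float) -> float:
    """Little's Law rearranged: throughput = requests in flight / time per request."""
    return outstanding_requests / latency_seconds

# The four bullet-point scenarios above
for requests, latency in [(1, 0.001), (2, 0.001), (2, 0.0005), (4, 0.0005)]:
    print(f"{requests} in flight at {latency * 1000:g} ms each -> {iops(requests, latency):,.0f} requests/second")
```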

If that sounds a little esoteric, let me show you how queueing theory applies to everyday life.

Beer Related Latency

Imagine you're planning a party, and someone says “hey, can you go out and grab some beer for the party?”, so you head out to the nearest beer vendor, grab a six pack, and come back expecting to relax with a frosty beverage. Then your so-called friends say “dude, we need more than one six pack”. So you go back to the beer emporium and get a carton this time, and when you get back they say “we might need some light beer too”. This is what happens when you have a single queue with one command and a high service time .. you get frustrated and want to bang your head on a wall. So to save your head, the wall, and your relationship with your friends, you calmly say “let’s make a list” as you plan to go out, grab a bunch of different things, and take them back to the car (multiple commands, one queue). Then you remember that your friends, who are supposed to be planning the party, are making you do all the work, so you take over and create a bunch of lists (multiple queues with multiple commands) and start organising things.

Hurfey ... here’s a list of party supplies,

Snuffy … get the drinks including mixers from this list and don’t forget the tequila,

Biggles … here’s my address book, ring around to make sure everyone is coming,

Corey .. I mean Dhruv … walk up and down the halls and warn the neighbours.

Meanwhile you can sit back in the sure knowledge that everything will get done quickly, the party will be a success, and you can finally relax with your frosty beverage.

Device Queues and Taxi Ranks at the Airport

Data storage devices are kind of similar, because inside of them is something a bit like the party planner I described above, but in my mind it’s more like the guy who manages the taxi queue at the airport. Let’s call him an intelligent queue manager. He takes people from the head of the queue and assigns them to a taxi bay. Then a bunch of taxis come up and carry the people away, usually a few people at a time. As the taxis pick up people, he grabs more people from the queue and assigns them to more taxi slots.

When this runs well, the queue moves pretty quickly. If there aren’t enough people in the queue to keep the taxi slots filled up, throughput drops. When the queue gets too long, the time between making the request (joining the back of the queue) and getting into a cab increases .. that is what latency is, and why everyone wants the lowest latency possible, because waiting sucks. There is even an equivalent of the moment when the queue gets really full and the queue manager calls out “who else needs to go to the north shore?” and bundles a bunch of people into the same taxi.

That queue is a bit like the queue in a SAS device: there are only one or two queues, and each queue has a depth of about 250 commands. But wouldn’t it be great if you didn’t have to queue at all? What about the equivalent of everyone coming out of the airport, pulling out their phone and hailing an Uber or Lyft or Go-Jek? In effect you’ve now got tens of thousands of queues, which means there’s no waiting ... right? Well, if your experience matches mine, it didn’t make that much of a difference, because the bottleneck wasn’t the queue, it was the roads in and around the airport.
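
To put a rough number on that intuition, here’s a deliberately simplistic sketch: if the shared back end (the “roads”) can only complete a fixed number of operations per second, adding front-end queues stops helping as soon as the back end saturates. All the figures here are made up for illustration.

```python
# Illustrative only: a shared back end ("the roads") with a fixed capacity,
# fed by a varying number of front-end queues ("the ride-hailing apps").
BACKEND_CAPACITY_IOPS = 500_000   # assumed capacity of the shared back end
PER_QUEUE_OFFERED_IOPS = 10_000   # assumed load each queue tries to push

for queues in (1, 2, 64, 65_535):
    offered = queues * PER_QUEUE_OFFERED_IOPS
    achieved = min(offered, BACKEND_CAPACITY_IOPS)
    print(f"{queues:>6} queues: offered {offered:>13,} IOPS, achieved {achieved:>9,} IOPS")
```

Past the point where the back end saturates, the extra queues just hold waiting work; they don’t move any more of it.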

Device Queues SAS vs NVMe

Now, as I said before, with SAS you’re limited to about 256 simultaneous commands per queue that you can send through to a device, and there are at best two queues. Whereas with NVMe, that queue depth increases to an amazing 65,000 or so simultaneous commands .. an improvement of over 25,000%, which MUST have a major impact, right? But wait, you can also have around 65,000 queues too .. that’s 4,225,000,000 commands in flight … AMAZING. So that has to have an even more HUMONGOUS impact !!!!!!
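
For the record, here’s the back-of-the-envelope arithmetic behind those headline numbers, using the same rough figures as above rather than the exact specification limits:

```python
# Back-of-the-envelope outstanding-command capacity, using the rough
# figures quoted above rather than the exact specification limits.
sas_queues, sas_queue_depth = 2, 256
nvme_queues, nvme_queue_depth = 65_000, 65_000

print(f"SAS : {sas_queues * sas_queue_depth:,} commands in flight")
print(f"NVMe: {nvme_queues * nvme_queue_depth:,} commands in flight")
print(f"per-queue depth increase: {nvme_queue_depth / sas_queue_depth:,.0%}")
```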

Well, it might, one day, but today, not so much. If you look at current benchmark testing of NVMe devices you’ll see that in a lot of cases they hit their best latency and throughput numbers at a queue depth of between sixteen (16) and thirty-two (32) simultaneous commands, which is well within the capabilities of a single SAS queue. Even Intel’s own drive testing bears this out (though this graph doesn’t show latency).

OK, but maybe that’s just because NAND can’t keep up, or because traditional SSDs are still built with SCSI / SAS queue depths in mind, and next generation media like Optane will be where we see the big differences. Unfortunately that doesn’t seem to be the case either, if the Intel Optane results are anything to go by. This next-gen device seems to max out at a queue depth of about 12, which is something Intel seems to be quite proud of, so it’s no accident.

The other thing you’ll see, if you dig around a bit in the discussions on queue depths for SSD and NVMe devices, is that a good number of well qualified commentators say the reduced latency of flash means your device queues never really fill up in the first place, outside of one atypical workload (the synthetic HammerDB TPC-H benchmark).

Even big data workload generators like Terasort rarely push enough traffic onto an SSD to build up more than about 80 queued commands.

The benefits of large queue depths in SSDs

The big benefit of bigger queue depths is that they help to spread the load across more NAND chips (dies) .. as Intel says here.

"as the queue depth increases, an increasing number of concurrent Flash components are utilized, thus increasing the performance. However, this increase is not linear with queue depth because random access will not distribute perfectly across the multiple dies in the SSD. As queue depth increases, there are more cases of commands landing on the same Flash component. As a result, the performance asymptotically approaches the saturation as the queue depth gets large" 

So unless you're building your flash devices out of hundreds or even thousands of small, fast, expensive chips, you probably won't see the benefits of the massive number of queues and commands possible with NVMe. Even the latest generation of Intel NVMe attached NAND SSDs recommend a maximum queue depth of about 255, and as I said before, 255 also happens to be about the top number for a current generation SAS attached device. So while the astounding increase in queued commands available with NVMe certainly helps, it's not quite the OMG factor today that a lot of the hype merchants make it out to be. Even if they do build their own custom SSDs from other people's chips, I strongly doubt they're packing more than a hundred or so NAND chips into each device.

Further reading

If you’re really interested, there is a pretty good discussion of the impact of the increased queue depths that come as part of the NVMe interface specification in Computer Weekly here: https://www.computerweekly.com/feature/Storage-101-Queue-depth-NVMe-and-the-array-controller which I might summarise as “Yep, it’s amazing, and it means you don’t even really need to think about tuning for queue depths any more; they give you so much headroom that you actually need to do less engineering to get the best out of your device, not more, and in any case, as amazing as they are, the device queue depth isn’t really relevant if the storage controller is the bottleneck”.

I'm here all week, try the fish!

Controller bottlenecks and software latency are exactly what I’d like to talk about next, and that will probably involve more than a little snark hunting :-)

Other posts in this series

