How Cool is NVMe? - Part 3 - No Waiting in Queues

OK, so I think I’ve covered the throughput benefits well enough, but as I said in my first post, the most noticeable benefit customers see when moving from mechanical disk to solid state media is a 10x – 100x improvement in latency. Again, when we look at the hype surrounding NVMe, we hear a lot about the large improvements in latency that come from the protocol itself. The following graph from Intel demonstrates the scale of that benefit.

Image from communities.intel.com.

I think we can all agree that mechanical disks are slow, but where it gets interesting is when we compare the controller and software latency of SAS (the light green and purple bits on the graph) vs NVMe (the tiny little dark green bit on the bottom row). Based on the graph above, SAS carries about 25 microseconds of protocol latency vs what looks to be about five microseconds for NVMe. That works out to roughly a 5x improvement, which is awesome and almost justifies all the hype that surrounds NVMe.

Unfortunately, when you then throw in an extra 50 microseconds of drive/media latency, that gives you roughly an 80 microsecond vs 60 microsecond comparison .. which works out to about a 30% improvement .. useful, but not nearly as compelling as 5x. Then on top of that, once you’ve run this through the software stack of a modern all flash array, where you’re usually seeing around 200 – 700 microseconds of latency, a 20 microsecond improvement really only translates to around 5% – 10% better performance, which is useful, but hardly revolutionary.
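If you want to check that arithmetic yourself, here’s a minimal, purely illustrative sketch using the rough figures quoted above (the 400 microsecond array-stack number is just the midpoint of the 200 – 700 range, an assumption for illustration):

```python
# Rough, illustrative numbers taken from the discussion above (microseconds).
PROTOCOL_US = {"SAS": 25.0, "NVMe": 5.0}
MEDIA_US = 50.0          # drive/media latency quoted above
ARRAY_STACK_US = 400.0   # assumed midpoint of the 200-700 us array software range

def reduction(sas_us: float, nvme_us: float) -> float:
    """Percentage latency reduction going from SAS to NVMe."""
    return (sas_us - nvme_us) / sas_us * 100

# Protocol only: 25 us vs 5 us (the 5x headline number)
print(f"protocol only      : {reduction(PROTOCOL_US['SAS'], PROTOCOL_US['NVMe']):.0f}% lower latency")

# Protocol + media: roughly 75 us vs 55 us
sas = PROTOCOL_US["SAS"] + MEDIA_US
nvme = PROTOCOL_US["NVMe"] + MEDIA_US
print(f"plus media         : {reduction(sas, nvme):.0f}% lower latency")

# Protocol + media + all-flash-array software stack
print(f"plus array software: {reduction(sas + ARRAY_STACK_US, nvme + ARRAY_STACK_US):.0f}% lower latency")
```

With those assumed numbers the protocol-only comparison looks spectacular, but the advantage shrinks to a few percent once the rest of the stack is included, which is the whole point of the paragraph above.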

Queues, Queue Depths, Little's Law and Beer

OK, so latency is better but not mind-blowing. How about the amazing increase in queue depth? That’s got to make a BIG difference, right?

It can, BUT before I go on, it’s worth explaining why queues and queue depths are potentially important. Outside of storage performance experts, most people I speak to have no real appreciation for queue depths and queuing theory, so allow me to indulge in just a little theory and math. There is this thing called “Little’s Law” which says that your throughput (which for small block random accesses to media is measured in input and output operations per second, aka IOPS on a storage array) can be determined by how many things you do at the same time, divided by how long it takes to do those things.

In other words, I can increase my throughput by doing lots of things in parallel, or by reducing the time it takes to do those things, or I could do both. Mathematically it is expressed using the following equation

Throughput = Lq / Wq

(requests in flight, Lq, divided by the time each request spends in the system, Wq .. this is just Little’s Law rearranged)

It might seem obvious, but proving it was really, really clever. If you’re still not convinced, or don’t really get the implications, here is a concrete example (with a quick code sketch after the list):

  • If I have one “queue” and each request in the queue takes 1 millisecond (a thousandth of a second) to process, then I can process exactly 1000 requests a second
  • If I have two queues and each request still takes one millisecond then I can process 2000 requests a second
  • If I have two queues and each request takes half a millisecond then I can process 4000 requests a second
  • If I have four queues and each request takes half a millisecond I can process 8000 requests a second
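
Here’s that same arithmetic as a tiny, purely illustrative Python sketch:

```python
def iops(outstanding_requests: int, latency_seconds: float) -> float:
    """Little's Law rearranged: throughput = requests in flight / time per request."""
    return outstanding_requests / latency_seconds

# The four bullet-point scenarios above
for requests, latency in [(1, 0.001), (2, 0.001), (2, 0.0005), (4, 0.0005)]:
    print(f"{requests} in flight at {latency * 1000:g} ms each -> {iops(requests, latency):,.0f} requests/second")
```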

If that sounds a little esoteric, let me show you how queueing theory applies to everyday life.

Beer Related Latency

Imagine you're planning a party, and someone says “hey, can you go out and grab some beer for the party?”, so you head out to the nearest beer vendor, grab a six pack, and come back expecting to relax with a frosty beverage. Then your so-called friends say “dude, we need more than one six pack”. So you go back to the beer emporium and get a carton this time, and when you get back they say “we might need some light beer too”. This is what happens when you have a single queue with one command and a high service time .. you get frustrated and want to bang your head on a wall. So to save your head, the wall, and your relationship with your friends, you calmly say “let’s make a list” as you plan to go out, grab a bunch of different things, and take them back to the car (multiple commands, one queue). Then you remember that your friends, who are supposed to be planning the party, are making you do all the work, so you take over and create a bunch of lists (multiple queues with multiple commands) and start organising things.

Hurfey ... here’s a list of party supplies,

Snuffy … get the drinks including mixers from this list and don’t forget the tequila,

Biggles … here’s my address book, ring around to make sure everyone is coming,

Corey .. I mean Dhruv … walk up and down the halls and warn the neighbours.

Meanwhile you can sit back in the sure knowledge that everything will get done quickly, the party will be a success, and you can finally relax with your frosty beverage.

Device Queues and Taxi Ranks at the Airport

Data storage devices are kind of similar, because inside of them is something a bit like the party planner I described above, but in my mind it’s more like the guy who manages the taxi queue at the airport. Let’s call him an intelligent queue manager. He takes people from the head of the queue and assigns them to a taxi bay. Then a bunch of taxis come up and carry the people away, usually a few people at a time. As the taxis pick up people, he grabs more people from the queue and assigns them to more taxi slots.

When this runs well, the queue moves pretty quickly. If there aren’t enough people in the queue to keep the taxi slots filled up, throughput drops. When the queue gets too long, the time between making the request (joining the back of the queue) and getting into a cab increases .. that is what latency is, and why everyone wants the lowest latency possible, because waiting sucks. There is even an equivalent of the moment when the queue gets really full and the queue manager calls out “who else needs to go to the north shore?” and bundles a bunch of people into the same taxi.

That queue is a bit like the queue in a SAS device: there are only one or two queues, and each queue has a depth of about 250 commands. But wouldn’t it be great if you didn’t have to queue at all? What about the equivalent of everyone coming out of the airport, pulling out their phone and hailing an Uber or Lyft or Go-Jek? In effect you’ve now got tens of thousands of queues, which means there’s no waiting ... right? Well, if your experience matches mine, it didn’t make that much of a difference, because the bottleneck wasn’t the queue, it was the roads in and around the airport.
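
To put a rough number on that intuition, here’s a deliberately simplistic sketch: if the shared back end (the “roads”) can only complete a fixed number of operations per second, adding front-end queues stops helping as soon as the back end saturates. All the figures here are made up for illustration.

```python
# Illustrative only: a shared back end ("the roads") with a fixed capacity,
# fed by a varying number of front-end queues ("the ride-hailing apps").
BACKEND_CAPACITY_IOPS = 500_000   # assumed capacity of the shared back end
PER_QUEUE_OFFERED_IOPS = 10_000   # assumed load each queue tries to push

for queues in (1, 2, 64, 65_535):
    offered = queues * PER_QUEUE_OFFERED_IOPS
    achieved = min(offered, BACKEND_CAPACITY_IOPS)
    print(f"{queues:>6} queues: offered {offered:>13,} IOPS, achieved {achieved:>9,} IOPS")
```

Past the point where the back end saturates, the extra queues just hold waiting work; they don’t move any more of it.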

Device Queues SAS vs NVMe

Now, as I said before, with SAS you’re limited to about 256 simultaneous commands per queue that you can send through to a device, and there are at best two queues. Whereas with NVMe, that queue depth increases to an amazing 65,000 or so simultaneous commands .. an improvement of over 25,000%, which MUST have a major impact, right? But wait, you can also have around 65,000 queues too .. that’s 4,225,000,000 commands in flight … AMAZING. So that has to have an even more HUMONGOUS impact !!!!!!
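
For the record, here’s the back-of-the-envelope arithmetic behind those headline numbers, using the same rough figures as above rather than the exact specification limits:

```python
# Back-of-the-envelope outstanding-command capacity, using the rough
# figures quoted above rather than the exact specification limits.
sas_queues, sas_queue_depth = 2, 256
nvme_queues, nvme_queue_depth = 65_000, 65_000

print(f"SAS : {sas_queues * sas_queue_depth:,} commands in flight")
print(f"NVMe: {nvme_queues * nvme_queue_depth:,} commands in flight")
print(f"per-queue depth increase: {nvme_queue_depth / sas_queue_depth:,.0%}")
```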

Well, it might, one day, but today, not so much. If you look at current benchmark testing of NVMe devices you’ll see that in a lot of cases they hit their best latency and throughput numbers at a queue depth of between sixteen (16) and thirty-two (32) simultaneous commands, which is well within the capabilities of a single SAS queue. Even Intel’s own drive testing bears this out (though this graph doesn’t show latency).

OK, but maybe that’s just because NAND can’t keep up, or because traditional SSDs are still built with SCSI / SAS queue depths in mind, and next generation media like Optane will be where we see the big differences. Unfortunately that doesn’t seem to be the case either, if the Intel Optane results are anything to go by. This next-gen device seems to max out at a queue depth of about 12, which is something Intel seems to be quite proud of, so it’s no accident.

The other thing you’ll see, if you dig around a bit in the discussions on queue depths for SSD and NVMe devices, is that a good number of well qualified commentators say the reduced latency of flash means your device queues never really fill up in the first place, outside of one atypical workload (the synthetic HammerDB TPC-H benchmark).

Even big data workload generators like Terasort rarely push enough traffic onto an SSD to build up more than about 80 queued commands.

The benefits of large queue depths in SSDs

The big benefit of bigger queue depths is that they help to spread the load across more NAND chips (dies) .. as Intel says here.

"as the queue depth increases, an increasing number of concurrent Flash components are utilized, thus increasing the performance. However, this increase is not linear with queue depth because random access will not distribute perfectly across the multiple dies in the SSD. As queue depth increases, there are more cases of commands landing on the same Flash component. As a result, the performance asymptotically approaches the saturation as the queue depth gets large" 

So unless you're building your flash devices out of hundreds or even thousands of small, fast, expensive chips, you probably won't see the benefits of the massive number of queues and commands possible with NVMe. Even the latest generation of Intel NVMe attached NAND SSDs recommend a maximum queue depth of about 255, and as I said before, 255 also happens to be about the top number for a current generation SAS attached device. So while the astounding increase in queued commands available with NVMe certainly helps, it's not quite the OMG factor today that a lot of the hype merchants make it out to be. Even if they do build their own custom SSDs from other people's chips, I strongly doubt they're packing more than a hundred or so NAND chips into each device.

Further reading

If you’re really interested, there is a pretty good discussion of the impact of the increased queue depths that come as part of the NVMe interface specification in Computer Weekly here: https://www.computerweekly.com/feature/Storage-101-Queue-depth-NVMe-and-the-array-controller which I might summarise as “Yep, it’s amazing, and it means you don’t even really need to think about tuning for queue depths any more; they give you so much headroom that you actually need to do less engineering to get the best out of your device, not more, and in any case, as amazing as they are, the device queue depth isn’t really relevant if the storage controller is the bottleneck”.

I'm here all week, try the fish!

Controller bottlenecks and software latency are exactly what I’d like to talk about next, and that will probably involve more than a little snark hunting :-)

Other posts in this series

