How cool is NVMe ? – Part 4 – CPU and Software Efficiency
Intro
So, if you’ve read the other posts in this series, you’ll have seen that in a modern all-flash array, the differences between NVMe-attached devices and the more typical dual-ported SAS drives are useful incremental improvements, but not exactly the order-of-magnitude wow factor we saw when moving from magnetic media to SSD.
But when I talk to real storage engineers, the ones who actually invent and build the products we install in our datacentres, they are genuinely enthusiastic about NVMe, and the thing they seem most excited about is the software stack that drives it.
Now, I’ve got access to some pretty deep documentation on how this all works, and to be honest I don’t really grok most of it, but the main things I keep seeing are:
- Much less code needed to move data between the device and the CPU, plus some special handling that appears to deliver significant (order of magnitude) improvements in handling small-block I/O
- More intelligent queues
- Better device discovery
- Very small, single-digit microsecond differences in latency between I/O on the local PCIe bus and the same I/O run over Fibre Channel or various fabrics based on RDMA (Remote Direct Memory Access) such as InfiniBand or 10/25/100Gbit Ethernet.
But what about the NVMe SSDs themselves? Aren’t they exciting and inherently better than SAS SSDs, and isn’t that going to make the biggest difference?
Firstly, the snarky pedant in me wants to say that NVMe is a communication protocol, not a media type, and that there is no such thing as an “NVMe SSD”; there are just SSDs that are attached via the NVMe protocol. I know I’ll lose that battle, as the term is now fixed in people’s minds and I doubt that will change, so I won’t say anything more about it.

Now that I have that off my chest, the evidence is pretty clear that NVMe-attached devices have significantly better potential per-device throughput, thanks to using at least 4 NVMe “lanes” vs the maximum of 2 SAS lanes in a dual-ported SAS drive. Unfortunately, I can’t find much in the way of decisive evidence that future versions of normal MLC/TLC NAND devices connected via NVMe will be able to do significantly more IOPS per device than today’s SAS counterparts, despite the potential benefit of supporting tens of thousands of intelligent queues. Having said that, a modern enterprise SSD is an amazing device; there is more processing power, RAM and software logic in one SSD than there used to be in a million-dollar array controller not so long ago. That’s partly why, even though they use similar kinds of flash chips, they perform a bajillion times better than your USB flash drive. Would those devices also benefit from the simplified NVMe software stack? I can’t say for certain, but my theorycrafting says that they should.
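If you want a rough feel for where that per-device throughput advantage comes from, here is a back-of-the-envelope sketch. The line rates and encodings are the published ones for PCIe 3.0 and 12Gb/s SAS, but the calculation deliberately ignores protocol overheads, so treat the outputs as ballpark figures rather than benchmark results.

```python
# Rough, illustrative per-direction bandwidth comparison, ignoring everything
# except line rate and line encoding. Assumes PCIe 3.0 lanes and 12Gb/s SAS.

PCIE3_GTS_PER_LANE = 8.0      # giga-transfers/s per PCIe 3.0 lane
PCIE3_ENCODING = 128 / 130    # 128b/130b line encoding
SAS3_GBPS_PER_LANE = 12.0     # gigabits/s per SAS-3 lane
SAS3_ENCODING = 8 / 10        # 8b/10b line encoding

def pcie3_mb_s(lanes: int) -> float:
    # 8 GT/s * 128/130 ~= 7.88 Gb/s usable per lane, i.e. roughly 985 MB/s
    return lanes * PCIE3_GTS_PER_LANE * PCIE3_ENCODING * 1000 / 8

def sas3_mb_s(lanes: int) -> float:
    # 12 Gb/s * 8/10 = 9.6 Gb/s usable per lane, i.e. 1200 MB/s
    return lanes * SAS3_GBPS_PER_LANE * SAS3_ENCODING * 1000 / 8

print(f"NVMe, PCIe 3.0 x4 : {pcie3_mb_s(4):,.0f} MB/s")   # ~3,938 MB/s
print(f"SAS-3, dual ported: {sas3_mb_s(2):,.0f} MB/s")    # ~2,400 MB/s
```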
The more exciting NVMe-based devices over the next few years will be the really high-end ones, especially those using next-generation persistent memory like Samsung Z-NAND and Intel Optane. I suspect they will have higher-spec, more expensive supporting hardware such as faster internal busses and higher clock-speed chips. That might generate more heat, reducing the number of memory chips you can fit into a given device without thermal overload, which might also explain why NVMe drives are typically lower capacity and higher priced than their SAS cousins today. Having said that, Googling around hasn’t turned up anything conclusive, so if anybody knows for certain, let me know in the comments section :-)
Less Code means more CPU to do other things.
While part of me would love to dive into the benefits of user-mode polling and register handling and all kinds of other cool, deep systems-level software engineering, I won’t, because I’m really not experienced enough to comment intelligently. Sometimes a little knowledge can be a very dangerous thing, and the most complex thing I’ve had the fun of playing around with and programming recently was a 9-way motion detection chip attached to an Arduino that my wife let me borrow after she’d finished using it for one of her master’s projects. If you do geek out at that level of software engineering, I’d strongly recommend you head over to www.spdk.io where you’ll see stuff like this:
“The bedrock of SPDK is a user space, polled-mode, asynchronous, lockless NVMe driver. This provides zero-copy, highly parallel access directly to an SSD from a user space application.”
Ha! And people tell me that I get too technical. The business implication of much better CPU utilisation in the storage driver is that servers accessing SSDs via the NVMe protocol get to spend more of their CPU on ‘useful’ application-level work. This will come in very handy for I/O-intensive applications like real-time analytics, which not only want to access lots of data really quickly, but can also use the extra CPU cycles the NVMe software stack frees up to do interesting things with that data sooner. It’s a virtuous cycle that delivers better results more quickly while driving the demand for ever more I/O.
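To make the polled-mode idea slightly more concrete, here is a deliberately simplified Python sketch contrasting a blocking, interrupt-style read with a polled completion loop. The `device` object and its methods are hypothetical stand-ins used purely for illustration; they are not a real SPDK or kernel API.

```python
# Purely illustrative sketch of two I/O completion models. The "device"
# object and its methods are hypothetical, not a real SPDK or kernel API.

def read_blocking(device, lba, buf):
    # Interrupt-driven model: submit, then sleep until the kernel wakes us.
    # The syscall plus context switch can cost more than the I/O itself
    # once the device responds in tens of microseconds.
    device.submit_read(lba, buf)
    device.wait_for_interrupt()          # block: syscall + context switch
    return buf

def read_polled(device, lba, buf):
    # Polled model (SPDK-style): submit from user space, then spin on the
    # completion queue. No syscalls, no interrupts, no context switches.
    device.submit_read(lba, buf)
    while not device.poll_completion():  # check the completion queue directly
        pass                             # busy-wait; cheap when I/O takes ~10us
    return buf
```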
How good is the CPU improvement in the NVMe software stack ?
How good is this improvement? Well, a 3X – 10X improvement in CPU utilisation for small to medium sized I/O, thanks to a much more elegant software stack, sounds pretty darn good to me. To back that up, I’m going to refer back to both the Intel whitepaper and the Tom’s IT Pro pages that I used before. The first couple of diagrams come from the Intel whitepaper here and show the relative complexity of the legacy SCSI/AHCI stack used for SAS and SATA.
That last graph from the Tom’s IT Pro website is the most informative, but it might need a little explaining (or you could go back and read the whole review, which, if you’re interested in this, would be worth your time). The red and green lines at the bottom are the SAS drives using the legacy software stack: with 4K reads you get about 2,500 IOPS for every 1% of CPU, and with 128K reads you get about 500 IOPS. BUT with the NVMe drives you’re doing about 25,000 4K read IOPS for the same amount of CPU, which is a 10x improvement. The differences aren’t nearly as great for large-block I/O, but for the workloads typically put on an all-flash array, like online transaction processing, the CPU efficiency is still almost 10x better.
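If you want to derive the same kind of “IOPS per 1% of CPU” figure from your own test runs, the arithmetic is trivial. Here is a small sketch; the sample measurements are hypothetical, chosen only to land roughly on the figures quoted above.

```python
# Reproducing the "IOPS per 1% of CPU" metric used in the Tom's IT Pro charts.
# The sample inputs below are hypothetical measurements, for illustration only.

def iops_per_cpu_percent(measured_iops: float, measured_cpu_percent: float) -> float:
    """IOPS delivered for each 1% of host CPU consumed during the test."""
    return measured_iops / measured_cpu_percent

# Hypothetical 4K random read runs, roughly in line with the ~2,500 (legacy
# stack) and ~25,000 (NVMe) figures quoted above.
sas_eff  = iops_per_cpu_percent(measured_iops=150_000, measured_cpu_percent=60)  # 2,500
nvme_eff = iops_per_cpu_percent(measured_iops=500_000, measured_cpu_percent=20)  # 25,000

print(f"Legacy stack efficiency: {sas_eff:,.0f} IOPS per 1% CPU")
print(f"NVMe stack efficiency  : {nvme_eff:,.0f} IOPS per 1% CPU")
print(f"Relative improvement   : {nvme_eff / sas_eff:.0f}x")
```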
This not only has big implications for analytics workloads running on servers, it also matters a lot for storage controllers, which is where most of the really intensive I/O to flash happens today.
Why improving CPU utilisation is so important for storage controllers.
The sad fact is that, as awesome as all-flash arrays are, the storage controller puts a pretty big speed bump between the server and the flash drives. That is justified because the controller aggregates the performance of multiple devices to a degree and adds a lot of value in data services like replication, QoS, dedupe and compression, as well as better capacity utilisation from sharing an expensive resource.
How big is that speed bump? Well, it depends on the software architecture of the array and how much CPU you throw at it. To demonstrate, let’s take a look at two different storage controllers from NetApp.
The first is the entry-level all-flash array, the A200, which got a pretty thorough and quite enthusiastic set of tests and results from storagereview.com here.
For a fairly “real world” workload, which uses Oracle with 80% reads and 20% writes at an 8 kibibyte (8,192 byte) block size (what most people call a kilobyte, which is strictly 8,000 bytes), they reported the following:
“With the Oracle 80-20, the A200 started off at a latency of 0.38ms and stayed under 1ms until it was just under 65K IOPS. It peaked at 129K IOPS with a latency of 4.9ms”
That’s excellent performance for an entry-level array, way more than most people need in that market, but that performance comes out of 24 SSDs: that’s about 5,300 IOPS per SSD at almost 5ms, and 2,700 IOPS per SSD at 1ms. Even when you take into account that there is a decent proportion of writes, and that we’re talking 8K rather than 4K I/O, that doesn’t look great when you consider that the specs for 100% 4K reads from the kinds of SSDs array vendors use are well over 200,000 IOPS at 80 microseconds. All-flash might make array controller performance look really good, but at the entry level, that favour isn’t returned. So what happens when you throw a lot more CPU cores at the same kind of workload?
Fortunately, NetApp published a technical report, TR-4582 NetApp AFF A700 Performance with Oracle Database, which helps to answer that question.
This workload is more write intensive than the one used by storagereview.com, which makes it harder on the controller because writes need more CPU to process than reads. It’s a good example, though, because it also compares against the previous high-end model from NetApp, and between code versions, showing the improvements that can be made between releases.
By throwing a bunch more CPUs at the same number and kind of SSDs, running the same storage operating system, for a similar kind of workload, you get a LOT more IOPS per SSD. Using the data from the report at the 512,000 IOPS mark, and dividing by 23 drives (one was a hot spare), you end up with about 22,260 IOPS per drive at an overall latency of about 600 microseconds.
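The per-drive arithmetic behind those numbers is simple enough to spell out. Here is a quick sketch using the figures quoted above from the storagereview.com review and TR-4582; the rounding in the article is slightly coarser than the raw division.

```python
# Per-device IOPS arithmetic behind the figures quoted above.

def iops_per_drive(total_iops: float, data_drives: int) -> float:
    """Total delivered IOPS divided across the drives doing the work."""
    return total_iops / data_drives

# A200 (storagereview.com, Oracle 80/20, 24 SSDs)
print(f"A200 @ ~1ms  : {iops_per_drive(65_000, 24):,.0f} IOPS/drive")   # ~2,700
print(f"A200 @ 4.9ms : {iops_per_drive(129_000, 24):,.0f} IOPS/drive")  # ~5,300

# A700 (TR-4582, 24 SSDs with one hot spare, so 23 data drives)
print(f"A700 @ ~0.6ms: {iops_per_drive(512_000, 23):,.0f} IOPS/drive")  # ~22,260
```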
This is still a long way off the performance claimed on most SSD datasheets, although those are usually 100% read, and a full discussion of why there is such a discrepancy will probably need a blog post of its own. Having said that, getting over 22,000 IOPS per device for that workload is an exceptionally good result for the industry. As a basis of comparison, the competitor whose CEO claimed ONTAP treats SSD like mechanical disk recently published a benchmark of their most powerful SAS-based controller with a similar workload, and it only managed 6,250 IOPS per device with significantly higher latency.
So what would happen if you did this with NVMe ?
Unfortunately, I can’t publish performance data for unreleased products and plans that may or may not change (though if you sign an NDA, I might be able to tell you some interesting stuff, or I might not). But if you’ll allow me some conjecture, I’d like you to keep in mind that 3X – 10X improvement in CPU utilisation from the simpler NVMe software stack I wrote about earlier.
The one thing that isn’t mentioned in either TR-4582 or the competitor’s benchmark is how much CPU was being used to deliver that workload, so I’ll be a bit cheeky and use the results of a performance test done by the wonderful @netofrombrazil in NetApp’s CPOC (Customer Proof of Concept) labs on a single A700s with 24 drives, just for fun. This kind of stuff typically doesn’t get published because nobody runs 100% random 4K reads in the real world, but for the sake of this blog I think it’s OK to use because it clearly demonstrates what I’m talking about.
It shows ONTAP doing about 42,500 IOPS per drive under 1ms with current software releases, and I reckon if Neto had wanted to go over the 1ms mark, he could have pushed it to 1,100,000 IOPS. Just in case you’re wondering, these were IOPS from SSD, not cache. You’ll notice that the CPU utilisation in this test was sitting just under the 80% mark, but the SSDs were only busy about 40% of the time. That demonstrates pretty clearly that, even with a really good benchmark result, just doubling the CPU efficiency on the storage controller should have a significant impact.
So now, ask yourself: what do you think would happen if the part of ONTAP whose job it is to get data from disk into controller memory were able to benefit from that 3X – 10X improvement in CPU efficiency from NVMe? How much more work could the rest of the system do?
To give an indication, I’ll use data from the only side-by-side comparison of a SAS-based vs an NVMe-based array that (I assume) is equipped with the same number of CPU cores.
Even though I personally feel the IOPS-per-device and latency figures for both units are a bit average compared to the 22,000+ IOPS per device ONTAP gets in the A700, it does clearly show the CPU efficiency improvement you get from the NVMe software stack, by allowing a lot more performance out of the same controller hardware. It’s possible the benefit is mainly due to the improved queue depths available from NVMe rather than the CPU, but the implied client queue depth per device (using Little’s Law) was essentially the same in both cases, and even allowing for some I/O multiplication from RAID and the like, it’s hard to believe the queue depth per device was more than 20 times the incoming queues from the load generator, so it would still fit easily within the limits imposed by SAS.
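For anyone who wants to sanity-check that kind of queue depth reasoning themselves, Little’s Law (outstanding I/Os = IOPS x latency) makes it a one-liner. The sample inputs below are hypothetical, purely to show the shape of the calculation.

```python
# Little's Law: average outstanding I/Os (queue depth) = throughput * latency.
# The sample numbers are hypothetical, purely to illustrate the calculation.

def implied_queue_depth(iops: float, latency_s: float) -> float:
    """Average number of I/Os in flight for a given throughput and latency."""
    return iops * latency_s

total_iops  = 500_000    # hypothetical array-level throughput
latency_s   = 0.0006     # hypothetical average latency of 600 microseconds
data_drives = 23

qd_total     = implied_queue_depth(total_iops, latency_s)   # ~300 in flight
qd_per_drive = qd_total / data_drives                       # ~13 per drive

print(f"Implied outstanding I/Os (array): {qd_total:.0f}")
print(f"Implied queue depth per drive   : {qd_per_drive:.1f}")
# Comfortably within what SAS can queue per device, let alone NVMe's
# tens of thousands of queues, each tens of thousands of commands deep.
```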
These benefits might also be because some code in the rest of the operating system was streamlined while the NVMe work was being done, but it’s also possible that NVMe made this kind of streamlining easier to do. I’m not saying that ONTAP will necessarily see the same level of improvement (I shouldn’t make forward-looking statements like that without an NDA and the associated disclaimers), as ONTAP already has a much more efficient code path from SSD to memory.
Nonetheless, there is a precedent for this kind of improvement, where new hardware enables new ways of improving software, from when NetApp brought All Flash FAS to the marketplace. It wasn’t just a matter of plugging in faster hardware and “hey presto” everything got a lot better. Improvements in the software stack, built on those new hardware possibilities, had, and are still having, a dramatic impact on the performance of the array, as shown in these slides from three years ago when NetApp first released All Flash FAS with ONTAP 8.3.1.
Assuming that history repeats itself, as it so often does in IT, the improvements we’re likely to see in upcoming NVMe-based arrays from vendors with good engineering teams are not JUST going to come from the NVMe communication protocol itself. When you can start assuming that devices will respond in microseconds instead of milliseconds, you can make your software developers’ lives a lot easier and give them the opportunity to write much more efficient code. In the picture above, the orange “Storage” subsystem, which is responsible for moving data between the media and memory, would be an obvious optimisation target with a lower-overhead NVMe software stack, as would the dark purple “Network” subsystem by implementing NVMe over Fabrics.
Programmer productivity and the "Attack of the killer microseconds"
There’s a good article called “The attack of the killer microseconds” on the Communications of the ACM website about how microsecond-level response times mean thinking differently about writing code and, more importantly, how that can make the work of coding simpler here. Amongst other interesting things, it says:
“software techniques to tolerate millisecond-scale latencies (such as software-directed context switching) scale poorly down to microseconds; the overheads in these techniques often equal or exceed the latency of the I/O device itself. As we will see, it is quite easy to take fast hardware and throw away its performance with software designed for millisecond-scale devices.”
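To put some rough numbers on that, here is a quick sketch comparing an assumed context-switch cost of around 5 microseconds against devices of different speeds. All of the figures are illustrative assumptions, not measurements.

```python
# Why millisecond-era techniques stop working: compare an assumed ~5 us
# cost for blocking and rescheduling a thread against devices of different
# speeds. All figures are illustrative assumptions, not measurements.

CONTEXT_SWITCH_US = 5.0   # assumed cost of blocking + waking a thread

devices_us = {
    "spinning disk (~10 ms)": 10_000.0,
    "SATA/SAS SSD (~100 us)": 100.0,
    "NVMe SSD (~80 us)": 80.0,
    "next-gen PM (~10 us)": 10.0,
}

for name, io_latency_us in devices_us.items():
    overhead_pct = CONTEXT_SWITCH_US / io_latency_us * 100
    print(f"{name:24s}: context switch = {overhead_pct:6.2f}% of the I/O time")

# For the 10 ms disk the switch is noise; for a ~10 us device it costs half
# the I/O time, which is exactly the point the article is making.
```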
In the video embedded above the article, the guy who designs Google’s computing infrastructure says something that we all know, but that isn’t often stated so clearly:
“A key part of building an efficient computing infrastructure is thinking about programmer productivity”
The takeaway from this is that careful hardware selection and qualification can make a programmer’s life a LOT easier, and that’s one of the reasons why the A200, A300 and A700 controllers all managed such large performance increases with ONTAP 9.2. It wasn’t just faster CPUs; it was matching the hardware to the requirements of the developers. That same principle applies to application programmers too, and that’s something I’d like to cover in my next blog, where I’ll be talking more about the benefits of NVMe over Fabrics.
<snark>
This is where I get a bit snarky. If you’re not interested in vendors taking pot-shots at each other, feel free to stop reading; it won’t make your life any better. Even so, venting my spleen in front of a virtual audience does give me a certain amount of personal satisfaction, and it might give you a better awareness of the kind of misinformation that’s floating around and the response I think it deserves.
The company that makes the arrays in that side-by-side SAS vs NVMe benchmark consistently asserts that you MUST use NVMe to get the I/O density required to use larger devices. Furthermore, they claim that NetApp (and others) can’t use large (15TB) drives because the I/O density on those large drives with SAS isn’t good enough.
This kind of misinformation annoys me, not just because it’s misguided and self-serving, but also because NetApp’s current I/O per SAS-attached drive for OLTP workloads is better than their SAS offerings by a factor of three or more, and 20% better than their all-NVMe solution. Maybe this is a case of pointing an accusing finger at someone while three more curl back towards you. Just because that company doesn’t feel comfortable with their I/O density per drive doesn’t mean that NetApp has to be. It’s doubly galling when their CEO says the reason NetApp can’t use these larger drives (when clearly, they can) is because ONTAP “treats flash like mechanical disk”, which is just plain wrong. If that were true, the massive performance gains ONTAP made between version 8.2 and 8.3.1, with further gains in every release through to 9.2, would never have happened. Again, the proof is in the numbers, and I feel safe in saying ONTAP 9.2 is better optimised for flash performance and I/O density than that specific competitor, and indeed most, if not all, of the other competitors I come across on a regular basis.
In short “I call shenanigans, shenanigans have been called !”
</snark>
Other posts in this series
- How cool is NVMe ? Part 1 - Snark Hunting
- How cool is NVMe ? – Part 2 – Throughput
- How Cool is NVMe ? - Part 3 - No waiting in Queues
- How cool is NVMe ? – Part 4 – CPU and Software Efficiency
- How cool is NVMe – A summary of benefits for external storage arrays