How cool is NVMe – A summary of benefits for external storage arrays
I can hardly believe it’s coming up to two years since I last wrote a piece in this series, and a lot has happened since then. At the time, NetApp’s first NVMe array, the A800, was still in pre-production, so I had to be careful about what I could disclose, especially around the significant improvements that I knew were coming. Thankfully all of that is now public, with multiple benchmarks to pull apart and look at, so we have some good evidence-based detail on how much NVMe improves the performance of all-flash arrays.
TL;DR - History
If you just want the bottom line, skip all the way to the end to the TL;DR summary. But if you are interested in this and haven't read the other blogs in this series, here's a quick TL;DR of the earlier posts on the improvements you can expect by moving from SAS/SCSI-attached media to NVMe-attached media:
From Part 2 of this series you can see that bandwidth is about 50%-60% better per drive.
From Part 3 you can see that while the per-drive queue depth can be much higher, in practice it isn’t that useful, because there’s not much point in having more queues than you have chips in the SSD device.
From Part 4, there was a lot of potential for improved CPU efficiency, but the benchmarks from the only vendor then shipping both SAS and NVMe arrays from the same family were interesting, yet hardly conclusive about the role of NVMe in the performance improvements.
In that last blog I promised that I’d talk about the benefits of NVMe over Fabrics, so while it might be a year or so later than first intended, I’ll cover that here too.
Firstly, let’s look at the benefits of just using NVMe media. Thankfully I’ve now got lots of good benchmarking data for this, along with some theory-crafting and some data from internal NetApp engineering discussions that I’m not at full liberty to share, though I can talk about what it means.
I’m going to look primarily at four NetApp benchmarks and one from Pure. I'm picking these two vendors because ONTAP and Purity are currently the only storage operating systems that support both SAS and NVMe media and for which there are public benchmarks. I’d like to include PowerMax from Dell/EMC, but outside of marketing claims there’s no data I can find that would be useful for this analysis.
- NetApp AFF A700 Performance with Oracle Database - March 2017 | TR-4582
- NetApp AFF A800 Performance with Oracle RAC Database - March 2019
- SPC-1-V1 A700S
- SPC-1-V3 A800
- Oracle SLOB Benchmarking for the //M70 vs //X90
The impact of controller hardware on performance outside of NVMe
Now on the face of it the A700s and the A800 are similar systems in many respects: both leverage commercial off-the-shelf (COTS) hardware components, both look like a 4RU server with internal drive bays, and both use standard PCIe cards, but there are some significant differences. I’ve also included some data for the A320, which is the most recent addition to the ONTAP family of NVMe-enabled controllers.
One thing to note here is that the "Aggregate ... Bandwidth" numbers are the theoretical total SAS or NVMe local throughput to all the drives, based on the figures in Part 2 of this series. The A320 has lower numbers because it accesses its drives via NVMe-oF over four 100Gb Ethernet connections rather than locally attached drives. I'm not suggesting that any of these controllers can achieve between 45 and 144 gigabytes per second of throughput per controller (the real numbers are between 3 and 10 GiB/sec, limited mostly by CPU grunt); those figures are there mostly to debunk the idea that using SAS somehow bottlenecks the maximum throughput of the controller. The NVMe figures are clearly better, but optimising outside the bottleneck is not a great use of engineering effort. So while NVMe-attached drives are an important component of NetApp’s newest generation of high-end and mid-range storage controllers, they are only one part of the overall picture.
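To make those aggregate figures a little more concrete, here is a minimal sketch of the arithmetic behind them: drive count multiplied by per-drive bandwidth. The drive count and per-drive numbers below are my own illustrative assumptions, not the exact figures from Part 2 or from any particular controller.

```python
# Back-of-envelope for a theoretical "aggregate media bandwidth" figure.
# All constants below are hypothetical, purely for illustration.

DRIVE_BAYS = 48                # hypothetical: internal drive bays in a 4RU controller
SAS_PER_DRIVE_GBPS = 0.95      # hypothetical per-drive bandwidth for a SAS-attached SSD (GB/s)
NVME_PER_DRIVE_GBPS = 3.0      # hypothetical per-drive bandwidth for an NVMe-attached SSD (GB/s)

def aggregate_bandwidth(drives: int, per_drive_gbps: float) -> float:
    """Theoretical total media bandwidth if every drive streamed flat out."""
    return drives * per_drive_gbps

print(f"SAS aggregate : {aggregate_bandwidth(DRIVE_BAYS, SAS_PER_DRIVE_GBPS):.0f} GB/s")
print(f"NVMe aggregate: {aggregate_bandwidth(DRIVE_BAYS, NVME_PER_DRIVE_GBPS):.0f} GB/s")
# Neither figure is achievable in practice; the controllers top out at roughly
# 3-10 GiB/sec, limited by CPU, so the media interconnect isn't the bottleneck.
```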
Comparing NetApp SPC-1 Benchmarks
So what is the impact of all this great new tech? First, let’s look at two results from the most respected benchmark for SAN performance, SPC-1.
At first look, you might think … not a lot of difference there, John. I wouldn’t blame you, because that’s exactly what I thought, and I raised a few questions about whether we should be publishing that, especially after a certain competitor (who doesn’t have an NVMe-based array) started to use those results to say NVMe makes no difference to performance.
As it turns out, these are two different versions of the same benchmark (version 1 for the A700 and version 3 for the A800), and the SPC rules quite clearly state that this kind of direct comparison is against the rules; most SPC members would agree it is generally not thought of as fair play. (Once NetApp became aware of this, we had a quiet word with the competitor in question. They privately agreed that this wasn’t the way to go about things and promised to ask some overzealous field folks to stop misusing this data; it was all very gentlemanly. I don’t expect to see this kind of misinformation again, but if I do, I might get really cross and write another email to management.) I'm doing it in this case because this analysis is not a direct comparison; it uses the benchmark data to highlight one specific benefit, that of NVMe media. If anyone believes I'm being unfair or going against the spirit or letter of the rules, let me know in the comments below.
So what are the differences between these two workloads? As it turns out there are quite a few, and one of the nice things about SPC-1 is that there is full disclosure on just about everything, so I took a rainy weekend day to tease through it and outline the most significant changes.
- 182,674 GiB for the A800 vs 77,504 GiB for the A700 - The A700 has almost double (1.953x to be precise) the ratio of DRAM to benchmark capacity, so a significantly larger share of the A800's I/O hits the media, which is more expensive in terms of CPU path length than I/O served from DRAM.
- The small block I/O component has increased from 4KiB IOPS in the v1 benchmark with the A700 to 8KiB IOPS in the v3 benchmark with the A800 - That change might not seem overly impactful, but it almost doubles the load on all the I/O subsystems (see the quick sketch after this list). This alone probably consumes at least half of the increased CPU allocation in the A800 vs the A700.
- A bit less cache-friendly due to greater randomisation of one of the data streams - Not hugely impactful compared to doubling the benchmark capacity or significantly increasing the average I/O size, but it adds a bit more work for the I/O prediction algorithms and caching subsystems.
- Inline storage efficiencies - This dataset seems to be mostly moderately compressible 8KiB blocks with very few duplicate or inline zero-filled blocks, so the savings are almost all from small block compression. Even though the ONTAP data reduction algorithms are highly efficient, they’re not free either, and the CPU path lengths are longer than when running the same workload with those features turned off. (In the v1 version of the benchmark, turning storage efficiency features on was disallowed because the blocks were highly compressible, which would have given any array using inline compression an unfair advantage; outside of this one situation I can't think of a single time NetApp ever turned off efficiencies in one of their AFF benchmarks.)
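To put a number on the second point above, here is a quick back-of-envelope sketch of why doubling the small-block I/O size from 4KiB to 8KiB roughly doubles the data the I/O subsystems have to move at the same IOPS rate. The IOPS figure is a hypothetical example, not a number taken from either submission.

```python
# Same IOPS rate, twice the block size -> twice the bytes flowing through
# the controller's I/O subsystems.
EXAMPLE_IOPS = 500_000          # hypothetical IOPS rate, purely for illustration
KIB = 1024

for block_kib in (4, 8):
    throughput_gib_s = EXAMPLE_IOPS * block_kib * KIB / (1024 ** 3)
    print(f"{block_kib}KiB blocks at {EXAMPLE_IOPS:,} IOPS -> {throughput_gib_s:.1f} GiB/s")
# 4KiB -> ~1.9 GiB/s, 8KiB -> ~3.8 GiB/s: same IOPS, double the data moved.
```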
But what about the results: did NVMe make a difference? Yes. If you go beyond the headline numbers and dig a little deeper into the benchmark by comparing the A800 under very heavy load to the A700 (it’s a bit hard to eyeball, because the buckets don’t align precisely, so try to trust me a little), the big difference between them is that all the read latency buckets have shifted left by about 100-150 microseconds (usec). That might not sound like much, but it reduces the average response time from 450 usec on the A700 to 350 usec on the A800. In percentage terms this is a 22% improvement in response time on a much more challenging workload.
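For anyone who wants to check my arithmetic, the response-time improvement falls straight out of the two averages quoted above:

```python
# Average SPC-1 response times quoted above (microseconds).
a700_usec = 450
a800_usec = 350

improvement = (a700_usec - a800_usec) / a700_usec
print(f"Response time improvement: {improvement:.0%}")   # ~22%
```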
So there you have it, a solid improvement in latency, on a notoriously difficult workload, which I like to characterise as being a bit like the Nurburgring of storage benchmarking.
Pure SLOB Benchmarking
But as much as I love SPC-1 for the detail and the level of disclosure that a submission requires, most of us don’t buy flash arrays to drive around the most challenging Formula 1 racetrack known to the human race. In fact, many all-flash arrays are bought to improve performance and drive complexity out of the infrastructure for database workloads, which are much more amenable to read-ahead and other storage optimisation techniques. To simulate the performance of an Oracle database, the gold standard is the SLOB Oracle I/O generation tool from @kevinclosson, which is used by a wide variety of vendors to prove the efficacy of their wares. This brings me to another comparison of SAS vs NVMe controllers, this time from Pure Storage, who have produced a few blogs using this tool, the most useful of which for this analysis can be found here: https://blog.purestorage.com/flasharray-x-sets-new-bar-for-oracle-performance/. It was written about a year ago, but it is pretty much the only reasonably up-to-date set of published SAS vs NVMe figures from anyone other than NetApp.
I believe my inclusion of this data falls under fair use (if you’re a lawyer and you disagree, put something in the comments below politely stating why, and I’ll replace it with an equivalent).
To my eye, some of these results seem a little odd, which I suspect is an artefact of Pure's write caching technology, where writes initially land in mirrored DRAM, not NVMe. This results in lower latency when the system is lightly loaded and higher latency when the system is getting close to saturation. Either way, it doesn't make it easy to assess the impact of NVMe, so I'll focus on the 100% read workload on the assumption that the majority of I/O is coming from media.
In both of the Pure 100% read results, you’ll see an almost identical 80 microsecond difference in latency. This is similar to, though not quite as good as, the 100 to 150 microsecond difference we saw going from the A700 to the A800 in the SPC-1 benchmarks: a solid improvement, not super stellar, but certainly useful.
The other thing that surprised me about this benchmark was the performance under heavy load. One might expect that the //X90, given its price premium over the //M70, would have significant increases in CPU, memory and network/backplane vs the //X70. If so, those improvements didn’t make much of a difference: 3% for latency and 6% for IOPS to be precise, and that is … well … how should I put this politely … let’s just say it’s not very much of an improvement.
Comparing OLTP/SLOB benchmarks across generations and between vendors
Now, it's never wise for a man who lives in a grass house to stow thrones. If NetApp is also seeing around a 100-150 usec improvement in read latency, but didn’t see much of a top-line increase in I/O in SPC-1, should they be pulling motes out of their competitors' eyes?
Fortunately NetApp also uses the same SLOB Oracle I/O generation tool to demonstrate the performance of both the A700 and the A800, and the results are, IMHO, considerably more impressive. Firstly, as background, I'd like to discuss a benchmark from a few years ago which compared the improvements between two generations of NetApp's high-end controllers. The majority of that improvement came from more or better memory, CPU and backplane throughput, alongside some great software engineering to take advantage of those improvements, especially around parallelisation and multi-threading, removal of spinlocks and other CPU optimisations that make a big difference with flash media but weren't worth the effort for spinning rust.
There are a few things worthy of note here:
- These are active/active dual controller tests. Unlike Pure, NetApp is able to exploit the full memory and CPU of both controllers; if you wanted to do a direct controller-for-controller comparison against Pure, you'd halve the I/O results for NetApp, though the latency would be the same.
- These are 75% read, so they're not directly comparable to the Pure results above, which are either 100%, 90% or 70% read; the closest would be the 70% read.
- These all use similar SAS-attached flash drives, so you can see that just upgrading software can give you a 30-50 usec improvement (the red and green lines). Adding other non-NVMe hardware updates along with a bit more software goodness can give you another 30 usec or so at the lowest load points.
- The upshot of this is that there are a lot of ways of improving throughput and latency that have nothing to do with whether you're attaching your media via NVMe or SAS.
Now let's look at the latest A800 results.
These are also 75% read with an active/active controller pair setup, so not directly comparable to the Pure results, but very much comparable to the A700 results. The green lines show the performance when attaching the hosts via FC to the front end of the controller, the blue lines the performance for NVMe-oF.
Again, there are a few things worthy of note here:
- Extending the NVMe protocol all the way to the host is where the really significant benefits are. This cannot be overstated, and while I'm a huge fan of NFS and iSCSI for Oracle workloads, in my experience FC is the dominant technology used to attach high-performance Oracle databases to storage controllers. If you are already attaching your hosts via FC for maximum performance, and you want to set yourself up for an easy and almost zero-cost software upgrade, you should probably prepare yourself by reading Implementing and Configuring Modern SANs with NVMe/FC, which also has some excellent NVMe benchmark data.
- The top line performance improved from 475,000 IOPS at 380 usec at the "knee of the curve" to 1,600,000 IOPS at 150 usec ... that's a 233% improvement in IOPS and a 53% reduction in latency vs the A700 numbers.
- I know I shouldn't compare this to Pure's //M70 vs //X90 70% read results, but I can't quite resist, because they only achieved 454,208 IOPS at 630 microseconds for the //X90 and 440,209 IOPS at 650 microseconds for the //X70; that's a 3% latency improvement.
- OK, I really, really shouldn't do this, but just in case you didn't notice the differences .. NetApp's IOPS improvement was 233% ... Pure improves IOPS by 6% .. yes, when NetApp does NVMe, we do it over 200% better than a company that builds the majority of its innovation leadership message on how it's somehow "built from the ground up" for NVMe.
- I'm kind of cheating here a bit, because I'm referencing the NVMe-oF numbers; if you look at the FC front-end attach numbers, I'm "only" getting 900,000 IOPS at 250 microseconds of latency, which is "only" 89% better than the A700 for top line performance, with a 34% reduction in latency.
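Since I'm quoting a lot of percentages, here is a small sketch of how the FC front-end comparison in that last point works out from the knee-of-the-curve figures quoted above:

```python
# Knee-of-the-curve figures quoted above for the 75% read SLOB workload.
a700 = {"iops": 475_000, "usec": 380}        # A700, FC front end
a800_fc = {"iops": 900_000, "usec": 250}     # A800, FC front end

iops_gain = a800_fc["iops"] / a700["iops"] - 1
latency_cut = (a700["usec"] - a800_fc["usec"]) / a700["usec"]
print(f"FC front end: {iops_gain:.0%} more IOPS, {latency_cut:.0%} lower latency")
# -> roughly 89% more IOPS and a 34% latency reduction, as noted above.
```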
Summary Table for A700 vs A800 under heavy load
Summary table //M70 vs //X90 under heavy load
I recognise that this kind of vendor chest beating is kind of crass, and isn't that informative, but it is fun to say what's on your mind occasionally, especially when your team is winning :-). Really what you should be looking at is the latency difference between the A800 and A700 FC front-end numbers, and yet again we see a tidy improvement in latency of about 130 microseconds; this seems pretty consistent across all the benchmarks for a mixed read/write workload.
100% Random Reads - The easiest way of quantifying the benefit of NVMe Media in an Array
But if we really want to see where NVMe shines, it's on small block read I/O. As outlined in Part 4 of this series, this is where the CPU and queuing improvements that come from the NVMe stack can really make a difference. Looking at the A800 numbers for the 100% read workload, the results for FC are kind of awesome, especially the improvements between ONTAP 9.4 and 9.5.
Things to note
- Once again, extending the NVMe protocol all the way to the host is where the really significant benefits are. NetApp has been driving these standards for years and pretty much gifted the NVMe Asymmetric Namespace Access (ANA) design, the equivalent of ALUA for FC/iSCSI, which provides automated resilience to path failure, to the industry to help speed adoption. If you want to invest with a company which is genuinely driving innovation with end-to-end NVMe, then NetApp is literally a year or two ahead of the rest of the industry.
- The number of IOPS at the 200 microsecond mark for FC at the front end has improved from 800,000 IOPS for the 75% read workload to 1,100,000 for the 100% read workload
- The number of IOPS at the 200 microsecond mark for end to end NVMe has improved from 1,600,000 IOPS for the 75% read workload to 2,200,000 for the 100% read workload
- Both of those results show that the more small block reads you do, the greater the benefit of the NVMe protocol. This is in complete agreement with Tom's IT Pro graphic in Part 4 of this series and shows the impact of NVMe's CPU efficiency on small block I/O (see the quick arithmetic after this list).
- Even if you cut those IOPS numbers in half, they're still WAY better than Pure's figures of 490,407 IOPS at 440 microseconds for the //X90 or 414,981 IOPS at 530 microseconds for the //X70.
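A quick bit of arithmetic on the 200 microsecond figures quoted above shows how the gap widens in absolute terms as the read percentage goes up:

```python
# IOPS at the 200 microsecond mark, taken from the bullets above.
fc   = {"75% read": 800_000,   "100% read": 1_100_000}   # FC front end
nvme = {"75% read": 1_600_000, "100% read": 2_200_000}   # end-to-end NVMe

for mix in ("75% read", "100% read"):
    gap = nvme[mix] - fc[mix]
    print(f"{mix}: end-to-end NVMe delivers {gap:,} more IOPS than the FC front end")
# The absolute advantage grows from 800,000 extra IOPS at 75% read to
# 1,100,000 extra IOPS at 100% read - the read-heavier the workload,
# the more the NVMe stack's efficiency shows up.
```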
And one final graph: just in case you're thinking "The A800 sure looks awesome, but it's a teeny bit out of my price range", you should also note that NetApp recently released our modular A320. For now, here's a quick preview of the single controller performance.
SLOB 100% Read Results
The A320 is about the middle of our midrange, and is specifically designed to be part of a scale-out rather than scale-up architecture, sitting between the A300 and the A700. While it is the second of NetApp's end-to-end NVMe offerings, it should be noted that it was announced along with our first native NVMe shelf, the NS224. That shelf will be attachable to most recent mid-range and high-end controllers in a future ONTAP release.
A300 and MAX Data Performance
That's not to say that the A300, or even the A220 with SAS drives, is slow; far from it. The following graph from the Cisco FlexPod whitepaper shows it achieving about 200,000 reads at about 400 microseconds (more or less the same as the //M70, but with 4KiB instead of 8KiB reads). But check out the blue line .. 600,000 IOPS at a barely noticeable latency figure ... you need to zoom in on the graph underneath to see it ... yes, that's between five (5) and twenty (20) microseconds, all thanks to the use of Optane and a very, very efficient software stack. That's not NVMe exactly (it's attached to the memory bus, not the PCIe bus), but if you want to start taking advantage of next generation persistent or storage class memory, then MAX Data is the only game in town today.
TL;DR Summary - Will the real leader in NVMe array technology please stand up?
NetApp has a strong track record of improving top line performance on their flagship arrays with a cadence of about every two years; between the AFF8080EX and the A700 there was a considerable improvement in top line performance and latency for typical online transaction processing workloads. When NetApp introduced the A800 there was almost a doubling of performance for Fibre Channel workloads and a 130 microsecond reduction in latency, most of which can be attributed to the use of NVMe media.
We can also compare this to Pure, which has a similar dual-controller architecture using similar off-the-shelf components, and which also has a two-year cadence of hardware refreshes. Both NetApp's and Pure's refreshes straddled the industry transition from SAS to NVMe interconnects for media. In Pure's case it appears there was a latency reduction of about 80 microseconds, much of which I believe is due directly to the use of NVMe media.
What is strikingly different, though, is that NetApp not only began with lower latency on SAS-attached media than Pure achieved with NVMe, but that the refresh resulted in larger overall reductions in latency, in both absolute and percentage terms, and much larger top line IOPS numbers. These performance improvements also coincided with, but did not depend upon, the introduction of true end-to-end NVMe, AND the release of MAX Data, which deserves a blog post all of its own.
All of that makes NVMe very cool indeed, especially when it is wrapped in the kind of things NetApp does, with a reliability and consistency that few, if any, can match.