NVMe Performance Testing in Public, Private and Hybrid Clouds – Part 2b – How to get rid of your memory
This is really more of a Part 2b, because the last post got to be a little long and I was waiting on some data. In Part 2a, I talked about the effects of memory compression on performance testing. This came up partly due to some work NetApp did while testing its most recent mid-range storage controller, the A400, which has an accelerator chip that offloads some of the heavy lifting of data reduction from the main CPU. The net result was that reads coming out of memory were happening significantly faster than on the A300 it replaced. Overall response times were about the same as the A800, and there were other surprises with how data reduction worked on highly compressible workloads like those generated by SLOB. As a result, NetApp decided it was time to change the way it did its performance tests, with results that look a bit like this :-)
These changes were similar to the kinds of changes that happened between SPC-1 V1 and SPC-1 V3, which I covered in the “Comparing NetApp SPC-1 Benchmarks” section of the final post in my “How cool is NVMe” blog series, where I warned against making direct comparisons of seemingly similar benchmarks.
So what are these changes? Firstly, instead of ramping up the number of users across the test (which gradually makes the workload harder to cache), the overall workload size started significantly larger than the memory of the controller and stayed that way throughout the test, and I/O was increased by adding threads rather than users. The other change was to make the SLOB data much less compressible via the OBFUSCATE_COLUMNS option introduced in the most recent version of SLOB by Kevin Closson.
The full results of that change can be found here: https://www.netapp.com/us/media/tr-4819-design.pdf
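To make that change concrete, here's a minimal back-of-the-envelope sketch of the idea: size the SLOB working set so it dwarfs the controller's DRAM, then push load by adding threads per schema rather than adding schemas. The DRAM size, schema count and SCALE value below are illustrative assumptions, not NetApp's actual test parameters.

```python
# Rough sketch (not NetApp's actual test harness): check that the SLOB working
# set stays well above controller DRAM, and scale load by threads, not schemas.
# All sizes and SLOB parameters below are illustrative assumptions only.

DB_BLOCK_SIZE = 8 * 1024           # Oracle block size in bytes (typical 8 KiB)
CONTROLLER_DRAM_GIB = 128          # hypothetical controller DRAM, not a real spec
SLOB_SCHEMAS = 64                  # fixed number of SLOB users/schemas
SLOB_SCALE_BLOCKS = 1_000_000      # blocks per schema (slob.conf SCALE), illustrative

working_set_gib = SLOB_SCHEMAS * SLOB_SCALE_BLOCKS * DB_BLOCK_SIZE / 2**30
ratio = working_set_gib / CONTROLLER_DRAM_GIB
print(f"Working set: {working_set_gib:.0f} GiB ({ratio:.1f}x controller DRAM)")
assert ratio > 2, "working set should dwarf DRAM so reads can't be served from cache"

# Instead of adding schemas (users) to push load, ramp threads per schema so the
# data footprint, and therefore the cache pressure, stays constant across runs.
for threads_per_schema in (1, 2, 4, 8, 16):
    print(f"run SLOB with {SLOB_SCHEMAS} schemas x {threads_per_schema} threads "
          f"over the same {working_set_gib:.0f} GiB working set")
```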
Comparing Performance Tests
The first thing you’ll notice if you compare this to the A800 and A700 numbers using the previous methodology is that the initial latency is about 220 and 240 microseconds respectively, compared to the 90 and 100 microsecond numbers for the bigger controllers. The number of IOPS before the knee of the curve is also much smaller than the A700 and A800 numbers; however, that is to be expected given the A400’s much smaller price tag and hardware spec. A better comparison might be the A320, which had similar CPU power to the A400 but a bit more memory. The A320 was an interesting package of technology, probably a little ahead of its time, and has since been replaced by the A400, which is a better match for the needs of the mid-range market.
By comparing these two performance tests, we see the impact of a few things:
- Single-controller (A320) vs dual-controller (A400) results
- Cache-friendly (A320) vs cache-antagonistic (A400) workloads
- Highly compressible (A320) vs partially compressible (A400) data
The A400 results show a more gradual climb in response times, higher latency through most of the data points than the A320, and lower per-controller numbers before the “knee of the curve”.
Does this mean the A400 was a step backwards in performance? No, because in a like-for-like test with similar memory pressure and similar kinds of data the A400 would easily outperform the A320 thanks to that cute little accelerator card built onto the cluster interconnect, but that’s a story for another time.
Scary Engineering Graphs
But 100% read workloads are a bit silly, so I thought I’d go to the other extreme with this graph.
Personally, I find that graph a little scary at first glance, clearly the work of dedicated engineers rather than a slick marketing department, and as such it's well worth examining. This is the hardest SLOB workload, the 100% update. It isn’t a typical datacentre workload unless you’re doing lots of data masking for Test/Dev, so you don’t see it used much in published performance testing. I’m including it here to show what a little ripper of a machine the A400 really is. This workload drives a full read-modify-write (RMW) cycle: the bane of storage arrays, it kills cache, exposes the relatively weak write performance of NAND, and creates competition for resources across software sub-systems, interconnects and back-end storage paths. Even so, the A400 still delivers almost a hundred thousand more IOPS at the 800 microsecond latency point than Pure does on an //X90 with a 70:30 read-to-write ratio on a cache-friendly Sysbench workload.
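To illustrate why a 100% update workload is so punishing, here's a purely conceptual sketch (not a model of how ONTAP or any particular array behaves) that assumes every front-end write triggers a read-modify-write, i.e. one back-end read plus one back-end write, while reads stay one-for-one and the cache absorbs nothing. The penalty factor is an assumption for illustration only.

```python
# Conceptual sketch only: rough back-end I/O per front-end operation for a given
# read/write mix, assuming each write costs a back-end read plus a back-end write
# (RMW) and no cache hits. Factors are illustrative, not measured array behaviour.

def backend_ops_per_frontend_op(read_fraction: float, rmw_reads_per_write: float = 1.0) -> float:
    write_fraction = 1.0 - read_fraction
    reads = read_fraction * 1.0                            # each read = one back-end read
    writes = write_fraction * (rmw_reads_per_write + 1.0)  # each write = RMW read(s) + write
    return reads + writes

for label, read_fraction in (("100% update (SLOB UPDATE_PCT=100)", 0.0),
                             ("70:30 read/write mix", 0.7),
                             ("100% read", 1.0)):
    print(f"{label}: ~{backend_ops_per_frontend_op(read_fraction):.2f} back-end ops per front-end op")
```

Under those assumptions the 100% update case costs roughly twice the back-end work per front-end I/O of a pure read workload, which is why it makes such an unflattering graph.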
Sexy Marketing Graph (?)
Another, more marketing-friendly way of presenting the same data looks like this. It's not perfect, as I had to eyeball the numbers, but I think it's fair. You’ll note that achieving 250 microsecond response times on a mid-range system is still quite possible, even under adverse circumstances with very little DRAM caching. I'm at a loss as to why a company like Pure would make this a cornerstone of their marketing while failing to deliver it, and simultaneously criticise everyone else who does demonstrate it for not using NVMe.
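For anyone who wants to make the same comparison without eyeballing, here's a small sketch of how you could interpolate the IOPS delivered at a chosen latency point from a handful of (IOPS, latency) pairs read off a published curve. The sample curve below is made up for illustration and is not any vendor's result.

```python
# Linearly interpolate the IOPS achieved at a target latency from a few
# (iops, latency_us) points read off a published latency-vs-IOPS curve.
# The example curve is fabricated for illustration; substitute real readings.

def iops_at_latency(curve: list[tuple[float, float]], target_latency_us: float) -> float:
    """curve is a list of (iops, latency_us) points, sorted by increasing IOPS."""
    for (iops_lo, lat_lo), (iops_hi, lat_hi) in zip(curve, curve[1:]):
        if lat_lo <= target_latency_us <= lat_hi:
            frac = (target_latency_us - lat_lo) / (lat_hi - lat_lo)
            return iops_lo + frac * (iops_hi - iops_lo)
    raise ValueError("target latency lies outside the measured curve")

example_curve = [(50_000, 210.0), (150_000, 240.0), (250_000, 320.0), (300_000, 800.0)]
print(f"~{iops_at_latency(example_curve, 250.0):,.0f} IOPS at the 250 microsecond point")
```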
A call to arms
While this kind of performance testing is a big improvement on what currently passes for industry-standard practice, and I expect to see NetApp move towards this kind of cache-antagonistic performance testing going forward, it’s not a substitute for independently audited third-party testing by the Storage Performance Council, or even the good folks at StorageReview. I'm hoping to see not just better performance testing methodologies all around, but more use of full disclosure and reproducibility, not just from NetApp, but from the whole industry.
If you think I’m being unfair about this, then I’m OK with that; I'm being deliberately contentious because I think it's warranted. So if you have an objection, or a better idea, let me know why and what could or should be changed, and I’ll do my best to engage positively with that. I think our industry is well overdue for a robust discussion about the fairest and most efficient ways to measure and compare storage performance, because IMHO there’s way too much TRIPE out there getting in the way of good decisions.
CPOC CPOC CPOC !!!
P.S. If you’re curious about *exactly* how many more microseconds it takes to retrieve data from NVMe and SAS disks vs DRAM on an A320 and an A400, let me know. There’s a gentleman by the name of Neto who works at NetApp's Customer Proof Of Concept (CPOC) labs who can run pretty much any test you like; he gave me those numbers 10 minutes after I asked for some clarifications. I was shocked, because the numbers at the array, as reported by our QoS subsystems and excluding network RTT, were mind-blowingly low, even for SAS-connected SSD. I'm tempted to put up the raw data, but it deserves more than being tacked on to the end of a Part 2b blog post.