NVMe Performance Testing in Public, Private and Hybrid Clouds – Part 2 – Be Wary of Memory Compression
In Part 1 of this blog series, I promised that I would talk “about the most recent datacenter workloads, including a little on benchmark design and the difference between a benchmark and a performance test. This will let me show off some brand-new test results published by both NetApp and Pure for end to end NVMe with context and comparability”.
Firstly, there are very few true benchmarks in the storage industry. Using my favourite dictionary definition, most of the things people talk about as benchmarks simply don’t make the grade, because a benchmark needs to provide ..
“a level of quality that can be used as a standard when comparing other things”
The two key words here are “standard” and “comparing”. Almost any two well-configured storage performance tests can be considered benchmarks if they’re run with the same workload generator, with the same or very similar configurations, and with a level of disclosure that allows them to be repeated by others. There are a number of performance tests which get close to this definition .. one good example, IMO, is https://www.netapp.com/us/media/tr-4767.pdf which uses the SLOB workload generator and includes pretty graphs like this
This TR shows significant improvements from changing from SCSI to NVMe at the front end, and the improvements from one release to the next. It also provides a good level of disclosure, including the size of the database and a bunch of other stuff, although it doesn’t explicitly give all of the 22 individually tuneable SLOB parameters.
Be careful comparing "benchmarks"
This is problematic because, like most other published SLOB results from multiple vendors, it increases the work being done in the performance test by increasing the number of SLOB "users", which makes transparency and valid comparisons between different vendors' benchmarks difficult. A SLOB user is not a thread, a transaction, or anything like that; it's just a session hurling IO requests into a central pool. Depending on the other settings, 16 SLOB users in one configuration might be equivalent to 8 SLOB users in another. As an example, say you build a 1TB database with 1024 users: you might reach the saturation point of your test with only 128 active users, and those 128 users, being only 1/8th of the total SLOB users, touch only 1/8th of the database itself.
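To make that arithmetic concrete, here's a rough back-of-the-envelope sketch in Python. It assumes the classic SLOB layout in which each user owns an equal slice of the database; the database size and user counts come from the example above, and the 600GB of array DRAM is a purely hypothetical figure for illustration, not a disclosed number.

```python
# Back-of-the-envelope sketch of how the number of active SLOB users changes
# the effective working set. Assumes each SLOB user ("schema") owns an equal
# slice of the database; all figures are illustrative only.

DB_SIZE_GB = 1024          # 1 TB database, as in the example above
LOADED_USERS = 1024        # schemas created at data-load time
ACTIVE_USERS = 128         # sessions actually driving IO during the test
ARRAY_DRAM_GB = 600        # hypothetical controller DRAM available for caching

per_user_slice_gb = DB_SIZE_GB / LOADED_USERS        # 1 GB per schema
active_working_set_gb = ACTIVE_USERS * per_user_slice_gb

print(f"Active working set: {active_working_set_gb:.0f} GB "
      f"({active_working_set_gb / DB_SIZE_GB:.0%} of the database)")
print(f"Fits in {ARRAY_DRAM_GB} GB of array DRAM even before compression: "
      f"{active_working_set_gb <= ARRAY_DRAM_GB}")
```

With these illustrative numbers, the active working set is only 128GB, which fits comfortably in controller DRAM long before any compression is applied.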
I point this out because, in the past, a vendor might cheat a little by deliberately testing a highly cacheable workload; these days, it's hard to avoid doing so without going to significant effort, like testing with half-petabyte databases.
The same thing applies to other benchmarks like the Swingbench Order Entry benchmark, which in its own words
“It introduces heavy contention on a small number of tables and is designed to stress interconnects and memory”.
This small working set size issue becomes even more of a challenge in the presence of inline storage compression, where research articles show that these datasets can be compressed by a factor of 4 or 5.
Testing NVMe media or DRAM speeds ?
That means that a lot of the performance testing done up until now on arrays with NVMe media, such as the following,
NetApp’s A800
- https://www.netapp.com/us/media/tr-4767.pdf
- 1.5 TB Database with 600 GB of DRAM
Pure //X90
- https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/flashstack_oracle_rac_19_nvme_roce_v3.html#_Toc34739681
- 1.6TB database with 1TB+ (?)* of DRAM
- https://blog.purestorage.com/flasharray-x-sets-new-bar-for-oracle-performance/
- 1.6TB (?)** database with 1TB+ (?)* DRAM
HP Primera
- https://community.hpe.com/t5/HPE-Primera-Storage/HPE-Primera-Comparing-IOPS-with-an-HPE-3PAR-array-using-HPE/td-p/
- 2.25TB Database with 2TB of Cache
is not really that useful for evaluating the performance improvements of NVMe media, given the fairly small percentage of the data that is likely to be read from the SSDs. What we seem to be seeing is how fast compressed data can be read from DRAM over a network. The rough cache arithmetic is sketched below.
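To put some rough numbers on that, here's a small sketch using the database and DRAM figures from the list above together with the 4-5x compression range cited earlier. Which ratio any particular dataset actually achieves is my assumption, not a disclosed figure.

```python
# Rough sketch of why these tests mostly exercise DRAM rather than NVMe media.
# Database and DRAM sizes are taken from the list above; the 4-5x compression
# factor is the range cited earlier, and the actual ratio achieved in any
# given test is an assumption, not a disclosed number.

tests = [
    # (label, database_size_tb, dram_tb)
    ("NetApp A800 (TR-4767)",       1.5,  0.6),
    ("Pure //X90 (FlashStack CVD)", 1.6,  1.0),
    ("HP Primera",                  2.25, 2.0),
]

for compression_ratio in (4, 5):
    print(f"--- assuming {compression_ratio}:1 inline compression ---")
    for label, db_tb, dram_tb in tests:
        compressed_tb = db_tb / compression_ratio
        fully_cacheable = compressed_tb <= dram_tb
        print(f"{label}: {db_tb} TB compresses to ~{compressed_tb:.2f} TB "
              f"vs {dram_tb} TB DRAM -> fully cacheable: {fully_cacheable}")
```

Under either compression ratio, every one of the listed datasets shrinks to well below the DRAM available in the array under test.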
Pure's own performance tests contradict their marketing claims
So how does Pure stack up on its “mostly cached reads” test ? Let's look at their most recent test results, which use NVMe-oF as the transport, on the latest Cisco equipment, on their biggest array … yes, gentle reader, the best they do on a lightly loaded system, with a workload that is almost certainly smaller than the DRAM in their systems and 100% highly compressible small-block reads, is three hundred (300) microseconds.
And if we use something that seems to have a larger working set size, the read latency ends up at nine hundred (900) microseconds under load …
Or, if you use their Sysbench results, you end up with four hundred (400) microseconds for a read-dominated mixed workload under very light load, heading up towards one thousand one hundred and eighty (1180) microseconds under heavy load, before what I assume is the point where it hits the knee of the latency curve and the numbers begin to get really ugly.
Where are the "as low as 250 microsecond" results from Pure ?
In Pure’s “Built for slow from the ground up” architecture, their published performance tests strongly imply that even when using NVMe-oF, it takes a host at least 300 microseconds to retrieve data from DRAM on Pure's fastest box, almost four times as long as the 80 microseconds it takes a host to retrieve data using NVMe-oF from NetApp's A800; even an old-school A700 with end-to-end SCSI comes in at about 120 microseconds.
At this point, I assume Pure’s minimum latency is three to four times higher than NetApp’s because their 32KiB compression group size is four times larger than NetApp’s.
Now, if Pure’s own benchmarks are painting a true picture, and that 300 microseconds is as fast as they can grab stuff from DRAM, I’m curious to see how they’re planning on achieving their “as low as 250 microsecond” marketing claims for I/O from their proprietary SSDs, or Optane, or any other storage media. Not only do they start with what appears to be a minimum latency higher than 250 microseconds, but they then need to add the time it takes to fetch data from media into DRAM first. How much more latency does that add in a Pure system ? I think the answer is in their FIO results, which start at 700 microseconds, so it looks like it takes about another 400 microseconds to grab a data block from their DirectFlash NVMe devices (the arithmetic is sketched below), and that gets progressively worse as more write workload is added to the system. In the absence of evidence to the contrary .. I’m calling Pure’s statements like
“remove all the legacy protocols out of the array first, as this is where the biggest bottle neck exists”
as PURE-TRIPE***, as they appear to have much bigger and far worse problems with their “Built for slow from the ground up” architecture; an architecture that they seem to be stuck with, if their lacklustre six percent improvement in SLOB results between the //M70 and the //X90 is any indication.
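For what it's worth, the latency arithmetic above can be laid out in a few lines. The figures are my reading of the published results quoted in this post, and the split between a cached read and a media fetch is an inference on my part, not something Pure has disclosed.

```python
# Sketch of the latency decomposition argued above, using the figures quoted
# in this post. These are readings of published results, not measurements,
# and the split between "cached read" and "media fetch" is an inference.

CACHED_READ_US = 300   # best published small-block read latency (data in DRAM)
FIO_READ_US    = 700   # published FIO read latency at light load
CLAIMED_MIN_US = 250   # "as low as 250 microseconds" marketing claim
NETAPP_A800_US = 80    # NVMe-oF cached read latency quoted for the A800

media_fetch_us = FIO_READ_US - CACHED_READ_US
print(f"Implied time to fetch a block from media into DRAM: ~{media_fetch_us} us")
print(f"Gap between best cached read and the marketing claim: "
      f"{CACHED_READ_US - CLAIMED_MIN_US} us before media is even touched")
print(f"Cached read latency ratio vs the A800: "
      f"{CACHED_READ_US / NETAPP_A800_US:.1f}x")
```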
Pure should prove they can achieve their marketing claims, or apologise for misleading the tech community
Given the amount of business smack-talk they’ve been throwing around, I politely contend that they should put up or shut up. Even today I had someone tell me that a Pure reseller was claiming they could achieve 150 microseconds, which, based on Pure's own published results, I'm calling PURE-TRIPE***. I strongly believe they should either prove they can "walk the walk" as well as they "talk the talk", or retract their claims of superior performance from DirectFlash or future Optane enhancements made here, here, here and, most importantly, here
Furthermore, they should issue a public apology to the tech media who have reported their claims in good faith, and do so without wriggling out of it by using weasel words like “as low as”.
In my next blog, I'll talk a little about the results of less optimistic performance testing, and what kind of storage performance you should expect from public and private cloud deployments.
* DRAM sizes are not disclosed by Pure; I'm using data from https://www.rajeshvu.com/storage/pure/articles/pure-flasharray-models which seems credible at face value
** Database size not disclosed; I'm assuming the same size and methodology as the Cisco CVD
*** TRIPE = Technically Risible Inaccuracy Propagation Engineering
Comment: Principal Systems Engineer, Global Technology Office (4 years ago)
John, thanks for the article. It is a very good read. From my own observations:
1. Pure does not cache reads, but they do hold metadata in RAM (which isn't infinite in capacity);
2. The "built from the ground up" architecture is a single-active-controller system that cannot scale (except in capacity); this introduces a performance bottleneck, which is why the performance improvement between generations is so small;
3. They removed flash management from dedicated chips on the SSDs and put it on the single active controller: the more updates you do, the closer you get to a disaster;
4. Their marketing is really innovative.