NVMe Performance Testing in Public, Private and Hybrid Clouds – Part 2 – Be Wary of Memory Compression
In Part 1 of this blog series, I promised that I would talk “about the most recent datacenter workloads, including a little on benchmark design and the difference between a benchmark and a performance test. This will let me show off some brand-new test results published by both NetApp and Pure for end to end NVMe with context and comparability”.
Firstly, there are very few true benchmarks in the storage industry. Using my favourite dictionary definition, most of the things people talk about as benchmarks simply don’t make the grade, because a benchmark needs to provide ..
“a level of quality that can be used as a standard when comparing other things”
The two key words here are “standard” and “comparing”. Almost any two well-configured storage performance tests can be considered benchmarks if they’re run with the same workload generator, with the same or very similar configurations, and with a level of disclosure that allows them to be repeated by others. There are a number of performance tests which get close to this definition .. one good example, IMO, is https://www.netapp.com/us/media/tr-4767.pdf which uses the SLOB workload generator and includes pretty graphs like this
This TR shows significant improvements from changing from SCSI to NVMe at the front end, and the improvements from one release to the next. It also provides a good level of disclosure, including the size of the database and a bunch of other stuff, although it doesn’t explicitly give all of the 22 individually tuneable SLOB parameters.
Be careful comparing "benchmarks"
This is problematic because, like most other published SLOB results from multiple vendors, it increases the work being done in the performance test by increasing the number of SLOB "users", which makes transparency and valid comparisons between different vendors' benchmarks difficult. A SLOB user is not a thread, a transaction, or anything like that; it's just a session hurling IO requests into a central pool. Depending on the other settings, 16 SLOB users in one configuration might be equivalent to 8 SLOB users in another. As an example, say you build a 1TB database with 1024 users: you might reach the saturation point of your test with only 128 active users, and those 128 users, being only 1/8th of the total SLOB users, touch only 1/8th of the database itself.
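To make that arithmetic concrete, here's a rough back-of-the-envelope sketch in Python. It assumes the classic SLOB layout in which each user owns an equal slice of the database; the database size and user counts come from the example above, and the 600GB of array DRAM is a purely hypothetical figure for illustration, not a disclosed number.

```python
# Back-of-the-envelope sketch of how the number of active SLOB users changes
# the effective working set. Assumes each SLOB user ("schema") owns an equal
# slice of the database; all figures are illustrative only.

DB_SIZE_GB = 1024          # 1 TB database, as in the example above
LOADED_USERS = 1024        # schemas created at data-load time
ACTIVE_USERS = 128         # sessions actually driving IO during the test
ARRAY_DRAM_GB = 600        # hypothetical controller DRAM available for caching

per_user_slice_gb = DB_SIZE_GB / LOADED_USERS        # 1 GB per schema
active_working_set_gb = ACTIVE_USERS * per_user_slice_gb

print(f"Active working set: {active_working_set_gb:.0f} GB "
      f"({active_working_set_gb / DB_SIZE_GB:.0%} of the database)")
print(f"Fits in {ARRAY_DRAM_GB} GB of array DRAM even before compression: "
      f"{active_working_set_gb <= ARRAY_DRAM_GB}")
```

With these illustrative numbers, the active working set is only 128GB, which fits comfortably in controller DRAM long before any compression is applied.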
I point this out because, in the past, a vendor might cheat a little by deliberately testing a highly cacheable workload; these days, it's hard to avoid doing so without going to significant effort, like testing with half-petabyte databases.
The same thing applies to other benchmarks like the Swingbench Order Entry benchmark, which in its own words
“It introduces heavy contention on a small number of tables and is designed to stress interconnects and memory”.
This small working set size issue becomes even more of a challenge in the presence of inline storage compression, where research articles show that these datasets can be compressed by a factor of 4 or 5.
Testing NVMe media or DRAM speeds ?
That means that a lot of the performance testing done up until now on arrays with NVMe media, such as the following,
NetApp’s A800
- https://www.netapp.com/us/media/tr-4767.pdf
- 1.5 TB Database with 600 GB of DRAM
Pure //X90
- https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/flashstack_oracle_rac_19_nvme_roce_v3.html#_Toc34739681
- 1.6TB database with 1TB+ (?)* of DRAM
- https://blog.purestorage.com/flasharray-x-sets-new-bar-for-oracle-performance/
- 1.6TB (?)** database with 1TB+ (?)* DRAM
HP Primera
- https://community.hpe.com/t5/HPE-Primera-Storage/HPE-Primera-Comparing-IOPS-with-an-HPE-3PAR-array-using-HPE/td-p/
- 2.25TB Database with 2TB of Cache
is not really that useful for evaluating the performance improvements of NVMe media, given the fairly small percentage of the data that is likely to be read from the SSDs. What we seem to be seeing is how fast compressed data can be read from DRAM over a network. The rough cache arithmetic is sketched below.
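To put some rough numbers on that, here's a small sketch using the database and DRAM figures from the list above together with the 4-5x compression range cited earlier. Which ratio any particular dataset actually achieves is my assumption, not a disclosed figure.

```python
# Rough sketch of why these tests mostly exercise DRAM rather than NVMe media.
# Database and DRAM sizes are taken from the list above; the 4-5x compression
# factor is the range cited earlier, and the actual ratio achieved in any
# given test is an assumption, not a disclosed number.

tests = [
    # (label, database_size_tb, dram_tb)
    ("NetApp A800 (TR-4767)",       1.5,  0.6),
    ("Pure //X90 (FlashStack CVD)", 1.6,  1.0),
    ("HP Primera",                  2.25, 2.0),
]

for compression_ratio in (4, 5):
    print(f"--- assuming {compression_ratio}:1 inline compression ---")
    for label, db_tb, dram_tb in tests:
        compressed_tb = db_tb / compression_ratio
        fully_cacheable = compressed_tb <= dram_tb
        print(f"{label}: {db_tb} TB compresses to ~{compressed_tb:.2f} TB "
              f"vs {dram_tb} TB DRAM -> fully cacheable: {fully_cacheable}")
```

Under either compression ratio, every one of the listed datasets shrinks to well below the DRAM available in the array under test.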
Pure's own performance tests contradict their marketing claims
So how does Pure stack up on its “mostly cached reads” test ? Let's look at their most recent test results, which use NVMe-oF as the transport, on the latest Cisco equipment, on their biggest array … yes, gentle reader, the best they do on a lightly loaded system, with a workload that is almost certainly smaller than the DRAM in their systems and 100% highly compressible small-block reads, is three hundred (300) microseconds.
And if we use something that seems to have a larger working set size, the read latency ends up at nine hundred (900) microseconds under load …
Or, if you use their Sysbench results, you end up with four hundred (400) microseconds for a read-dominated mixed workload under very light load, heading up towards one thousand one hundred and eighty (1180) microseconds under heavy load, before what I assume is the point where it hits the knee of the latency curve and the numbers begin to get really ugly.
Where are the "as low as 250 microsecond" results from Pure ?
In Pure’s “Built for slow from the ground up” architecture, their published performance tests strongly imply that even when using NVMe-oF, it takes a host at least 300 microseconds to retrieve data from DRAM on Pure's fastest box, almost four times as long as the 80 microseconds it takes a host to retrieve data using NVMe-oF from NetApp's A800; even an old-school A700 with end-to-end SCSI comes in at about 120 microseconds.
At this point, I assume Pure’s minimum latency is three to four times higher than NetApp’s because their 32KiB compression group size is four times larger than NetApp’s.
Now, if Pure’s own benchmarks are painting a true picture, and that 300 microseconds is as fast as they can grab stuff from DRAM, I’m curious to see how they’re planning on achieving their “as low as 250 microsecond” marketing claims for I/O from their proprietary SSDs, or Optane, or any other storage media. Not only do they start with what appears to be a minimum latency higher than 250 microseconds, but they then need to add the time it takes to fetch data from media into DRAM first. How much more latency does that add in a Pure system ? I think the answer is in their FIO results, which start at 700 microseconds, so it looks like it takes about another 400 microseconds to grab a data block from their DirectFlash NVMe devices (the arithmetic is sketched below), and that gets progressively worse as more write workload is added to the system. In the absence of evidence to the contrary .. I’m calling Pure’s statements like
“remove all the legacy protocols out of the array first, as this is where the biggest bottle neck exists”
as PURE-TRIPE***, as they appear to have much bigger and far worse problems with their “Built for slow from the ground up” architecture; an architecture that they seem to be stuck with, if their lacklustre six percent improvement in SLOB results between the //M70 and the //X90 is any indication.
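For what it's worth, the latency arithmetic above can be laid out in a few lines. The figures are my reading of the published results quoted in this post, and the split between a cached read and a media fetch is an inference on my part, not something Pure has disclosed.

```python
# Sketch of the latency decomposition argued above, using the figures quoted
# in this post. These are readings of published results, not measurements,
# and the split between "cached read" and "media fetch" is an inference.

CACHED_READ_US = 300   # best published small-block read latency (data in DRAM)
FIO_READ_US    = 700   # published FIO read latency at light load
CLAIMED_MIN_US = 250   # "as low as 250 microseconds" marketing claim
NETAPP_A800_US = 80    # NVMe-oF cached read latency quoted for the A800

media_fetch_us = FIO_READ_US - CACHED_READ_US
print(f"Implied time to fetch a block from media into DRAM: ~{media_fetch_us} us")
print(f"Gap between best cached read and the marketing claim: "
      f"{CACHED_READ_US - CLAIMED_MIN_US} us before media is even touched")
print(f"Cached read latency ratio vs the A800: "
      f"{CACHED_READ_US / NETAPP_A800_US:.1f}x")
```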
Pure should prove they can achieve their marketing claims, or apologise for misleading the tech community
Given the amount of business smack-talk they’ve been throwing around, I politely contend that they should put up or shut up. Even today I had someone tell me that a Pure reseller was claiming they could achieve 150 microseconds, which, based on Pure's own published results, I'm calling PURE-TRIPE***. I strongly believe they should either prove they can "walk the walk" as well as they "talk the talk", or retract their claims of superior performance from DirectFlash or future Optane enhancements made here, here, here and, most importantly, here
Furthermore, they should issue a public apology to the tech media who have reported their claims in good faith, and do so without wriggling out of it by using weasel words like “as low as”.
In my next blog, I'll talk a little about the results of less optimistic performance testing, and what kind of storage performance you should expect from public and private cloud deployments.
* DRAM sizes are not disclosed by Pure; I'm using data from https://www.rajeshvu.com/storage/pure/articles/pure-flasharray-models which seems credible at face value
** Database size not disclosed; I'm assuming the same size and methodology as the Cisco CVD
*** TRIPE = Technically Risible Inaccuracy Propagation Engineering
Comment: Principal Systems Engineer, Global Technology Office (4 years ago)
John, thanks for the article. It is a very good read. From my own observations:
1. Pure does not cache reads, but they do hold metadata in RAM (which isn't infinite in capacity);
2. The "built from the ground up" architecture is a single-active-controller system that cannot scale (except in capacity); this introduces a performance bottleneck, which is why the performance improvement between generations is so small;
3. They removed flash management from dedicated chips on the SSDs and put it on the single active controller: the more updates you do, the closer you get to a disaster;
4. Their marketing is really innovative.