How Many IOPS Do You Need For Real-World Storage Performance?

We hear lots of hype today about millions of IOPS from someone’s latest flash offering. It’s true that these units are very fast, but the devil is in the details: in practice, the products often deliver much weaker performance than the marketing would lead you to expect. That’s because most vendors measure their performance using highly tuned benchmark software rather than real workloads.

What eats up all of that performance? In the real world, events are not as smoothly sequenced as they are in a benchmark. Data requests are not spread evenly over all the storage drives, nor are they spread evenly in time. I/O goes where the apps direct it, which means some files get far more access than others, making the drives they sit on work hard while leaving other drives nearly idle.

Bottlenecks of this type are recognized as a major issue and can bring real-world performance down by large factors. Hard-drive RAID arrays attempted to spread the data over as many drives as possible, but the approach was typically limited by the small number of drives in a typical virtual drive (LUN, or Logical Unit Number) and, more importantly, by the fact that spreading out the load created many more I/O operations, each of which consumed a piece of the available performance. The net result was only a limited recovery of the theoretical performance.

Fast forward to using flash/SSD instead of spinning rust. The new drives have little access latency beyond network transit and address-computation time, so much of the penalty for spreading data out goes away. There is still a cost for handling 10 I/Os instead of 1, incurred as processing time in the drive and the host, but newer data-integrity approaches such as erasure coding manage the spreading of data more efficiently, too.
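To see why splitting a request into many I/Os is not free, here is a minimal sketch of the trade-off. All of the numbers (per-I/O overhead, drive throughput, request size) are assumptions chosen for illustration, not measured figures:

```python
# Illustrative cost model for splitting one request into N parallel I/Os.
# Every figure below is an assumption for the sketch, not a measured value.

PER_IO_OVERHEAD_US = 20.0      # fixed host + drive processing cost per I/O (assumed)
DRIVE_READ_US_PER_MB = 250.0   # time for one drive to stream 1 MB (assumed)
REQUEST_MB = 8.0               # size of the application request (assumed)

def read_time_us(num_ios: int) -> float:
    """Service time when the request is striped across num_ios drives.

    The data transfer parallelizes, but each extra I/O adds fixed overhead.
    """
    transfer = (REQUEST_MB / num_ios) * DRIVE_READ_US_PER_MB
    overhead = PER_IO_OVERHEAD_US * num_ios
    return transfer + overhead

for n in (1, 2, 4, 10, 20):
    print(f"{n:2d} I/Os: {read_time_us(n):7.1f} us")
```

Under these assumptions, striping helps at first and then the per-I/O overhead starts to dominate, which is exactly the limited recovery described above.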

Overall, then, we can avoid the drive bottleneck much more easily with SSD or flash, though only if data is spread over many drives.

The spiky nature of I/O is a different problem. A file may be quiescent one minute and hammered the next. It’s the nature of storage that usage brings all the elements of an app together in the same LUN or bucket, and traditionally the LUN was restricted to a fixed set of drives. This access pattern can leave much of the I/O pool idling while the remaining drives are pinned at their drive or interface maximums, capping the performance of the appliance cluster or flash unit. The reduction in effective throughput can be large: if apps are hitting a 6-drive LUN in a 60-drive array, the maximum performance is just 10% of what the array could deliver.
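The arithmetic behind that 10% figure is worth spelling out. A quick sketch, assuming a uniform per-drive IOPS limit (a simplification, and the per-drive figure is invented):

```python
# Hot-LUN ceiling: traffic concentrated on a subset of drives caps
# throughput at that subset's share of the array. Figures are assumed.

DRIVE_IOPS = 50_000    # per-drive limit (assumed)
TOTAL_DRIVES = 60
HOT_LUN_DRIVES = 6     # drives the apps are actually hitting

array_max = DRIVE_IOPS * TOTAL_DRIVES      # what the box could deliver
effective = DRIVE_IOPS * HOT_LUN_DRIVES    # what the hot LUN can deliver

print(f"ceiling: {effective / array_max:.0%} of array potential")  # -> 10%
```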

Again, one solution is spreading data out. If the access pattern is extending run times significantly, creating replicas of the hot files or using wider erasure-code settings can help, but analyzing the bottlenecks and spreading the files over more drive sets is a better solution.

All of this presupposes that the storage pool is working properly. Even a single slow or failed drive can throw a wrench in the works. Take an erasure-coded object, typically spread over a 10+6 layout. In reality, this is 16 drives with an effective capacity of 10, but the protection scheme allows 6 drives, or the 6 appliances that host them, to fail before data is lost.
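In code, the 10+6 arithmetic looks like this. A minimal sketch: the k and m values come from the example above, while the class itself is just an illustration, not any particular product’s API:

```python
from dataclasses import dataclass

@dataclass
class ErasureScheme:
    """k data shards plus m parity shards, as in the 10+6 example."""
    k: int  # data shards
    m: int  # parity shards

    @property
    def total_drives(self) -> int:
        return self.k + self.m             # 16 drives touched per object

    @property
    def storage_efficiency(self) -> float:
        return self.k / self.total_drives  # usable fraction of raw capacity

    @property
    def failures_tolerated(self) -> int:
        return self.m                      # any m shards can be lost

scheme = ErasureScheme(k=10, m=6)
print(scheme.total_drives, f"{scheme.storage_efficiency:.1%}", scheme.failures_tolerated)
# -> 16 62.5% 6
```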

When data is read, all 16 drives are accessed. If one is running slow, it doesn’t matter how fast the others are: the slow drive determines the access latency. Worse, it typically won’t be flagged by today’s storage software, so the app simply chokes.
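This “slowest drive wins” effect is easy to demonstrate. A minimal simulation, with assumed latency figures for healthy and degraded drives:

```python
import random

# A striped read completes only when the last shard arrives, so its
# latency is the MAX of the per-drive latencies. Figures are assumed.

HEALTHY_LATENCY_MS = (0.1, 0.3)   # typical healthy-SSD read range (assumed)
SLOW_DRIVE_LATENCY_MS = 25.0      # one silently degraded drive (assumed)

def stripe_read_ms(num_drives: int = 16, slow_drives: int = 0) -> float:
    latencies = [random.uniform(*HEALTHY_LATENCY_MS)
                 for _ in range(num_drives - slow_drives)]
    latencies += [SLOW_DRIVE_LATENCY_MS] * slow_drives
    return max(latencies)  # the stripe is as slow as its slowest member

print(f"all 16 healthy: {stripe_read_ms():.2f} ms")
print(f"one slow drive: {stripe_read_ms(slow_drives=1):.2f} ms")
```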

A drive failure has even more effect. Though up to 6 drives in the example can fail, when just 1 goes down the erasure group has to be reconstructed from the remaining 15 drives, and this slows I/O. The system will recover automatically by provisioning a replacement drive and copying the data back, but the rebuild is slow.
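How slow? A rough estimate, where both the drive capacity and the sustained rebuild rate are assumptions (real rates depend on throttling, drive load, and network limits):

```python
# Rough rebuild-time estimate. Both figures are assumptions for the
# sketch, not measurements from any particular system.

DRIVE_CAPACITY_GB = 4000    # size of the failed drive (assumed)
REBUILD_RATE_MB_S = 100     # sustained reconstruction rate (assumed)

rebuild_seconds = DRIVE_CAPACITY_GB * 1000 / REBUILD_RATE_MB_S
print(f"rebuild time: {rebuild_seconds / 3600:.1f} hours")  # ~11.1 hours
```

During that window, every read of the affected group is a degraded read that must be reconstructed from parity.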

Companies such as Enmotus have realized that significant analytics and automation can make much of the pain of bottlenecks and slow or failed drives go away. They are taking a leading role in setting a direction and standards of operation. Enmotus is building a framework for future software development, applicable to virtualized clusters and private and public clouds.

The first step is real-time monitoring of performance and events, but the crucial step is to apply a variety of analytic approaches to the problem.

Consider this a Big Data problem with a relatively large flow of metrics. There are both structured and unstructured queries to be performed on this datastream, so SQL and Big Data tools make sense as part of a monitoring toolset. Incidentally, treating this as a datastream, much like IoT telemetry, keeps the microservices open-ended and extensible and will encourage ISVs to enter the market.
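As a hypothetical illustration of what one such query might do, this sketch flags a silently slow drive from a sample of per-drive latency metrics. The drive names, sample values, and the median-based threshold are all invented for the example:

```python
from statistics import median

# Hypothetical metrics sample: per-drive read latency in ms, as it might
# arrive from a real-time monitoring stream. Names and values are invented.
latency_ms = {f"drive-{i:02d}": 0.2 for i in range(16)}
latency_ms["drive-07"] = 25.0   # the silently degraded drive

def flag_outliers(samples: dict[str, float], factor: float = 5.0) -> list[str]:
    """Flag drives whose latency exceeds `factor` times the pool median."""
    baseline = median(samples.values())
    return [d for d, ms in samples.items() if ms > factor * baseline]

print(flag_outliers(latency_ms))   # -> ['drive-07']
```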

The next part of the toolset is an interface to storage orchestration. This interface allows a much more rapid response to the cluster’s issues, avoiding bottlenecks quickly by adding resources or even moving datasets around the storage pool.

The automation feature also allows policy-driven operations in storage, such as elevating a dataset to a higher, faster storage tier in anticipation of an app using the data. This can be driven by timestamps or, more elegantly, by AI approaches to analyzing storage usage.
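A timestamp-driven version of such a policy might look something like the following sketch. The tier names, the 24-hour threshold, and the function itself are hypothetical, invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical timestamp-driven tiering policy. The tier names and the
# 24-hour hot window are invented for the sketch, not a product feature.

HOT_WINDOW = timedelta(hours=24)

def tier_for(last_access: datetime, now: datetime) -> str:
    """Place recently used data on flash, cold data on capacity drives."""
    if now - last_access < HOT_WINDOW:
        return "nvme-flash"
    return "capacity-hdd"

now = datetime.now()
print(tier_for(now - timedelta(hours=2), now))   # -> nvme-flash
print(tier_for(now - timedelta(days=30), now))   # -> capacity-hdd
```

An AI-driven variant would replace the fixed threshold with a model that predicts when an app is about to hit a dataset and pre-stages it accordingly.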

The beauty of this analytics approach is that it facilitates and encourages third-party software developers to create a rich ecosystem of tools. This will improve the value of the storage pool and move performance much closer to the theoretical values seen in benchmarks.

Today’s storage appliances and software are heading in the direction of bottleneck detection and remediation.
