CPU - The New Bottleneck? (Expanded Edition)
Jeffrey Slapp
Making AI Possible | Next Generation Open Standards Data Platform for Block, File, Object, and AI Workloads - Any Hardware at Any Scale
INTRODUCTION
An interesting phenomenon is occurring in the relationship between the application, the CPU, and the I/O (most notably where the data resides). Prior to modern parallel I/O processing, the largest bottleneck in the I/O stack was unquestionably the storage subsystem. Storage devices are, at best, orders of magnitude slower than the CPU (where the I/O demand is generated); the channels to those storage devices are limited; and the storage devices themselves (which respond to the I/O requests) sit at the point in the stack furthest from where the I/O is generated. However, when an architecture handles both the generation of the I/O and the response to it at the same point in the stack (the CPU), the bottleneck moves to the CPU itself, as we will explore in this article.
Don't worry though, the situation isn't as dire as it sounds. There will always be a relative bottleneck somewhere in the system, but when the latency of the slowest component approaches that of the fastest component, efficiency increases significantly system-wide. If you are going to have a bottleneck anywhere in the system, I would argue it's best to have it at the CPU, because you want the component doing the heavy lifting to lift as much and as often as possible (unless your application is broken, the work being done is, or should be, useful).
CORRELATION: WORK PER UNIT TIME AND I/O LATENCY
Application I/O demands within an architecture tend to increase either with the introduction of sustained high-intensity workloads such as Online Transaction Processing (OLTP), with an increase in the number of workloads running concurrently, or, in the worst case, both. Certainly virtualization technologies such as VMware ESX and Microsoft Hyper-V have contributed to concurrency. In either scenario, however,
the measurement of application productivity or work completed per unit time is inversely proportional to the latency between the source where the I/O request is generated (the CPU) and where the I/O request is being fulfilled (the storage system).
In other words, the less latency there is between the CPU and the storage, the more work can be completed in a given period of time. Also interesting to note: the latency I refer to is not simply the storage media response time; it also includes the latency introduced by round-trip signal propagation delay. I/O requests must traverse the many layers which exist between the CPU and the end-point storage system, and back again, in order to reach I/O completion.
Simply put, if we can close the distance between where the I/O is generated (the CPU) and the storage while simultaneously improving storage media response time, we may have something very useful.
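To make the relationship concrete, here is a minimal back-of-envelope sketch (the latency figures are hypothetical assumptions, not measurements from any particular system) showing how the work a single outstanding I/O stream completes per unit time is inversely proportional to the total end-to-end latency:

```python
# Back-of-envelope model: work per unit time vs. end-to-end I/O latency.
# All latency figures below are illustrative assumptions, not measurements.

def iops_per_stream(media_latency_us: float, round_trip_us: float) -> float:
    """IOPS a single outstanding I/O stream can sustain when each request
    must wait for media response time plus round-trip propagation delay."""
    total_latency_s = (media_latency_us + round_trip_us) / 1_000_000
    return 1.0 / total_latency_s

# Traditional stack: storage far from the CPU, slower media.
print(f"{iops_per_stream(media_latency_us=500, round_trip_us=100):,.0f} IOPS")  # ~1,667

# Storage processing co-located with the CPU, fast media.
print(f"{iops_per_stream(media_latency_us=30, round_trip_us=5):,.0f} IOPS")     # ~28,571
```

Cutting both terms of that latency, which is exactly what co-locating storage processing with the application aims to do, multiplies the work a single stream can complete in the same period of time.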
Let's use a hypothetical model where the storage system is infinitely fast and is running so close to the CPU that the round-trip latency is zero. In this scenario, the limiting factor would now become the CPU itself, whereby the CPU could potentially be 90-100% utilized by the application(s) (even if only for short periods of time) because there is no delay in I/O processing.
While this may sound problematic, it really isn't. Remember, in today's typical enterprise server architectures you will find as many as 192 logical processors in a single server, with processor counts increasing roughly 20% each year. If the time delta between when the application generates the I/O and when the storage system processes it is very narrow (as it is in a parallel I/O system, which we will explore shortly), then it really makes no difference which one is the bottleneck, because their latency delta is extremely narrow (certainly narrower than that of a non-parallel system).
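As a rough illustration of why this matters, the sketch below (again using purely hypothetical figures) models a task's completion time as CPU work plus time spent waiting on I/O; as the per-I/O wait approaches zero, completion time converges on the CPU time alone and the CPU becomes the limiting component:

```python
# Hypothetical model of task completion time: CPU work plus I/O wait.
# Figures are illustrative assumptions only.

def task_time_s(cpu_time_s: float, io_count: int, io_wait_us: float) -> float:
    """Total completion time when the CPU must wait io_wait_us per I/O."""
    return cpu_time_s + io_count * (io_wait_us / 1_000_000)

CPU_TIME = 10.0          # seconds of pure compute in the task
IO_COUNT = 2_000_000     # I/O requests issued by the task

print(task_time_s(CPU_TIME, IO_COUNT, io_wait_us=500))  # 1010.0 s, storage-bound
print(task_time_s(CPU_TIME, IO_COUNT, io_wait_us=35))   # 80.0 s, the gap narrows
print(task_time_s(CPU_TIME, IO_COUNT, io_wait_us=0))    # 10.0 s, CPU-bound limit
```

In the last case the CPU is busy nearly the entire time, which is precisely the "CPU as the new bottleneck" behavior described above.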
Also worth noting: with a parallel I/O system, the number of CPUs engaged will depend on how much storage I/O is being processed at the time. If the workload is compute-heavy, the CPUs are free to process the compute functions without being interrupted by storage processing, since storage processing during that period is at a minimum. This is the dynamic nature of Parallel I/O.
TASK-TIME COMPLETION
Below is an illustration showing the impact an end-to-end parallel I/O system can have on the CPU utilization pattern and, most importantly, on the task completion time. This example shows a relative 4x improvement in task completion time for a singular task with parallel I/O processing. Generally speaking, it is not uncommon to see time-to-completion reduced by more than an order of magnitude, depending on workload pattern and availability of resources.
In reality, there is latency between the CPU and the storage system regardless of where the storage actually resides. And while an infinitely fast storage system does not exist, we do have technology today which provides parallel I/O processing so fast that the CPU does in fact become the limiting component, just as in the hypothetical scenario above.
With end-to-end parallel I/O processing, that is, where storage I/O processing is occurring on the CPU along with the application which is generating the I/O, the CPU becomes the new bottleneck. This is precisely what we observe in the real world.
Below is a screenshot of a very basic IOmeter test running in a virtual machine. The virtual machine has DataCore Parallel I/O technology installed. In this case the workload is highly parallel (driven by many workers across many CPUs) and the storage processing is also highly parallel and spread across the same CPUs. The workload is a 90% read, 10% write, 100% random 8k block pattern.
While these are very impressive performance numbers from inside a virtual machine (in particular the 1.03M IOPS at 35 µs), the main point I want to draw out is this: the CPU is the component preventing us from achieving even higher performance levels.
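For readers who want to approximate a similar access pattern themselves, here is a deliberately simplified sketch of a 100% random, 8 KiB, 90% read / 10% write workload against a test file. It is not Iometer and will not reproduce the numbers above (it is single-threaded, uses buffered I/O, and the file path and sizes are assumptions), but it shows the shape of the workload being described:

```python
# Simplified single-threaded imitation of the workload described above:
# 100% random, 8 KiB transfers, 90% reads / 10% writes.
# "test.dat" and all sizes/durations are assumptions for illustration only.
# os.pread/os.pwrite require a POSIX platform.
import os
import random
import time

BLOCK = 8 * 1024                 # 8 KiB transfer size
FILE_SIZE = 1 * 1024**3          # 1 GiB test file
DURATION = 10                    # seconds to run
PATH = "test.dat"

# Pre-allocate the test file.
with open(PATH, "wb") as f:
    f.truncate(FILE_SIZE)

fd = os.open(PATH, os.O_RDWR)
payload = os.urandom(BLOCK)
blocks = FILE_SIZE // BLOCK

ops = 0
deadline = time.monotonic() + DURATION
while time.monotonic() < deadline:
    offset = random.randrange(blocks) * BLOCK    # random block-aligned offset
    if random.random() < 0.9:                    # 90% reads
        os.pread(fd, BLOCK, offset)
    else:                                        # 10% writes
        os.pwrite(fd, payload, offset)
    ops += 1

os.close(fd)
print(f"{ops / DURATION:,.0f} IOPS (buffered, single thread)")
```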
SIGNIFICANT IMPLICATIONS
In parallel I/O systems, for a given workload, the CPU utilization pattern tends to change from a longer, less-intensive pattern to a shorter, more-intensive burst pattern. The result is the same amount of work completed in a shorter period of time. Interestingly, in most environments the workload trend generally increases over time (i.e., higher concurrency), demanding more and more of the CPU at higher peak utilization. Previously inaccessible CPU cycles are now generally available, since the storage system is effectively out of the application's way. We see a correlation between this observed behavior and Gustafson's Law:
"One does not take a fixed-size problem and run it on various numbers of processors except when doing academic research; in practice, the problem size scales with the number of processors. When given a more powerful processor, the problem generally expands to make use of the increased facilities... Hence, it may be most realistic to assume that run time, not problem size, is constant."
It seems in our world the amount of work to do is always increasing, albeit at different rates. As Gustafson stated, it is more realistic to assume that run time, or workload processing time, stays constant while the problem size, or workload demand, increases.
Within the context of Gustafson's Law, instead of a singular task simply completing faster, we can now illustrate the task-time completion curve as multiple tasks completing in the same amount of time as the singular task took before Parallel I/O was introduced. More work completed in the same amount of time translates into an increase in overall system efficiency.
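For reference, Gustafson's scaled speedup can be written as follows, where s is the fraction of the run time spent in serial work, 1 - s the fraction spent in parallelizable work, and N the number of processors; the notation here is the standard textbook form rather than anything taken from the article:

```latex
% Gustafson's Law (scaled speedup), with run time held constant
% s     : serial fraction of the scaled workload's run time
% 1 - s : parallelizable fraction
% N     : number of processors
S(N) = s + (1 - s)\,N = N - (N - 1)\,s
```

With run time held fixed, the amount of work completed grows nearly linearly with N when the serial fraction is small, which is exactly the "more tasks in the same time" picture described above.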
CONCLUSION
Compute hypervisor technologies such as VMware certainly made multicore processors justifiable, allowing more workloads (i.e., VMs) to run on the same platform. However, this only further aggravated serial I/O processing bottlenecks at the storage layer while leaving many CPU cores underutilized.
Parallel I/O technology brings a totally new dimension to the demand for more cores within the CPU architecture. As I have said many times before, we now live in a world where highly parallel application layers are coupled with a very powerful parallel storage processing layer (see Parallel Application Meets Parallel Storage). This new paradigm will most certainly justify higher-density CPUs going forward. The good news is that processor manufacturers show no signs of slowing the progression of CPU core density.
A recent article from 451 Research further explains the impact of Parallel I/O technology on our world: DataCore looks to push I/O processing through the roof for all applications