CPU - The New Bottleneck? (Expanded Edition)
Jeffrey Slapp
Making AI Possible | Next Generation Open Standards Data Platform for Block, File, Object, and AI Workloads - Any Hardware at Any Scale
INTRODUCTION
An interesting phenomenon is occurring in the relationship between the application, the CPU, and the I/O (most notably where the data resides). Prior to modern parallel I/O processing, the largest bottleneck in the I/O stack was unquestionably the storage subsystem. Storage devices are, at best, orders of magnitude slower than the CPU (where the I/O demand is generated); the channels to those storage devices are limited; and the storage devices themselves (which respond to the I/O requests) sit at the point in the stack furthest from where the I/O is generated. However, when an architecture handles both the generation of the I/O and the response to it at the same point in the stack (the CPU), the bottleneck moves to the CPU itself, as we will explore in this article.
Don't worry though, the situation isn't as dire as it sounds. There will always be a relative bottleneck somewhere in the system, but when the latency of the slowest component approaches that of the fastest component, efficiency increases significantly system-wide. If you are going to have a bottleneck anywhere in the system, I would argue it's best to have it at the CPU, because you want the component doing the heavy lifting to lift as much and as often as possible (unless your application is broken, the work being done is, or should be, useful).
CORRELATION: WORK PER UNIT TIME AND I/O LATENCY
Application I/O demands within an architecture tend to increase either with the introduction of sustained high-intensity workloads such as Online Transaction Processing (OLTP), with an increase in the number of workloads running concurrently, or, in the worst case, both. Certainly virtualization technologies such as VMware ESX and Microsoft Hyper-V have contributed to concurrency. In either scenario, however,
the measurement of application productivity or work completed per unit time is inversely proportional to the latency between the source where the I/O request is generated (the CPU) and where the I/O request is being fulfilled (the storage system).
In other words, the less latency there is between the CPU and the storage, the more work can be completed in a given period of time. Also interesting to note: the latency I refer to is not simply the storage media response time; it also includes the latency introduced by round-trip signal propagation delay. I/O requests must traverse the many layers which exist between the CPU and the end-point storage system, and back again, in order to reach I/O completion.
Simply put, if we can close the distance between where the I/O is generated (the CPU) and the storage while simultaneously improving storage media response time, we may have something very useful.
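To make the relationship concrete, here is a minimal back-of-envelope sketch (the latency figures are hypothetical assumptions, not measurements from any particular system) showing how the work a single outstanding I/O stream completes per unit time is inversely proportional to the total end-to-end latency:

```python
# Back-of-envelope model: work per unit time vs. end-to-end I/O latency.
# All latency figures below are illustrative assumptions, not measurements.

def iops_per_stream(media_latency_us: float, round_trip_us: float) -> float:
    """IOPS a single outstanding I/O stream can sustain when each request
    must wait for media response time plus round-trip propagation delay."""
    total_latency_s = (media_latency_us + round_trip_us) / 1_000_000
    return 1.0 / total_latency_s

# Traditional stack: storage far from the CPU, slower media.
print(f"{iops_per_stream(media_latency_us=500, round_trip_us=100):,.0f} IOPS")  # ~1,667

# Storage processing co-located with the CPU, fast media.
print(f"{iops_per_stream(media_latency_us=30, round_trip_us=5):,.0f} IOPS")     # ~28,571
```

Cutting both terms of that latency, which is exactly what co-locating storage processing with the application aims to do, multiplies the work a single stream can complete in the same period of time.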
Let's use a hypothetical model where the storage system is infinitely fast and is running so close to the CPU that the round-trip latency is zero. In this scenario, the limiting factor would now become the CPU itself, whereby the CPU could potentially be 90-100% utilized by the application(s) (even if only for short periods of time) because there is no delay in I/O processing.
While this may sound problematic, it really isn't. Remember, in today's typical enterprise server architectures you will find as many as 192 logical processors in a single server, with processor counts increasing roughly 20% each year. If the time delta between when the application generates the I/O and when the storage system processes it is very narrow (as it is in a parallel I/O system, which we will explore shortly), then it really makes no difference which one is the bottleneck, because their latency delta is extremely narrow (certainly narrower than that of a non-parallel system).
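As a rough illustration of why this matters, the sketch below (again using purely hypothetical figures) models a task's completion time as CPU work plus time spent waiting on I/O; as the per-I/O wait approaches zero, completion time converges on the CPU time alone and the CPU becomes the limiting component:

```python
# Hypothetical model of task completion time: CPU work plus I/O wait.
# Figures are illustrative assumptions only.

def task_time_s(cpu_time_s: float, io_count: int, io_wait_us: float) -> float:
    """Total completion time when the CPU must wait io_wait_us per I/O."""
    return cpu_time_s + io_count * (io_wait_us / 1_000_000)

CPU_TIME = 10.0          # seconds of pure compute in the task
IO_COUNT = 2_000_000     # I/O requests issued by the task

print(task_time_s(CPU_TIME, IO_COUNT, io_wait_us=500))  # 1010.0 s, storage-bound
print(task_time_s(CPU_TIME, IO_COUNT, io_wait_us=35))   # 80.0 s, the gap narrows
print(task_time_s(CPU_TIME, IO_COUNT, io_wait_us=0))    # 10.0 s, CPU-bound limit
```

In the last case the CPU is busy nearly the entire time, which is precisely the "CPU as the new bottleneck" behavior described above.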
Also worth noting: with a parallel I/O system, the number of CPUs engaged will depend on how much storage I/O is being processed at the time. If the workload is compute-heavy, the CPUs are free to process the compute functions without being interrupted by storage processing, since storage processing during that period is at a minimum. This is the dynamic nature of Parallel I/O.
TASK-TIME COMPLETION
Below is an illustration showing the impact an end-to-end parallel I/O system can have on the CPU utilization pattern and, most importantly, on the task completion time. This example shows a relative 4x improvement in task completion time for a singular task with parallel I/O processing. Generally speaking, it is not uncommon to see time-to-completion reduced by more than an order of magnitude, depending on workload pattern and availability of resources.
In reality, there is latency between the CPU and the storage system regardless of where the storage actually resides. And while an infinitely fast storage system does not exist, we do have technology today which provides parallel I/O processing so fast that the CPU does in fact become the limiting component, just as in the hypothetical scenario above.
With end-to-end parallel I/O processing, that is, where storage I/O processing is occurring on the CPU along with the application which is generating the I/O, the CPU becomes the new bottleneck. This is precisely what we observe in the real world.
Below is a screenshot of a very basic IOmeter test running in a virtual machine. The virtual machine has DataCore Parallel I/O technology installed. In this case the workload is highly parallel (driven by many workers across many CPUs) and the storage processing is also highly parallel and spread across the same CPUs. The workload is a 90% read, 10% write, 100% random 8k block pattern.
While these are very impressive performance numbers from inside a virtual machine (in particular the 1.03M IOPS at 35 µs), the main point I want to draw out is this: the CPU is the component preventing us from achieving even higher performance levels.
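For readers who want to approximate a similar access pattern themselves, here is a deliberately simplified sketch of a 100% random, 8 KiB, 90% read / 10% write workload against a test file. It is not Iometer and will not reproduce the numbers above (it is single-threaded, uses buffered I/O, and the file path and sizes are assumptions), but it shows the shape of the workload being described:

```python
# Simplified single-threaded imitation of the workload described above:
# 100% random, 8 KiB transfers, 90% reads / 10% writes.
# "test.dat" and all sizes/durations are assumptions for illustration only.
# os.pread/os.pwrite require a POSIX platform.
import os
import random
import time

BLOCK = 8 * 1024                 # 8 KiB transfer size
FILE_SIZE = 1 * 1024**3          # 1 GiB test file
DURATION = 10                    # seconds to run
PATH = "test.dat"

# Pre-allocate the test file.
with open(PATH, "wb") as f:
    f.truncate(FILE_SIZE)

fd = os.open(PATH, os.O_RDWR)
payload = os.urandom(BLOCK)
blocks = FILE_SIZE // BLOCK

ops = 0
deadline = time.monotonic() + DURATION
while time.monotonic() < deadline:
    offset = random.randrange(blocks) * BLOCK    # random block-aligned offset
    if random.random() < 0.9:                    # 90% reads
        os.pread(fd, BLOCK, offset)
    else:                                        # 10% writes
        os.pwrite(fd, payload, offset)
    ops += 1

os.close(fd)
print(f"{ops / DURATION:,.0f} IOPS (buffered, single thread)")
```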
SIGNIFICANT IMPLICATIONS
In parallel I/O systems, for a given workload, the CPU utilization pattern tends to change from a longer, less-intensive pattern to a shorter, more-intensive burst pattern. The result is the same amount of work completed in a shorter period of time. Interestingly, in most environments the workload trend generally increases over time (i.e., higher concurrency), demanding more and more of the CPU at higher peak utilization. Previously inaccessible CPU cycles are now generally available, since the storage system is effectively out of the application's way. We see a correlation between this observed behavior and Gustafson's Law:
"One does not take a fixed-size problem and run it on various numbers of processors except when doing academic research; in practice, the problem size scales with the number of processors. When given a more powerful processor, the problem generally expands to make use of the increased facilities... Hence, it may be most realistic to assume that run time, not problem size, is constant."
It seems in our world the amount of work to do is always increasing, albeit at different rates. As Gustafson stated, it is more realistic to assume that run time, or workload processing time, stays constant while the problem size, or workload demand, increases.
Within the context of Gustafson's Law, instead of a singular task simply completing faster, we can now illustrate the task-time completion curve as multiple tasks completing in the same amount of time as the singular task took before Parallel I/O was introduced. More work completed in the same amount of time translates into an increase in overall system efficiency.
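For reference, Gustafson's scaled speedup can be written as follows, where s is the fraction of the run time spent in serial work, 1 - s the fraction spent in parallelizable work, and N the number of processors; the notation here is the standard textbook form rather than anything taken from the article:

```latex
% Gustafson's Law (scaled speedup), with run time held constant
% s     : serial fraction of the scaled workload's run time
% 1 - s : parallelizable fraction
% N     : number of processors
S(N) = s + (1 - s)\,N = N - (N - 1)\,s
```

With run time held fixed, the amount of work completed grows nearly linearly with N when the serial fraction is small, which is exactly the "more tasks in the same time" picture described above.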
CONCLUSION
Compute hypervisor technologies such as VMware certainly made multicore processors justifiable, allowing more workloads (i.e., VMs) to run on the same platform. However, this only further aggravated serial I/O processing bottlenecks at the storage layer while leaving many CPU cores underutilized.
Parallel I/O technology brings a totally new dimension to the demand for more cores within the CPU architecture. As I have said many times before, we now live in a world where highly parallel application layers are coupled with a very powerful parallel storage processing layer (see Parallel Application Meets Parallel Storage). This new paradigm will most certainly justify higher-density CPUs going forward. The good news is that processor manufacturers show no signs of slowing the progression of CPU core density.
A recent article from 451 Research further explains the impact of Parallel I/O technology on our world: DataCore looks to push I/O processing through the roof for all applications