This is a blog post originally from Jeff Slapp, director, systems engineering and solution architecture at DataCore Software. See his full post here: https://www.linkedin.com/pulse/cpu-new-bottleneck-jeffrey-slapp
An interesting phenomenon is occurring with the relationship between the application, the CPU, and the I/O (most notably where the data resides). Prior to
modern parallel I/O processing, the largest bottleneck which existed in the I/O stack was unquestionably the storage sub-system. Storage devices are at
best many orders of magnitude slower than the CPU (where the I/O demand is generated), the channels to those storage devices are limited, and the storage
devices themselves (which respond to the I/O requests) reside at a point in the stack furthest from the source where the I/O is generated. However, when
you have an architecture which handles both the generated I/O and the response to the I/O at the same point in the stack (the CPU), the bottleneck now
moves to the CPU itself, as we will explore in this article.
Don’t worry though, the situation isn’t as dire as it sounds; there will always be a relative bottleneck somewhere in the system, but when the latency of
the slowest component approaches that of the fastest component, the efficiency increases significantly system-wide. If you are going to have a bottleneck
anywhere in the system, I would argue it’s best to have it at the CPU because you want the component which is doing the heavy lifting to lift as much and as
often as possible (unless your application is broken, the work which is being done is, or should be, useful).
CORRELATION: WORK PER UNIT TIME AND I/O LATENCY
Application I/O demands within an architecture tend to increase either due to the introduction of sustained high-intensity workloads such as Online
Transaction Processing (OLTP) or an increase in the number of workloads running concurrently, or worst case, both. Certainly virtualization technologies
such as VMware ESX and Microsoft Hyper-V have contributed to concurrency. In either scenario, however, the following relationship holds:
The measurement of application productivity, or work completed per unit time, is inversely proportional to the latency between the source where the I/O
request is generated (the CPU) and where the I/O request is fulfilled (the storage system).
In other words, the less latency there is between the CPU and the storage, the more work can be completed in a given period of time. It is also worth
noting that the latency I refer to is not simply the storage media response time; it also includes the latency introduced by
round-trip signal propagation delay. I/O requests must traverse the many layers which exist between the CPU, the end-point storage system, and back again
in order to accomplish I/O completion.
Simply put, if we can close the distance between where the I/O is generated (the CPU) and the storage while simultaneously improving storage media
response time, we may have something very useful.
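To make the inverse relationship concrete, here is a minimal sketch. It assumes a very simple model that is mine, not a measurement: each outstanding request is reissued the moment it completes, so work per unit time is simply the number of outstanding requests divided by the round-trip latency. The function names and the latency figures are hypothetical and chosen only for illustration.

```python
def round_trip_latency(media_s: float, transit_s: float) -> float:
    """Total latency = storage media response time + round-trip transit delay."""
    return media_s + transit_s


def ops_per_second(latency_s: float, outstanding: int = 1) -> float:
    """I/O operations completed per second for a given round-trip latency.

    Assumption: each of `outstanding` requests is reissued as soon as it
    completes, so throughput = outstanding / latency.
    """
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return outstanding / latency_s


# Hypothetical figures: same media, but the storage sits closer to the CPU.
far = round_trip_latency(media_s=400e-6, transit_s=600e-6)   # about 1 ms
near = round_trip_latency(media_s=400e-6, transit_s=100e-6)  # about 0.5 ms

print(f"far : {ops_per_second(far):,.0f} ops/s")
print(f"near: {ops_per_second(near):,.0f} ops/s")
```

Halving the total latency doubles the work completed per second in this model, even though the storage media itself did not get any faster.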
Let’s use a hypothetical model where the storage system is infinitely fast and is running so close to the CPU that the round-trip latency is zero. In this
scenario, the limiting factor would now become the CPU itself, whereby the CPU could potentially be 90-100% utilized by the application(s) (even if only
for short periods of time) because there is no delay in I/O processing.
While this may sound problematic, it really isn’t. Remember, in today’s typical enterprise server architectures, you will find as many as 192 logical
processors in a single server, with the number of processors increasing 20% each year. If the time delta between when the application I/O is generated and
the storage system processing the I/O is very narrow (as it would be in a parallel I/O system, which we will explore shortly), then it really makes no
difference which one is the bottleneck because their latency delta is extremely narrow (certainly narrower than that of a non-parallel system).
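The hypothetical model above can be sketched in a few lines. This assumes a deliberately simple serial model of my own: every operation costs a fixed amount of compute followed by a fixed I/O wait, and CPU utilization is the compute share of the total. The function name and the timings are illustrative assumptions, not measurements.

```python
def cpu_utilization(cpu_s_per_op: float, io_wait_s_per_op: float) -> float:
    """Fraction of wall-clock time the CPU spends on useful work.

    Assumption: a serial model where each operation is `cpu_s_per_op` of
    compute followed by `io_wait_s_per_op` of waiting on storage.
    """
    total = cpu_s_per_op + io_wait_s_per_op
    if total <= 0:
        raise ValueError("total time per operation must be positive")
    return cpu_s_per_op / total


# Hypothetical: 10 microseconds of compute per operation, shrinking I/O wait.
for io_wait_us in (1000, 100, 10, 0):
    u = cpu_utilization(10e-6, io_wait_us * 1e-6)
    print(f"I/O wait {io_wait_us:>4} us -> CPU utilization {u:.1%}")
```

As the I/O wait approaches zero, utilization approaches 100% and the CPU becomes the limiting factor, which is the point of the thought experiment.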
Also worth noting: with a parallel I/O system, the number of CPUs engaged depends on the storage I/O being processed. If the process is
compute-heavy, then the CPUs will be free to process the compute functions without being interrupted by storage processing, since storage processing for
that period of time is at a minimum. This is the dynamic nature of Parallel I/O.
Below is an illustration showing the impact an end-to-end parallel I/O system can have on the CPU utilization pattern, but most importantly the task
completion time. This example shows a relative 4x improvement in task completion time for a singular task with parallel I/O processing. Generally speaking,
it is not uncommon to see the time-to-completion reduced by more than an order of magnitude depending on workload pattern and availability of
In reality, there is latency between the CPU and the storage system regardless of where the storage actually resides. However, while an infinitely fast
storage system does not exist, we do have technology today which provides parallel I/O processing that is so fast, the CPU does in fact become the limiting
component, just as I showed in the hypothetical scenario.
With end-to-end parallel I/O processing, that is, where storage I/O processing is occurring on the CPU along with the application which is generating
the I/O, the CPU becomes the new bottleneck. This is precisely what we observe in the real world.
Below is a screenshot of a very basic IOmeter test running in a virtual machine. The virtual machine has DataCore Parallel I/O technology installed. In
this case the workload is highly parallel (driven by many workers across many CPUs) and the storage processing is also highly parallel and spread across
the same CPUs. The workload is a 90% read, 10% write, 100% random 8k block pattern.
While these are very impressive performance numbers from inside a virtual machine (in particular the 1.03M IOPS at 35 microseconds), the main point I want to draw
out is this: the CPU is the component which is preventing us from achieving even higher performance levels.
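As a rough sanity check on those figures, Little's law (average items in flight = throughput × latency) gives a sense of how much concurrency the result implies. The sketch below takes the 1.03M IOPS and 35 microseconds from the test above; treating 35 microseconds as the average per-I/O latency at steady state, and "8k" as 8 KiB (8192 bytes), are my assumptions.

```python
IOPS = 1.03e6           # from the test result above
LATENCY_S = 35e-6       # assumption: average per-I/O latency at steady state
BLOCK_BYTES = 8 * 1024  # assumption: "8k" means 8 KiB


def outstanding_ios(iops: float, latency_s: float) -> float:
    """Little's law: average I/Os in flight = throughput * latency."""
    return iops * latency_s


def bandwidth_bytes_per_s(iops: float, block_bytes: int) -> float:
    """Data rate implied by an I/O rate and a fixed block size."""
    return iops * block_bytes


in_flight = outstanding_ios(IOPS, LATENCY_S)
data_rate = bandwidth_bytes_per_s(IOPS, BLOCK_BYTES)

print(f"Average I/Os in flight: {in_flight:.2f}")
print(f"Implied data rate: {data_rate / 1e9:.2f} GB/s")
```

Under these assumptions the system holds about 36 I/Os in flight on average and moves roughly 8.4 GB/s. With latency this low, the only way to push the I/O rate higher is to keep more requests in flight, and that takes more CPU.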
In parallel I/O systems, for a given workload, the CPU utilization pattern tends to change from a longer and less-intensive pattern to shorter and
more-intensive burst pattern. The result is the same amount of work being completed in a shorter period of time. Interestingly, in most environments over
time, the workload trend generally increases (i.e., higher concurrency), demanding more and more of the CPU at higher peak utilization. Previously
inaccessible CPU cycles are now generally available since the storage system is effectively out of the way of the application. We see a correlation between
the observed behavior and Gustafson’s law:
“One does not take a fixed-size problem and run it on various numbers of processors except when doing academic research; in practice,
the problem size scales with the number of processors. When given a more powerful processor, the problem generally expands to make use of the
increased facilities… Hence, it may be most realistic to assume run time, not problem size, is constant.”
It seems in our world, the amount of work to do is always increasing, albeit at different rates. As Gustafson stated, it is more realistic to assume
runtime or workload processing time stays constant while the problem size or workload demands increase. More work completed in the same amount of time
translates into an increase in overall system efficiency.
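Gustafson's argument is usually written as a scaled speedup, S = N - s(N - 1), where N is the number of processors and s is the serial fraction of the run time on the parallel system. The sketch below computes it next to the fixed-size (Amdahl) speedup purely for contrast. The 5% serial fraction is a hypothetical value, and the two formulas define the serial fraction against different baselines, so the comparison is illustrative rather than exact.

```python
def gustafson_speedup(n_processors: int, serial_fraction: float) -> float:
    """Scaled speedup when run time is held constant: S = N - s * (N - 1)."""
    return n_processors - serial_fraction * (n_processors - 1)


def amdahl_speedup(n_processors: int, serial_fraction: float) -> float:
    """Fixed-size speedup, shown for contrast: S = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


SERIAL_FRACTION = 0.05  # hypothetical: 5% of the run time is serial

for n in (8, 64, 192):
    scaled = gustafson_speedup(n, SERIAL_FRACTION)
    fixed = amdahl_speedup(n, SERIAL_FRACTION)
    print(f"{n:>3} processors: scaled {scaled:6.1f}x, fixed-size {fixed:5.1f}x")
```

When the workload grows with the processor count, the scaled speedup keeps climbing almost linearly, while the fixed-size speedup can never exceed 1/s (20x for a 5% serial fraction).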
Compute hypervisor technologies such as VMware certainly made multicore processors justifiable, allowing more workloads (i.e., VMs) to run on the same
platform. However, this only aggravated serial I/O processing bottlenecks at the storage layer while leaving many CPU cores underutilized.
Parallel I/O technology brings a totally new dimension to the demand for more cores within the CPU architecture. As I have said many times before, we now
live in a world where we have highly parallel application layers coupled with a very powerful parallel storage processing layer (see Parallel Application Meets Parallel Storage).
This new paradigm will most certainly justify higher density CPUs going forward. The good news is that processor manufacturers show no signs of slowing the
progression of CPU core density.
A recent article from 451 Research further explains the impact of Parallel I/O technology on our world:
DataCore looks to push I/O processing through the roof for all applications