Vectorization Part 1 – The Rise of Parallelism

New challenges in the financial markets, driven by changes in market structure and by regulations and accounting rules such as Basel III, EMIR, Dodd-Frank, MiFID II, Solvency II, IFRS 13, IFRS 9, and FRTB, have increased demand for higher-performance risk and analytics. Problems like XVA require orders of magnitude more calculations to produce accurate results. This demand for higher performance has put a focus on how to get the most out of the latest generation of hardware.

This is the first in a series of blogs on Vectorization, a key tool for dramatically improving the performance of code running on modern CPUs. Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at once. Modern CPUs provide direct support for vector operations, where a single instruction is applied to multiple data elements (SIMD).
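To make the idea concrete, here is a minimal C++ sketch (illustrative only, not code from this article) of the kind of loop a vectorizing compiler can transform. Built with optimization enabled (for example -O3 with gcc or clang, or the Intel C++ compiler), the compiler may rewrite the loop to process 4, 8, or 16 floats per iteration using SSE, AVX, or AVX-512 instructions, depending on the target CPU:

    #include <cstddef>

    // Scalar code: one element per loop iteration. Because the iterations
    // are independent, a vectorizing compiler can transform this to process
    // several elements per iteration with SIMD instructions.
    void scale_and_add(float* y, const float* x, float a, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];  // independent multiply-add per element
    }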

In this blog I cover how CPUs have evolved and how software must leverage both Threading and Vectorization to get the highest performance possible from the latest generation of processors.

The rise of parallelism

For the past decade, Moore’s law has continued to prevail: chip makers keep packing more transistors into every square inch of silicon. But the focus of innovation has moved away from higher clock speeds and towards multicore and manycore architectures.

As Herb Sutter famously observed in 2005, for developers this architectural shift meant the end of the “Free Lunch”, where existing software automatically ran faster with each new generation of hardware. Traditional applications based on a single serial thread of instructions no longer see performance gains from new hardware as CPU clock rates have flat-lined.

Source: Data from Intel

Since that time, a great deal of focus has been given to engineering applications that can exploit the growing number of CPU cores by running multi-threaded or grid-distributed calculations. This type of parallelism has become a routine part of designing performance-critical software.

At the same time as multicore chip design has given rise to task parallelism in software design, chipmakers have also been increasing the power of a second type of parallelism: instruction-level parallelism. Alongside the trend of rising core counts, the width of SIMD (single instruction, multiple data) registers has been steadily increasing. The software changes required to exploit instruction-level parallelism are known as ‘vectorization’.
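As an illustration of SIMD width (a sketch under my own assumptions, not code from this article): with AVX, a single 256-bit register holds eight 32-bit floats, so one instruction performs eight additions at once. The intrinsics below are standard Intel AVX intrinsics from immintrin.h; the function name add_avx and the assumption that n is a multiple of 8 are mine, to keep the sketch short:

    #include <immintrin.h>  // Intel AVX intrinsics

    // One 256-bit AVX register holds 8 floats; each _mm256_add_ps below
    // performs 8 additions in a single instruction. Assumes n % 8 == 0.
    void add_avx(float* c, const float* a, const float* b, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats (unaligned ok)
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 vc = _mm256_add_ps(va, vb);   // 8 adds, one instruction
            _mm256_storeu_ps(c + i, vc);         // store 8 results
        }
    }

With 512-bit registers (AVX-512), the same pattern processes sixteen floats per instruction.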

The most recent processors have many cores/threads and the ability to apply a single instruction to an increasingly large data set (SIMD width).

Source: Intel

A key driver of these architectural changes has been the power/performance dynamic of the alternative architectures:

·      Wider SIMD – Linear increase in transistors and power

·      Multicore – Quadratic increase in transistors and power

·      Higher clock frequency – Cubic increase in power

SIMD provides a way to increase performance using less power.

The first widely deployed desktop SIMD instruction set was Intel’s MMX extension to the x86 architecture in 1996.

Intel’s latest generation of Xeon Phi processors, codenamed Knights Landing, uses Intel’s new 14nm manufacturing process, has over 70 cores arranged on a 2D mesh, supports 4 threads per core, and can operate on 512-bit vectors (SIMD length).

Source: Intel

Software design must adapt to take advantage of these new processor technologies. Multi-threading and vectorization are each powerful tools on their own, but only by combining them can performance be maximized.

Source: Data from Intel

The above results are for a binomial options pricing example. Most existing code is either serial or implements Threading or Vectorization alone. The combination of both Threading and Vectorization provides dramatic improvements, and the scale of those improvements is growing with each new generation of hardware.
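One common way to combine the two in practice (a sketch under my own assumptions, not the binomial pricing code behind the chart above) is OpenMP, where ‘parallel for’ spreads loop iterations across cores and ‘simd’ asks the compiler to vectorize within each thread; compile with -fopenmp and optimization enabled. The function and parameter names here are hypothetical:

    #include <algorithm>

    // Threading + vectorization: outer iterations are divided among
    // threads, and each thread's chunk of the loop is vectorized.
    void call_payoffs(double* out, const double* spot, int n, double strike)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            out[i] = std::max(spot[i] - strike, 0.0);  // call payoff per node
    }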

Modern software must leverage both Threading and Vectorization to get the highest performance possible from the latest generation of processors.

Resources

Vectorization, Kirill Rogozhin, Intel, March 2017

Vectorization of Performance Dies for the Latest AVX SIMD, Kevin O’Leary, Intel, Aug 2016

A Guide to Vectorization with Intel® C++ Compilers, Intel, Nov 2010

Vectorization Codebook, Intel, Sep 2015

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Herb Sutter, March 2005

Recipe: Using Binomial Option Pricing Code as Representative Pricing Derivative Method, Shuo Li, Intel, June 2016

Rohan Douglas

Founder and CEO at Quantifi Inc

7y

Indeed there are improvements without code changes, but the optimal gain requires restructuring to introduce threading and more effective vectorisation - no small challenge for existing libraries. In a future post I'll give more details about the tools Intel provides which help in analysing code to find which areas will benefit the most from restructuring. Modern libraries need to support multi-threading. For vectorisation, the benefit of Intel's approach is that it does not require a total re-write and can be done in increments, with a focus on the areas that will most easily deliver the largest performance gain.


Intriguing - thanks for the heads-up, Rohan. Back in the early 90s I worked in petroleum production. Our core product was a reservoir simulator implementing sophisticated computational fluid dynamics in Fortran. The Cray Fortran compiler optimised for Cray's vector processor with no code changes needed. So I guess the key question will be about support for Intel's new CPUs from the gcc and MSVC C++ compilers to exploit this without code changes, because restructuring existing code that assumes single-threadedness to exploit multiple cores is a big, big job. It's not so long ago that I encountered a quant library at a tier 1 bank that used statics, thereby forcing all pricing and risk calcs onto a single thread. Page 26 here implies compiler support will yield performance gains without code changes: https://software.intel.com/sites/default/files/managed/11/56/intel-xeon-phi-processor-software-optimization-guide.pdf

Patrick McConnell

Author, Consultant, Dr. Business Administration

7y

Rohan, very good article. I hate coming across as a naysayer, but this stuff was being done in the late 1990s at investment banks like JPMorgan and Goldman Sachs. There are many problems, such as derivatives valuation using Monte Carlo or Lattice methods as you point out, that lend themselves to immense parallelization, and there are some problems that don't, like path-dependent models. The lessons learned in the 1990s are that the trick is to structure the input (and intermediate) data so that parallel processing can take place, and that pretty sophisticated synchronization methods are needed to start and stop parallel processes; that means the code must be built to allow parallelism. The architectural problem is whether it is better to use one high-power general processor or to have hundreds of fairly dumb specific ones - generally interested in your insights.
