A Survey of Computing Paradigms - From Literature to Machines
It has been proven time and again that designing better #computing systems for the future is only possible by co-designing across the computing stack, or what is called the Transformation Hierarchy, spanning stages from orchestrating #electrons to building software applications (the Wafer Scale Engine from Cerebras Systems and the #tpu from Google are good examples). #ComputerArchitecture has a huge role to play in advancing the design of computing systems for demanding applications in the coming decades. It involves the design of the ISA (Instruction-Set Architecture) and the #Microarchitecture, which form a crucial link between the application software and the underlying #hardware for effective #computation meeting the #Power, #Performance and #Area criteria. Here's a case-study approach to understanding the significance of Computer Architecture: https://www.dhirubhai.net/pulse/why-learn-computer-architecture-case-study-approach-neeraj-cheryala/
This article compiles various fundamental specifications and techniques that a computer engineer might consider when designing a computing system nowadays. It also presents trade-offs among these techniques in solving specific problems, accompanied by the relevant literature that discusses the approaches.
What's Computer Architecture?
We can define Computer Architecture as the science and art of designing, selecting and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy-consumption, cost and other specific goals. Gene Amdahl, in the IBM Journal of R&D, April 1964, described architecture as the collection of the attributes of a #system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the #dataflow and controls, the #logic design and the physical implementation.
From the above two definitions, it is clear that Computer Architecture involves the study and design of the #ISA and the microarchitecture together. Many different ISAs have been introduced over the decades, and each ISA can have many different implementations (or microarchitectures). Some of the ISAs are listed below:
Different implementations of some of the ISAs listed above are:
Each microarchitecture can take different approaches to solve a specific computing problem under specific design constraints and goals. The microarchitecture covers anything done in hardware without exposure to software, such as #Pipelining, in-order versus out-of-order instruction #execution, #memory access scheduling policy, speculative execution, superscalar processing, #Clock gating, #Caching, prefetching, voltage/frequency scaling, error correction, etc. The remainder of this #article covers most of the fundamental specs and features that #computer engineers usually consider while designing a computing system, along with the various approaches mentioned before.
Specs of an ISA
*Von Neumann Execution Model: Burks, Goldstine, von Neumann, "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument," 1946.
Semantic Gap and Binary Translators
The semantic gap describes how close the instructions, data types and addressing modes of an ISA are to high-level languages (HLL). A smaller semantic gap leads to an easier mapping of HLL to ISA: less work for the software designer but more work for the hardware designer. A larger semantic gap makes the mapping harder: more work for the software designer but less work for the hardware designer, i.e., the optimization burden falls on software.
One way to change the semantic-gap trade-off is to translate from one ISA into a different implementation ISA, i.e., from an ISA that is close to HLL, with complex instructions, data types and addressing modes, to an implementation ISA with simple instructions, data types and addressing modes that is closer to the hardware. This can be done with a software or hardware translator. For example, in 2020, Apple announced that *Rosetta 2 would be bundled with #macOS Big Sur to aid in the Mac transition to Apple Silicon. The software permits many applications compiled exclusively for execution on x86-64-based #processors to be translated for execution on Apple Silicon in the form of arm64 instructions. Even Intel and AMD processors translate the base x86-64 instructions into proprietary micro-operations to be executed on the hardware.
*More on Rosetta 2: https://en.wikipedia.org/wiki/Rosetta_(software)
Another example of changing the semantic gap is the secret of NVIDIA's *Denver: binary translation and code #Optimization. The purpose of NVIDIA's Dynamic Code Optimizer (DCO) is to accomplish two tasks: to translate ARM code to Denver's native format, and to optimize this code to make it run better on Denver. With no out-of-order hardware on Denver, it is the DCO's task to find the instruction-level #parallelism within a thread to fill Denver's many #execution units and to reorder instructions around potential stalls.
*The Secret of Denver: https://www.anandtech.com/show/8701/the-google-nexus-9-review/4
Another example is the Transmeta Code Morphing software, which mediates between the x86 ISA and a proprietary VLIW ISA executed on a Crusoe #processor. [Klaiber, "The Technology Behind Crusoe Processors," Transmeta White Paper, 2000.]
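To make the idea concrete, here is a minimal Python sketch of the common principle behind all of these translators: cracking a complex instruction into simple micro-operations. The instruction names and formats are hypothetical, purely for illustration; real translators like Rosetta 2 or the DCO are far more sophisticated.

```python
# Toy sketch: cracking a complex "CISC-style" instruction into simple
# micro-ops, in the spirit of x86 front ends and binary translators.
# Instruction names and formats here are hypothetical.

def translate(instr):
    """Translate one complex instruction into a list of simple micro-ops."""
    op, dst, src = instr
    if op == "ADD_MEM":                      # e.g. add [dst_addr], src_reg
        return [("LOAD", "tmp0", dst),       # bring the operand in from memory
                ("ADD", "tmp0", src),        # arithmetic on registers only
                ("STORE", dst, "tmp0")]      # write the result back
    return [instr]                           # simple instructions pass through

program = [("ADD_MEM", "0x1000", "r1"), ("MOV", "r2", "r1")]
micro_ops = [u for instr in program for u in translate(instr)]
print(micro_ops)   # the memory-to-register add becomes three micro-ops
```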
Example ways of designing the datapath and control logic:
Here are some of the approaches to Instruction-level Concurrency:
Increasing Instruction Processing Throughput: PIPELINING
The basic idea of Pipelining is the concurrent execution of multiple instructions by dividing the instruction processing cycle into distinct stages of processing and processing a different instruction in each stage. Instructions consecutive in program order are processed in consecutive stages. One of the early machines with pipelining was the IBM 7030 Stretch, designed by Gene Amdahl [See: https://en.wikipedia.org/wiki/IBM_7030_Stretch]. There are six fundamental ways of handling data dependencies in a pipelined processor:
A Question to Ponder: What is the role of the hardware vs. the software in determining the order in which instructions are executed in the pipeline?
The hardware/software decision impacts different metrics such as performance, complexity, power consumption, reliability, cost, etc. Software-based instruction scheduling is called static scheduling: the instructions are executed in the #compiler-dictated order. In contrast, with dynamic scheduling, the hardware can execute instructions out of the compiler-specified order. Static scheduling can be difficult since the compiler may not have the information that is determined only at runtime, such as variable-length operation #latency, memory addresses and branch directions. However, the compiler can alleviate this by estimating these unknowns at static time through a process called profiling. The sketch below illustrates the basic pipelining idea itself.
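Here is a minimal Python sketch of how pipelining overlaps instructions: each cycle, every in-flight instruction advances one stage, so a new instruction enters Fetch while older ones are still completing. The five-stage split used here is the classic textbook choice, not any specific machine, and the sketch ignores stalls and hazards entirely.

```python
# Minimal sketch of pipelined instruction flow. With S stages and N
# instructions, completion takes N + S - 1 cycles instead of N * S.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # classic textbook 5-stage split

def timeline(n_instructions):
    """Return, per cycle, which instruction occupies which stage."""
    cycles = []
    for cycle in range(n_instructions + len(STAGES) - 1):
        row = {}
        for i in range(n_instructions):
            stage = cycle - i                # instruction i entered at cycle i
            if 0 <= stage < len(STAGES):
                row[f"I{i}"] = STAGES[stage]
        cycles.append(row)
    return cycles

for c, row in enumerate(timeline(4)):
    print(f"cycle {c}: {row}")
# 4 instructions finish in 8 cycles; unpipelined they would take 20.
```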
Maintaining Precise Exceptions with Pipelining
There can be exceptions and interrupts in program execution, and both require stopping the current program, saving the architectural state, handling the exception/#interrupt and returning to program execution. Most importantly, the architectural state should be precise when the exception is ready to be handled, i.e., all previous instructions should be completely retired and no later instruction should be retired [See Section 6.1 in the x86-64 Software Developer's Manual: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html]. This aids in software debugging, enables easy recovery from exceptions and enables traps into software (e.g., software-implemented opcodes). A basic solution to ensure precise exceptions is to make each operation take the same amount of time. The downside to this method is that the worst-case instruction latency determines all instructions' latency (worst when we consider memory operations as well). Hence there are multiple *solutions proposed for this problem, one of which (the reorder buffer) is sketched after the references below:
*Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers, 1988 (earlier version in ISCA 1985).
*Hwu and Patt, "Checkpoint Repair for Out-of-order Execution Machines," ISCA 1987.
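Here is a simplified Python sketch of the reorder-buffer idea from the Smith and Pleszkun paper: instructions may finish out of order, but results are committed to architectural state strictly in program order, so an exception always sees a precise state. Real ROBs also flush younger entries and recover rename state; this toy omits all of that, and the entry fields are illustrative.

```python
# Simplified reorder buffer (ROB) sketch: out-of-order completion,
# in-order retirement.
from collections import deque

rob = deque()   # each entry: {"tag", "done", "value", "exc"}

def dispatch(tag):
    rob.append({"tag": tag, "done": False, "value": None, "exc": False})

def complete(tag, value, exc=False):
    for e in rob:                       # mark finished, wherever it sits
        if e["tag"] == tag:
            e.update(done=True, value=value, exc=exc)

def retire(arch_state):
    """Retire only from the head, only when done; stop at an exception."""
    while rob and rob[0]["done"]:
        e = rob.popleft()
        if e["exc"]:
            return e["tag"]             # precise: all older retired, none younger
        arch_state[e["tag"]] = e["value"]
    return None

state = {}
dispatch("i0"); dispatch("i1")
complete("i1", 42)                      # younger instruction finishes first ...
retire(state)                           # ... but nothing retires until i0 is done
complete("i0", 7)
retire(state)
print(state)                            # {'i0': 7, 'i1': 42}
```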
Out-of-Order Execution (Dynamic Instruction Scheduling)
Although precise exceptions are ensured by the reorder buffer, a true data dependence stalls the dispatch of younger instructions into execution units. Basically, if an instruction gets stalled because its arguments are not yet ready, later instructions in the program that are completely independent of this stalled instruction cannot get executed (a dispatch stall). The basic solution is to move the dependent instructions out of the way of independent ones (such that the independent ones can execute). Other ways to prevent dispatch stalls are compile-time instruction scheduling/reordering (which becomes difficult when we have variable-latency instructions), value prediction (not very accurate) and fine-grained multithreading (which hurts single-thread performance). An approach called out-of-order execution with register renaming was invented by Robert Tomasulo and used in the IBM 360/91 floating-point units [See: Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of R&D, Jan. 1967]. Tomasulo's algorithm did not provide precise exceptions, and hence the processor was a nightmare to debug. The major difference today is precise exceptions along with out-of-order execution. The Intel Pentium Pro used principles from the below *works to implement OoO execution with precise exceptions.
*Patt, Hwu, Shebanow, "HPS, a New Microarchitecture: Rationale and Introduction," MICRO 1985.
*Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," MICRO 1985.
Today, OoO execution is used in most high-performance processors. Examples include the Intel Pentium Pro, AMD K5, Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, ARM Cortex-A15 and Apple M1.
The advantage of OoO execution is latency tolerance, i.e., it allows independent instructions to execute and complete in the presence of long-latency operations. It also finds and exploits parallel operations in a program. However, it leads to higher complexity, potentially lengthens the critical-path delay and requires more hardware resources. A sketch of register renaming, the key mechanism behind Tomasulo-style OoO execution, follows.
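Here is a small Python sketch of register renaming, with the caveat that it illustrates only the core idea, not Tomasulo's full algorithm (no reservation stations, no common data bus). Each write gets a fresh physical register, so only true read-after-write dependencies remain and false (WAR/WAW) dependencies disappear.

```python
# Register renaming sketch: map architectural registers to an unbounded
# supply of physical registers so only true dependencies remain.
import itertools

fresh = (f"p{i}" for i in itertools.count())   # endless physical registers
rename_map = {}                                # arch reg -> physical reg

def rename(instr):
    op, dst, *srcs = instr
    srcs = [rename_map.get(s, s) for s in srcs]  # read current mappings
    rename_map[dst] = next(fresh)                # fresh destination register
    return (op, rename_map[dst], *srcs)

code = [("MUL", "r1", "r2", "r3"),   # long-latency producer of r1
        ("ADD", "r4", "r1", "r5"),   # truly depends on the MUL
        ("SUB", "r1", "r6", "r7")]   # reuses r1: WAW hazard, renamed away
print([rename(i) for i in code])
# The SUB's destination becomes a new physical register, so it can
# execute ahead of the stalled MUL without corrupting the ADD's input.
```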
Superscalar Execution
Idea: fetch, decode, execute and retire multiple instructions per cycle (N-wide superscalar -> N instructions per cycle). The hardware performs the dependence checking between concurrently fetched instructions. This technique helps reach higher instruction throughput [See: Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, Dec. 1995]. A sketch of that dependence check follows.
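As a rough illustration, here is a Python sketch of the same-cycle dependence check a superscalar front end might perform among the instructions fetched together. It checks only read-after-write hazards within the group; a real machine also handles write hazards, resource limits and bypassing.

```python
# Sketch of the dependence check in an N-wide superscalar front end:
# a later instruction in the fetch group cannot issue in the same cycle
# as an earlier one that produces one of its sources.

def can_issue_together(group):
    """group: list of (op, dst, src1, src2, ...) in program order."""
    written = set()
    for op, dst, *srcs in group:
        if any(s in written for s in srcs):   # RAW hazard inside the group
            return False
        written.add(dst)
    return True

print(can_issue_together([("ADD", "r1", "r2", "r3"),
                          ("SUB", "r4", "r5", "r6")]))  # True: independent
print(can_issue_together([("ADD", "r1", "r2", "r3"),
                          ("SUB", "r4", "r1", "r6")]))  # False: RAW on r1
```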
VLIW (Very Long Instruction Word)
In superscalar execution, the hardware fetches multiple instructions at a time and checks dependences between them, whereas in *VLIW architectures the compiler packs independent instructions into a larger "instruction bundle" to be fetched, and the hardware executes the instructions in a bundle concurrently, in lock step. Hence there is no need for hardware dependence checking between concurrently fetched instructions in the VLIW model.
*Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983 (ELI-Enormously longword instructions)
In lock-step execution, if any operation stalls, all concurrent operations stall. A similar approach was followed in Fisher et al., "Parallel Processing: A Smart Compiler and a Dumb Machine," CC 1984. This philosophy is actually similar to that of RISC (simple instructions and hardware), except that multiple instructions are executed in parallel. In RISC (John Cocke+, 1970s, IBM 801 minicomputer), the compiler does the hard work of translating high-level language code to simple instructions, and the hardware does very simple translation/decoding. Similarly, in VLIW (Josh Fisher, ISCA 1983), the compiler does the hard work of finding instruction-level parallelism, and the hardware stays as simple and streamlined as possible, leading to a simpler design with higher frequency. Some of the commercial VLIW machines are:
The compiler-only approach of VLIW has several downsides that reduce performance. The compiler may insert too many NOPs when not enough parallelism is discovered. The static schedule gets intimately tied to the microarchitecture, and hence code compiled for one generation performs poorly on the next. There is no tolerance for variable or long-latency operations during lock-step execution. But an indirect advantage of the VLIW philosophy is that most *compiler optimizations developed for VLIW are employed in optimizing compilers in general (a sketch of the bundle-packing idea follows the references below).
*John Ellis, "Bulldog: A Compiler for VLIW Architectures," PhD Thesis, 1984.
*Hwu et al., "The Superblock: An Effective Technique for VLIW and Superscalar Compilation," The Journal of Supercomputing, 1993.
*Chang et al., "IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors," ISCA 1991.
*Mahlke+, "Effective Compiler Support for Predicated Execution Using the Hyperblock," MICRO 1992.
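Here is a minimal Python sketch of the bundle-packing idea: the compiler greedily groups mutually independent instructions into fixed-width bundles and pads unused slots with NOPs, so the hardware can issue each bundle in lock step with no dependence checking. The bundle width and greedy policy are illustrative choices, far simpler than real trace-scheduling compilers like Bulldog.

```python
# VLIW-style static bundling sketch: pack independent instructions into
# fixed-width bundles, padding with NOPs. Instructions: (op, dst, src...).

WIDTH = 3                                  # issue slots per bundle (made up)

def independent(a, b):
    """True if b neither reads nor overwrites a's destination."""
    return a[1] not in b[2:] and a[1] != b[1]

def pack(program):
    bundles, current = [], []
    for instr in program:
        if len(current) < WIDTH and all(independent(p, instr) for p in current):
            current.append(instr)
        else:                              # dependence or full bundle
            bundles.append(current + [("NOP",)] * (WIDTH - len(current)))
            current = [instr]
    if current:
        bundles.append(current + [("NOP",)] * (WIDTH - len(current)))
    return bundles

code = [("ADD", "r1", "r2", "r3"),
        ("MUL", "r4", "r5", "r6"),
        ("SUB", "r7", "r1", "r4")]         # depends on both earlier results
for bundle in pack(code):
    print(bundle)
# [('ADD','r1','r2','r3'), ('MUL','r4','r5','r6'), ('NOP',)]
# [('SUB','r7','r1','r4'), ('NOP',), ('NOP',)]   <- NOPs = wasted slots
```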
Accelerating Neural Network Processing with Systolic Arrays
The goal of systolic arrays is an accelerator design that is regular, with high concurrency and balanced computation and I/O (memory) bandwidth. The idea is to replace a single processing element (PE) with an array of PEs and carefully orchestrate the flow of data between the PEs such that they collectively transform a piece of input data before outputting it to memory. This approach maximizes the computation done on a single piece of data brought from memory. The convolution operation and matrix multiplication, which are massively used by CNNs and many image-processing tasks, can be computed very efficiently with the systolic approach. It makes efficient use of limited memory bandwidth, balances computation with I/O bandwidth availability and is specialized for a computation that needs to fit the PE organization/functions. It ultimately led to bigger ideas like stream processing and pipeline parallelism [See: Suleman+, "Data Marshaling for Multi-core Architectures," ISCA 2010]. The downsides of this approach are that it is not good at exploiting irregular parallelism and that it is relatively special-purpose, i.e., it needs software support to become a general-purpose model. One of the first systolic arrays was implemented in the Warp machine [Annaratone et al., "The Warp Computer: Architecture, Implementation and Performance," IEEE TC 1987]. Modern systolic arrays form the core accelerators of Google's Tensor Processing Units for critical datacenter workloads.
More detailed article on Systolic arrays: https://www.dhirubhai.net/pulse/systolic-arrays-tpu-neeraj-cheryala/
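Here is a short Python sketch (using NumPy) of an output-stationary systolic matrix multiply: rows of A flow rightward, columns of B flow downward, and each PE accumulates one output element as operand pairs stream past it. The sketch models only the data movement and the per-step multiply-accumulate; real arrays like the TPU's add pipelining, fixed-point datapaths and control.

```python
# Output-stationary systolic matmul sketch: PE(i, j) accumulates C[i, j]
# as skewed operands arrive over a sequence of global clock steps.
import numpy as np

def systolic_matmul(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))                 # one accumulator per PE
    for t in range(n + m + k - 2):         # global "clock" steps
        for i in range(n):
            for j in range(m):
                s = t - i - j              # which dot-product element arrives now
                if 0 <= s < k:
                    acc[i, j] += A[i, s] * B[s, j]   # PE multiply-accumulate
    return acc

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)   # matches a plain matmul
```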
An Alternate Approach to Tomasulo's Algorithm: Decoupled Access/Execute (DAE)
Tomasulo's algorithm is quite complex to implement. An alternate approach is to decouple operand access and execution via two separate instruction streams that communicate through ISA-visible queues.
The compiler generates two instruction streams (one for execution, 'E', and the other for memory accesses, 'A') and synchronizes the two upon control-flow instructions using branch queues. The advantage of this approach is that the execute stream can run ahead of the access stream and vice versa. If 'A' is waiting for memory, 'E' can perform useful work. If 'A' hits in the cache, it supplies data to the lagging 'E'. Nevertheless, this approach needs compiler support to partition the program and manage the queues, and hence the compiler determines the amount of decoupling. The Pentium 4 also involves a decoupling approach, with a separate memory micro-operation queue and an integer/floating-point micro-operation queue.
*Smith, "Dynamic Instruction Scheduling and the Astronautics ZS-1," IEEE Computer 1989.
Exploiting Data Parallelism: SIMD Processors and GPUs
#SIMD (Single-Instruction-Multiple-Data) is one of the types of processors proposed in Flynn's taxonomy of computers [See: Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966], the others being SISD, MISD (similar to systolic arrays) and MIMD. In SIMD processors, the concurrency arises from performing the same operation on different pieces of data (data parallelism). This is in contrast with dataflow execution, where the concurrency arises from executing different operations in parallel. SIMD exploits operation-level parallelism on different data. Two classes of SIMD processors are #Array processors (where an instruction operates on multiple data elements at the same time using different processing elements) and Vector Processors (where an instruction operates on multiple data elements in consecutive time steps using the same space). A #vector processor is one whose instructions operate on vectors rather than scalar values. Hence the basic requirements are to load/store vectors, i.e., to have vector #registers, and to operate on vectors of different lengths, maintained by a vector length register. Elements of a vector might be stored apart from each other in memory, so a vector stride register is also needed to correctly fetch the elements of a vector. The sketch below shows what these two registers control.
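Here is a small Python sketch of the roles of the vector length and vector stride registers: a single vector load gathers VLEN elements spaced VSTRIDE apart, e.g. walking down a column of a row-major matrix. The function names and register values are illustrative, not any particular vector ISA.

```python
# Sketch of a strided vector load plus an element-wise vector operation.

def vector_load(memory, base, vlen, vstride):
    """Gather vlen elements starting at base, spaced vstride apart."""
    return [memory[base + i * vstride] for i in range(vlen)]

def vector_add(va, vb):
    """One vector instruction: element-wise add across whole registers."""
    return [a + b for a, b in zip(va, vb)]

mem = list(range(100))                               # toy flat memory
col = vector_load(mem, base=2, vlen=4, vstride=10)   # "column": 2, 12, 22, 32
row = vector_load(mem, base=0, vlen=4, vstride=1)    # contiguous: 0, 1, 2, 3
print(vector_add(col, row))                          # [2, 13, 24, 35]
```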
A vector instruction performs an operation on each element of a vector in consecutive cycles. Vector functional units are pipelined, and each pipeline stage operates on a different data element. The key enabler is that vector instructions allow deeper pipelines, since there are no intra-vector dependencies and no control flow within a vector. Additionally, a known stride allows easy address calculation for all vector elements. The downside is that this approach works only if the parallelism is regular. Another limitation of SIMD processing is Amdahl's law, which states that the maximum speedup we can achieve depends on the fraction of the code that can be parallelized [See: Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," AFIPS 1967]. Basically, all parallel machines suffer from the serial bottleneck. Other limitations include the imbalance between compute and memory operations, i.e., memory bandwidth can easily become a bottleneck.
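For reference, Amdahl's law can be written as follows, where p is the parallelizable fraction of the work and N is the number of processing elements:

\[
  \text{Speedup}(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
  \qquad
  \lim_{N \to \infty} \text{Speedup}(N) = \frac{1}{1 - p}.
\]

Even with p = 0.9, the speedup is capped at 10x no matter how many SIMD lanes are available.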
Many existing ISAs include vector-like SIMD operations, for example *Intel MMX/SSEn/AVX, PowerPC AltiVec and ARM Advanced SIMD. Modern machine-learning accelerators like wafer-scale engines are essentially MIMD machines built from SIMD processors [See: Rocki et al., "Fast Stencil-Code Computation on a Wafer-Scale Processor," SC 2020].
*Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996.
Most modern SIMD processors are a combination of both array and vector processors, with GPUs as a prime example. In GPUs, the instruction pipeline operates like a SIMD pipeline, i.e., an array processor. However, the programming is done using threads, not SIMD instructions [the Single Program Multiple Data (SPMD) model]. Every SIMD lane is used by a completely independent thread. This is essentially SIMD processing coupled with fine-grained #Multithreading. Each thread executes the same code but operates on a different piece of data, and each thread has its own context. A set of #threads executing the same instruction is dynamically grouped into a #warp by the hardware.
Traditional SIMD contains a single thread, i.e., sequential instruction execution with lock-step execution within a SIMD instruction. Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner. It doesn't have to be lock-step, and each thread can be treated individually when placed in a different warp. Essentially, it is the SPMD programming model implemented on a SIMD machine, and threads can take different control paths in warp-based SIMD, as the sketch below illustrates.
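Here is a final Python sketch of the warp idea: scalar threads are grouped into fixed-size warps that execute one common instruction per step, and on a branch the hardware masks off the threads on the not-taken path. The warp size of 4 and the even/odd branch are arbitrary illustrations (real NVIDIA warps are 32 threads).

```python
# Warp-based SIMD sketch: SPMD threads grouped into warps, with a
# predication mask emulating branch divergence within a warp.

WARP_SIZE = 4

def run_warps(thread_data):
    """Every thread runs the same code on its own data element."""
    results = []
    for start in range(0, len(thread_data), WARP_SIZE):
        warp = thread_data[start:start + WARP_SIZE]
        mask = [x % 2 == 0 for x in warp]          # branch condition per thread
        # Divergence: masked threads idle on each path, wasting SIMD lanes.
        out = [x * 2 if m else x + 1 for x, m in zip(warp, mask)]
        results.extend(out)
    return results

print(run_warps([0, 1, 2, 3, 4, 5, 6, 7]))
# [0, 2, 4, 4, 8, 6, 12, 8]: each warp executed both branch paths.
```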
So that was quite an interesting survey of some of the execution paradigms in the field of computer architecture. While the article doesn't cover memory, I hope it gives readers an idea of the microarchitectural aspects to consider while designing a computing system from the processing standpoint. I'm grateful to Prof. Onur Mutlu for making his lectures available to everyone; they inspired me to write this article.