A Survey of Computing Paradigms - From Literature to Machines
It has been proven time and again that designing better #computing systems for the future is only possible by co-designing across the computing stack, or what is called the Transformation Hierarchy, spanning stages from orchestrating #electrons to building software applications (the Wafer Scale Engine from Cerebras Systems and the #tpu from Google are good examples). #ComputerArchitecture has a huge role to play in advancing the design of computing systems for demanding applications in the coming decades. It involves the design of the ISA (Instruction-Set Architecture) and the #Microarchitecture, which form a crucial link between the application software and the underlying #hardware for effective #computation meeting the #Power, #Performance and #Area criteria. Here's a case-study approach to understanding the significance of Computer Architecture: https://www.dhirubhai.net/pulse/why-learn-computer-architecture-case-study-approach-neeraj-cheryala/
This article compiles various fundamental specifications and techniques that a computer engineer might consider when designing a computing system nowadays. It also presents trade-offs among these techniques in solving specific problems, accompanied by the relevant literature that discusses the approaches.
What's Computer Architecture?
We can define Computer Architecture as the science and art of designing, selecting and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy-consumption, cost and other specific goals. Gene Amdahl, in the IBM Journal of R&D, April 1964, described architecture as the collection of the attributes of a #system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the #dataflow and controls, the #logic design and the physical implementation.
From the above two definitions, it is clear that Computer Architecture involves the study and design of the #ISA and the microarchitecture together. Many different ISAs have been introduced over the decades, and each ISA can have many different implementations (or microarchitectures). Some of the ISAs are listed below:
Different implementations of some of the ISAs listed above are:
Each microarchitecture can take different approaches to solve a specific computing problem under specific design constraints and goals. The microarchitecture covers anything done in hardware without exposure to software, such as #Pipelining, in-order versus out-of-order instruction #execution, #memory access scheduling policy, speculative execution, superscalar processing, #Clock gating, #Caching, prefetching, voltage/frequency scaling, error correction, etc. The remainder of this #article covers most of the fundamental specs and features that #computer engineers usually consider while designing a computing system, along with the various approaches mentioned before.
Specs of an ISA
*Von Neumann Execution Model: Burks, Goldstine, von Neumann, "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument," 1946.
Semantic Gap and Binary Translators
The semantic gap describes how close the instructions, data types and addressing modes of an ISA are to high-level languages (HLL). A smaller semantic gap leads to an easier mapping of HLL to ISA: less work for the software designer but more work for the hardware designer. A larger semantic gap makes the mapping harder: more work for the software designer but less work for the hardware designer, i.e., the optimization burden falls on software.
One way to change the semantic-gap trade-off is to translate from one ISA into a different implementation ISA, i.e., from an ISA that is close to HLL, with complex instructions, data types and addressing modes, to an implementation ISA with simple instructions, data types and addressing modes that is closer to the hardware. This can be done with a software or hardware translator. For example, in 2020, Apple announced that *Rosetta 2 would be bundled with #macOS Big Sur to aid in the Mac transition to Apple Silicon. The software permits many applications compiled exclusively for execution on x86-64-based #processors to be translated for execution on Apple Silicon in the form of arm64 instructions. Even Intel and AMD processors translate the base x86-64 instructions into proprietary micro-operations to be executed on the hardware.
*More on Rosetta 2: https://en.wikipedia.org/wiki/Rosetta_(software)
Another example of changing the semantic gap is the secret of NVIDIA's *Denver: binary translation and code #Optimization. The purpose of NVIDIA's Dynamic Code Optimizer (DCO) is to accomplish two tasks: to translate ARM code to Denver's native format, and to optimize this code to make it run better on Denver. With no out-of-order hardware on Denver, it is the DCO's task to find the instruction-level #parallelism within a thread to fill Denver's many #execution units and to reorder instructions around potential stalls.
*The Secret of Denver: https://www.anandtech.com/show/8701/the-google-nexus-9-review/4
Another example is the Transmeta Code Morphing software, which mediates between the x86 ISA and a proprietary VLIW ISA executed on a Crusoe #processor. [Klaiber, "The Technology Behind Crusoe Processors," Transmeta White Paper, 2000.]
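To make the idea concrete, here is a minimal Python sketch of the common principle behind all of these translators: cracking a complex instruction into simple micro-operations. The instruction names and formats are hypothetical, purely for illustration; real translators like Rosetta 2 or the DCO are far more sophisticated.

```python
# Toy sketch: cracking a complex "CISC-style" instruction into simple
# micro-ops, in the spirit of x86 front ends and binary translators.
# Instruction names and formats here are hypothetical.

def translate(instr):
    """Translate one complex instruction into a list of simple micro-ops."""
    op, dst, src = instr
    if op == "ADD_MEM":                      # e.g. add [dst_addr], src_reg
        return [("LOAD", "tmp0", dst),       # bring the operand in from memory
                ("ADD", "tmp0", src),        # arithmetic on registers only
                ("STORE", dst, "tmp0")]      # write the result back
    return [instr]                           # simple instructions pass through

program = [("ADD_MEM", "0x1000", "r1"), ("MOV", "r2", "r1")]
micro_ops = [u for instr in program for u in translate(instr)]
print(micro_ops)   # the memory-to-register add becomes three micro-ops
```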
Example ways of designing the datapath and control logic:
Here are some of the approaches to Instruction-level Concurrency:
Increasing Instruction Processing Throughput: PIPELINING
The basic idea of Pipelining is the concurrent execution of multiple instructions by dividing the instruction processing cycle into distinct stages of processing and processing a different instruction in each stage. Instructions consecutive in program order are processed in consecutive stages. One of the early machines with pipelining was the IBM 7030 Stretch, designed by Gene Amdahl [See: https://en.wikipedia.org/wiki/IBM_7030_Stretch]. There are six fundamental ways of handling data dependencies in a pipelined processor:
A Question to Ponder: What is the role of the hardware vs. the software in determining the order in which instructions are executed in the pipeline?
The hardware/software decision impacts different metrics such as performance, complexity, power consumption, reliability, cost, etc. Software-based instruction scheduling is called static scheduling: the instructions are executed in the #compiler-dictated order. In contrast, with dynamic scheduling, the hardware can execute instructions out of the compiler-specified order. Static scheduling can be difficult since the compiler may not have the information that is determined only at runtime, such as variable-length operation #latency, memory addresses and branch directions. However, the compiler can alleviate this by estimating these unknowns at static time through a process called profiling. The sketch below illustrates the basic pipelining idea itself.
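Here is a minimal Python sketch of how pipelining overlaps instructions: each cycle, every in-flight instruction advances one stage, so a new instruction enters Fetch while older ones are still completing. The five-stage split used here is the classic textbook choice, not any specific machine, and the sketch ignores stalls and hazards entirely.

```python
# Minimal sketch of pipelined instruction flow. With S stages and N
# instructions, completion takes N + S - 1 cycles instead of N * S.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # classic textbook 5-stage split

def timeline(n_instructions):
    """Return, per cycle, which instruction occupies which stage."""
    cycles = []
    for cycle in range(n_instructions + len(STAGES) - 1):
        row = {}
        for i in range(n_instructions):
            stage = cycle - i                # instruction i entered at cycle i
            if 0 <= stage < len(STAGES):
                row[f"I{i}"] = STAGES[stage]
        cycles.append(row)
    return cycles

for c, row in enumerate(timeline(4)):
    print(f"cycle {c}: {row}")
# 4 instructions finish in 8 cycles; unpipelined they would take 20.
```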
Maintaining Precise Exceptions with Pipelining
There can be exceptions and interrupts in program execution, and both require stopping the current program, saving the architectural state, handling the exception/#interrupt and returning to program execution. Most importantly, the architectural state should be precise when the exception is ready to be handled, i.e., all previous instructions should be completely retired and no later instruction should be retired [See Section 6.1 in the x86-64 Software Developer's Manual: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html]. This aids in software debugging, enables easy recovery from exceptions and enables traps into software (e.g., software-implemented opcodes). A basic solution to ensure precise exceptions is to make each operation take the same amount of time. The downside to this method is that the worst-case instruction latency determines all instructions' latency (worst when we consider memory operations as well). Hence there are multiple *solutions proposed for this problem, one of which (the reorder buffer) is sketched after the references below:
*Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers, 1988 (earlier version in ISCA 1985).
*Hwu and Patt, "Checkpoint Repair for Out-of-order Execution Machines," ISCA 1987.
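Here is a simplified Python sketch of the reorder-buffer idea from the Smith and Pleszkun paper: instructions may finish out of order, but results are committed to architectural state strictly in program order, so an exception always sees a precise state. Real ROBs also flush younger entries and recover rename state; this toy omits all of that, and the entry fields are illustrative.

```python
# Simplified reorder buffer (ROB) sketch: out-of-order completion,
# in-order retirement.
from collections import deque

rob = deque()   # each entry: {"tag", "done", "value", "exc"}

def dispatch(tag):
    rob.append({"tag": tag, "done": False, "value": None, "exc": False})

def complete(tag, value, exc=False):
    for e in rob:                       # mark finished, wherever it sits
        if e["tag"] == tag:
            e.update(done=True, value=value, exc=exc)

def retire(arch_state):
    """Retire only from the head, only when done; stop at an exception."""
    while rob and rob[0]["done"]:
        e = rob.popleft()
        if e["exc"]:
            return e["tag"]             # precise: all older retired, none younger
        arch_state[e["tag"]] = e["value"]
    return None

state = {}
dispatch("i0"); dispatch("i1")
complete("i1", 42)                      # younger instruction finishes first ...
retire(state)                           # ... but nothing retires until i0 is done
complete("i0", 7)
retire(state)
print(state)                            # {'i0': 7, 'i1': 42}
```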
Out-of-Order Execution (Dynamic Instruction Scheduling)
Although precise exceptions are ensured by the reorder buffer, a true data dependence stalls the dispatch of younger instructions into execution units. Basically, if an instruction gets stalled because its arguments are not yet ready, later instructions in the program that are completely independent of this stalled instruction cannot get executed (a dispatch stall). The basic solution is to move the dependent instructions out of the way of independent ones (such that the independent ones can execute). Other ways to prevent dispatch stalls are compile-time instruction scheduling/reordering (which becomes difficult when we have variable-latency instructions), value prediction (not very accurate) and fine-grained multithreading (which hurts single-thread performance). An approach called out-of-order execution with register renaming was invented by Robert Tomasulo and used in the IBM 360/91 floating-point units [See: Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of R&D, Jan. 1967]. Tomasulo's algorithm did not provide precise exceptions, and hence the processor was a nightmare to debug. The major difference today is precise exceptions along with out-of-order execution. The Intel Pentium Pro used principles from the below *works to implement OoO execution with precise exceptions.
*Patt, Hwu, Shebanow, "HPS, a New Microarchitecture: Rationale and Introduction," MICRO 1985.
*Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," MICRO 1985.
Today, OoO execution is used in most high-performance processors. Examples include the Intel Pentium Pro, AMD K5, Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, ARM Cortex-A15 and Apple M1.
The advantage of OoO execution is latency tolerance, i.e., it allows independent instructions to execute and complete in the presence of long-latency operations. It also finds and exploits parallel operations in a program. However, it leads to higher complexity, potentially lengthens the critical-path delay and requires more hardware resources. A sketch of register renaming, the key mechanism behind Tomasulo-style OoO execution, follows.
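Here is a small Python sketch of register renaming, with the caveat that it illustrates only the core idea, not Tomasulo's full algorithm (no reservation stations, no common data bus). Each write gets a fresh physical register, so only true read-after-write dependencies remain and false (WAR/WAW) dependencies disappear.

```python
# Register renaming sketch: map architectural registers to an unbounded
# supply of physical registers so only true dependencies remain.
import itertools

fresh = (f"p{i}" for i in itertools.count())   # endless physical registers
rename_map = {}                                # arch reg -> physical reg

def rename(instr):
    op, dst, *srcs = instr
    srcs = [rename_map.get(s, s) for s in srcs]  # read current mappings
    rename_map[dst] = next(fresh)                # fresh destination register
    return (op, rename_map[dst], *srcs)

code = [("MUL", "r1", "r2", "r3"),   # long-latency producer of r1
        ("ADD", "r4", "r1", "r5"),   # truly depends on the MUL
        ("SUB", "r1", "r6", "r7")]   # reuses r1: WAW hazard, renamed away
print([rename(i) for i in code])
# The SUB's destination becomes a new physical register, so it can
# execute ahead of the stalled MUL without corrupting the ADD's input.
```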
Superscalar Execution
Idea: fetch, decode, execute and retire multiple instructions per cycle (N-wide superscalar -> N instructions per cycle). The hardware performs the dependence checking between concurrently fetched instructions. This technique helps reach higher instruction throughput [See: Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, Dec. 1995]. A sketch of that dependence check follows.
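As a rough illustration, here is a Python sketch of the same-cycle dependence check a superscalar front end might perform among the instructions fetched together. It checks only read-after-write hazards within the group; a real machine also handles write hazards, resource limits and bypassing.

```python
# Sketch of the dependence check in an N-wide superscalar front end:
# a later instruction in the fetch group cannot issue in the same cycle
# as an earlier one that produces one of its sources.

def can_issue_together(group):
    """group: list of (op, dst, src1, src2, ...) in program order."""
    written = set()
    for op, dst, *srcs in group:
        if any(s in written for s in srcs):   # RAW hazard inside the group
            return False
        written.add(dst)
    return True

print(can_issue_together([("ADD", "r1", "r2", "r3"),
                          ("SUB", "r4", "r5", "r6")]))  # True: independent
print(can_issue_together([("ADD", "r1", "r2", "r3"),
                          ("SUB", "r4", "r1", "r6")]))  # False: RAW on r1
```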
VLIW (Very Long Instruction Word)
In superscalar execution, the hardware fetches multiple instructions at a time and checks dependences between them, whereas in *VLIW architectures the compiler packs independent instructions into a larger "instruction bundle" to be fetched, and the hardware executes the instructions in a bundle concurrently, in lock step. Hence there is no need for hardware dependence checking between concurrently fetched instructions in the VLIW model.
*Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983 (ELI-Enormously longword instructions)
In lock-step execution, if any operation stalls, all concurrent operations stall. A similar approach was followed in Fisher et al., "Parallel Processing: A Smart Compiler and a Dumb Machine," CC 1984. This philosophy is actually similar to that of RISC (simple instructions and hardware), except that multiple instructions are executed in parallel. In RISC (John Cocke+, 1970s, IBM 801 minicomputer), the compiler does the hard work of translating high-level language code to simple instructions, and the hardware does very simple translation/decoding. Similarly, in VLIW (Josh Fisher, ISCA 1983), the compiler does the hard work of finding instruction-level parallelism, and the hardware stays as simple and streamlined as possible, leading to a simpler design with higher frequency. Some of the commercial VLIW machines are:
The compiler-only approach of VLIW has several downsides that reduce performance. The compiler may insert too many NOPs when not enough parallelism is discovered. The static schedule gets intimately tied to the microarchitecture, and hence code compiled for one generation performs poorly on the next. There is no tolerance for variable or long-latency operations during lock-step execution. But an indirect advantage of the VLIW philosophy is that most *compiler optimizations developed for VLIW are employed in optimizing compilers in general (a sketch of the bundle-packing idea follows the references below).
*John Ellis, "Bulldog: A Compiler for VLIW Architectures," PhD Thesis, 1984.
*Hwu et al., "The Superblock: An Effective Technique for VLIW and Superscalar Compilation," The Journal of Supercomputing, 1993.
*Chang et al., "IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors," ISCA 1991.
*Mahlke+, "Effective Compiler Support for Predicated Execution Using the Hyperblock," MICRO 1992.
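Here is a minimal Python sketch of the bundle-packing idea: the compiler greedily groups mutually independent instructions into fixed-width bundles and pads unused slots with NOPs, so the hardware can issue each bundle in lock step with no dependence checking. The bundle width and greedy policy are illustrative choices, far simpler than real trace-scheduling compilers like Bulldog.

```python
# VLIW-style static bundling sketch: pack independent instructions into
# fixed-width bundles, padding with NOPs. Instructions: (op, dst, src...).

WIDTH = 3                                  # issue slots per bundle (made up)

def independent(a, b):
    """True if b neither reads nor overwrites a's destination."""
    return a[1] not in b[2:] and a[1] != b[1]

def pack(program):
    bundles, current = [], []
    for instr in program:
        if len(current) < WIDTH and all(independent(p, instr) for p in current):
            current.append(instr)
        else:                              # dependence or full bundle
            bundles.append(current + [("NOP",)] * (WIDTH - len(current)))
            current = [instr]
    if current:
        bundles.append(current + [("NOP",)] * (WIDTH - len(current)))
    return bundles

code = [("ADD", "r1", "r2", "r3"),
        ("MUL", "r4", "r5", "r6"),
        ("SUB", "r7", "r1", "r4")]         # depends on both earlier results
for bundle in pack(code):
    print(bundle)
# [('ADD','r1','r2','r3'), ('MUL','r4','r5','r6'), ('NOP',)]
# [('SUB','r7','r1','r4'), ('NOP',), ('NOP',)]   <- NOPs = wasted slots
```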
Accelerating Neural Network Processing with Systolic Arrays
The goal of systolic arrays is an accelerator design that is regular, with high concurrency and balanced computation and I/O (memory) bandwidth. The idea is to replace a single processing element (PE) with an array of PEs and carefully orchestrate the flow of data between the PEs such that they collectively transform a piece of input data before outputting it to memory. This approach maximizes the computation done on a single piece of data brought from memory. The convolution operation and matrix multiplication, which are massively used by CNNs and many image-processing tasks, can be computed very efficiently with the systolic approach. It makes efficient use of limited memory bandwidth, balances computation with I/O bandwidth availability and is specialized for a computation that needs to fit the PE organization/functions. It ultimately led to bigger ideas like stream processing and pipeline parallelism [See: Suleman+, "Data Marshaling for Multi-core Architectures," ISCA 2010]. The downsides of this approach are that it is not good at exploiting irregular parallelism and that it is relatively special-purpose, i.e., it needs software support to become a general-purpose model. One of the first systolic arrays was implemented in the Warp machine [Annaratone et al., "The Warp Computer: Architecture, Implementation and Performance," IEEE TC 1987]. Modern systolic arrays form the core accelerators of Google's Tensor Processing Units for critical datacenter workloads.
More detailed article on Systolic arrays: https://www.dhirubhai.net/pulse/systolic-arrays-tpu-neeraj-cheryala/
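Here is a short Python sketch (using NumPy) of an output-stationary systolic matrix multiply: rows of A flow rightward, columns of B flow downward, and each PE accumulates one output element as operand pairs stream past it. The sketch models only the data movement and the per-step multiply-accumulate; real arrays like the TPU's add pipelining, fixed-point datapaths and control.

```python
# Output-stationary systolic matmul sketch: PE(i, j) accumulates C[i, j]
# as skewed operands arrive over a sequence of global clock steps.
import numpy as np

def systolic_matmul(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))                 # one accumulator per PE
    for t in range(n + m + k - 2):         # global "clock" steps
        for i in range(n):
            for j in range(m):
                s = t - i - j              # which dot-product element arrives now
                if 0 <= s < k:
                    acc[i, j] += A[i, s] * B[s, j]   # PE multiply-accumulate
    return acc

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)   # matches a plain matmul
```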
An Alternate Approach to Tomasulo's Algorithm: Decoupled Access/Execute (DAE)
Tomasulo's algorithm is quite complex to implement. An alternate approach is to decouple operand access and execution via two separate instruction streams that communicate through ISA-visible queues.
The compiler generates two instruction streams (one for execution, 'E', and the other for memory accesses, 'A') and synchronizes the two upon control-flow instructions using branch queues. The advantage of this approach is that the execute stream can run ahead of the access stream and vice versa. If 'A' is waiting for memory, 'E' can perform useful work. If 'A' hits in the cache, it supplies data to the lagging 'E'. Nevertheless, this approach needs compiler support to partition the program and manage the queues, and hence the compiler determines the amount of decoupling. The Pentium 4 also involves a decoupling approach, with a separate memory micro-operation queue and an integer/floating-point micro-operation queue.
*Smith, "Dynamic Instruction Scheduling and the Astronautics ZS-1," IEEE Computer 1989.
Exploiting Data Parallelism: SIMD Processors and GPUs
#SIMD (Single-Instruction-Multiple-Data) is one of the types of processors proposed in Flynn's taxonomy of computers [See: Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966], the others being SISD, MISD (similar to systolic arrays) and MIMD. In SIMD processors, the concurrency arises from performing the same operation on different pieces of data (data parallelism). This is in contrast with dataflow execution, where the concurrency arises from executing different operations in parallel. SIMD exploits operation-level parallelism on different data. Two classes of SIMD processors are #Array processors (where an instruction operates on multiple data elements at the same time using different processing elements) and Vector Processors (where an instruction operates on multiple data elements in consecutive time steps using the same space). A #vector processor is one whose instructions operate on vectors rather than scalar values. Hence the basic requirements are to load/store vectors, i.e., to have vector #registers, and to operate on vectors of different lengths, maintained by a vector length register. Elements of a vector might be stored apart from each other in memory, so a vector stride register is also needed to correctly fetch the elements of a vector. The sketch below shows what these two registers control.
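Here is a small Python sketch of the roles of the vector length and vector stride registers: a single vector load gathers VLEN elements spaced VSTRIDE apart, e.g. walking down a column of a row-major matrix. The function names and register values are illustrative, not any particular vector ISA.

```python
# Sketch of a strided vector load plus an element-wise vector operation.

def vector_load(memory, base, vlen, vstride):
    """Gather vlen elements starting at base, spaced vstride apart."""
    return [memory[base + i * vstride] for i in range(vlen)]

def vector_add(va, vb):
    """One vector instruction: element-wise add across whole registers."""
    return [a + b for a, b in zip(va, vb)]

mem = list(range(100))                               # toy flat memory
col = vector_load(mem, base=2, vlen=4, vstride=10)   # "column": 2, 12, 22, 32
row = vector_load(mem, base=0, vlen=4, vstride=1)    # contiguous: 0, 1, 2, 3
print(vector_add(col, row))                          # [2, 13, 24, 35]
```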
A vector instruction performs an operation on each element of a vector in consecutive cycles. Vector functional units are pipelined, and each pipeline stage operates on a different data element. The key enabler is that vector instructions allow deeper pipelines, since there are no intra-vector dependencies and no control flow within a vector. Additionally, a known stride allows easy address calculation for all vector elements. The downside is that this approach works only if the parallelism is regular. Another limitation of SIMD processing is Amdahl's law, which states that the maximum speedup we can achieve depends on the fraction of the code that can be parallelized [See: Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," AFIPS 1967]. Basically, all parallel machines suffer from the serial bottleneck. Other limitations include the imbalance between compute and memory operations, i.e., memory bandwidth can easily become a bottleneck.
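For reference, Amdahl's law can be written as follows, where p is the parallelizable fraction of the work and N is the number of processing elements:

\[
  \text{Speedup}(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
  \qquad
  \lim_{N \to \infty} \text{Speedup}(N) = \frac{1}{1 - p}.
\]

Even with p = 0.9, the speedup is capped at 10x no matter how many SIMD lanes are available.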
Many existing ISAs include vector-like SIMD operations, for example *Intel MMX/SSEn/AVX, PowerPC AltiVec and ARM Advanced SIMD. Modern machine-learning accelerators like wafer-scale engines are essentially MIMD machines built from SIMD processors [See: Rocki et al., "Fast Stencil-Code Computation on a Wafer-Scale Processor," SC 2020].
*Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996.
Most modern SIMD processors are a combination of both array and vector processors, with GPUs as a prime example. In GPUs, the instruction pipeline operates like a SIMD pipeline, i.e., an array processor. However, the programming is done using threads, not SIMD instructions [the Single Program Multiple Data (SPMD) model]. Every SIMD lane is used by a completely independent thread. This is essentially SIMD processing coupled with fine-grained #Multithreading. Each thread executes the same code but operates on a different piece of data, and each thread has its own context. A set of #threads executing the same instruction is dynamically grouped into a #warp by the hardware.
Traditional SIMD contains a single thread, i.e., sequential instruction execution with lock-step execution within a SIMD instruction. Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner. It doesn't have to be lock-step, and each thread can be treated individually when placed in a different warp. Essentially, it is the SPMD programming model implemented on a SIMD machine, and threads can take different control paths in warp-based SIMD, as the sketch below illustrates.
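Here is a final Python sketch of the warp idea: scalar threads are grouped into fixed-size warps that execute one common instruction per step, and on a branch the hardware masks off the threads on the not-taken path. The warp size of 4 and the even/odd branch are arbitrary illustrations (real NVIDIA warps are 32 threads).

```python
# Warp-based SIMD sketch: SPMD threads grouped into warps, with a
# predication mask emulating branch divergence within a warp.

WARP_SIZE = 4

def run_warps(thread_data):
    """Every thread runs the same code on its own data element."""
    results = []
    for start in range(0, len(thread_data), WARP_SIZE):
        warp = thread_data[start:start + WARP_SIZE]
        mask = [x % 2 == 0 for x in warp]          # branch condition per thread
        # Divergence: masked threads idle on each path, wasting SIMD lanes.
        out = [x * 2 if m else x + 1 for x, m in zip(warp, mask)]
        results.extend(out)
    return results

print(run_warps([0, 1, 2, 3, 4, 5, 6, 7]))
# [0, 2, 4, 4, 8, 6, 12, 8]: each warp executed both branch paths.
```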
So that was quite an interesting survey of some of the execution paradigms in the field of computer architecture. While the article doesn't cover memory, I hope it gives readers an idea of the microarchitectural aspects to consider while designing a computing system from the processing standpoint. I'm grateful to Prof. Onur Mutlu for making his lectures available to everyone; they inspired me to write this article.