How Does Computing Architecture Develop in the AI Era?
KUKE ELECTRONICS LIMITED
10+ Years Distributor of Original Electronic Components at Competitive Prices
Heterogeneous computing means systems built from more than one kind of processor or core. With the rise of big data, from the popularity of the Google TPU to ChatGPT, people require ever more powerful computing. Fields such as artificial intelligence (AI) and deep learning involve massive data processing and model-training workloads, so they place particularly high demands on computing power. How to improve computing power has therefore become a major problem.
Forms of Computing
Looking back at the history of computer development, computing has moved from serial to parallel and from homogeneous to heterogeneous, and it will continue to evolve toward the super-heterogeneous:
Serial computing: single-core CPUs and ASICs both perform serial computing.
Homogeneous parallel computing: multi-core CPU parallel computing.
Heterogeneous parallel computing: CPU+GPU, CPU+FPGA, CPU+DSA, and SoC all belong to this category.
In the future, computing will move toward super-heterogeneous parallel computing, integrating numerous kinds of processors into a single super-heterogeneous system.
Serial Computing & Parallel Computing
Serial computing
Software is generally written for serial computing.
1) A problem is decomposed into a set of instruction streams.
2) Execute on a single processor.
3) Instructions are executed sequentially (the processor may execute independent instructions out of order internally, but it is still executing a serial instruction stream).
Parallel computing
It is the simultaneous use of multiple computing resources to solve a problem.
1) A problem is broken down into parts that can be solved simultaneously.
2) Each part is further decomposed into a series of instructions.
3) The instructions of each part are executed simultaneously on different processors.
4) An overall control/coordination mechanism needs to be employed.
A computational problem should be decomposable into discrete jobs that can be solved simultaneously, allow multiple program instructions to execute at any moment, and be solvable in less time with multiple computing resources than with a single one.
Computing resources are typically a single computer with multiple processors/cores, or any number of such computers connected by a network (or bus). A minimal sketch contrasting serial and parallel execution follows.
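To make the contrast concrete, here is a minimal Python sketch (our own illustration; the workload, chunking scheme, and worker count are all assumptions chosen for clarity) in which the same toy problem is solved serially on one core and in parallel across several:

```python
# A minimal sketch: sum the squares of a range of integers, first serially,
# then by decomposing the problem into independent chunks that are solved
# simultaneously on separate CPU cores and combined by a coordinator.
from multiprocessing import Pool

def sum_of_squares(chunk):
    # One "part" of the decomposed problem, itself a serial instruction stream.
    lo, hi = chunk
    return sum(i * i for i in range(lo, hi))

def serial(n):
    # Serial computing: a single instruction stream on a single processor.
    return sum_of_squares((0, n))

def parallel(n, workers=4):
    # Parallel computing: the problem is broken into `workers` parts solved
    # simultaneously; Pool is the overall control/coordination mechanism.
    step = n // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    chunks[-1] = (chunks[-1][0], n)  # absorb any remainder into the last chunk
    with Pool(workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))

if __name__ == "__main__":
    assert serial(1_000_000) == parallel(1_000_000)
```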
Multi-core CPU
The figure below shows the internal architecture of an Intel Xeon Skylake processor. As can be seen, this CPU is a homogeneous parallel design composed of 28 CPU cores.
The figure below shows NVIDIA's Turing GPU architecture. Its core processing engine consists of the following parts: 6 graphics processing clusters (GPCs); 6 texture processing clusters (TPCs) per GPC, for 36 TPCs in total; 2 streaming multiprocessors (SMs) per TPC, for 72 SMs in total; and, per SM, 64 CUDA cores, 8 Tensor cores, 1 RT core, and 4 texture units.
Therefore, the full Turing-architecture GPU has a total of 4608 CUDA cores, 576 Tensor cores, 72 RT cores, and 288 texture units.
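The totals follow directly from the hierarchy; as a quick sanity check (illustrative arithmetic only):

```python
# Verifying the Turing totals described above from the per-level counts.
gpcs = 6
tpcs_per_gpc = 6
sms_per_tpc = 2

sms = gpcs * tpcs_per_gpc * sms_per_tpc   # 6 * 6 * 2 = 72 SMs
cuda_cores = sms * 64                     # 72 * 64 = 4608 CUDA cores
tensor_cores = sms * 8                    # 72 * 8  = 576 Tensor cores
rt_cores = sms * 1                        # 72 RT cores
texture_units = sms * 4                   # 72 * 4  = 288 texture units

print(sms, cuda_cores, tensor_cores, rt_cores, texture_units)
```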
From Homogeneous to Heterogeneous
Heterogeneous CPU+xPU topology
Because the CPU can run autonomously, multi-core CPU chips form homogeneous parallel systems on their own. Processing engines other than the CPU, however, cannot form a parallel computing system independently: GPUs, FPGAs, DSAs, ASICs, and the like all exist as accelerators attached to a CPU. Parallel computing with these processors therefore takes the combined form of CPU+xPU heterogeneous parallelism, which can be roughly divided into three categories:
1) CPU+GPU: Currently the most popular computing system, widely used in scenarios such as HPC (high-performance computing), graphics and image processing, and AI training/inference.
2) CPU+FPGA: For example, the FaaS (FPGA-as-a-Service) offerings now popular in data centers exploit the partial programmability of FPGAs to develop and run acceleration frameworks, building FPGA acceleration solutions for various application scenarios with the help of third-party ISVs or in-house development.
3) CPU+DSA: The Google TPU was the first DSA-architecture processor. TPU v1 was deployed as a standalone accelerator, realizing heterogeneous parallelism in the CPU+DSA form.
In addition, note that because an ASIC's function is fixed (it lacks flexibility), there is no CPU+single-ASIC heterogeneous computing. The usual forms are CPU+multiple ASICs, or an ASIC inside an SoC acting as a logically independent heterogeneous subsystem that must cooperate with the other subsystems. An SoC can thus be regarded as a system composed of multiple heterogeneous parallel subsystems such as CPU+GPU, CPU+ISP, CPU+Modem, and so on.
Super-heterogeneity can also be seen as an organic composition of multiple logically independent heterogeneous subsystems, but it differs from an SoC: in a typical SoC, different modules cannot exchange data directly at a high level and must communicate indirectly through CPU scheduling.
CPU+GPU
As shown in the figure below, this is a typical GPU server motherboard topology used in machine learning, and also a typical SoB (system on board). Two general-purpose CPUs and eight GPU accelerator cards are connected through the motherboard: the two CPUs are linked by UPI/QPI; each CPU connects over two PCIe buses to two PCIe switches; each PCIe switch connects to two GPUs; and the GPUs are additionally interconnected by the NVLink bus. A minimal software-side sketch of driving such a topology follows.
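As a rough illustration of how software drives such a topology, here is a minimal sketch using PyTorch (an assumption on our part; the article names no framework). The host CPU decomposes a batch across all visible GPUs, each GPU computes its share, and results are gathered back to host memory; any device-to-device traffic would ride PCIe or NVLink depending on how the board is wired:

```python
# A minimal sketch, assuming PyTorch and zero or more CUDA GPUs. The host CPU
# partitions a batch across all visible GPUs, launches the same elementwise
# computation on each, then gathers the results back to host memory.
import torch

def scatter_compute_gather(batch: torch.Tensor) -> torch.Tensor:
    """Square every element, spreading the batch across all visible GPUs."""
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return batch * batch  # no GPU available: fall back to the CPU
    chunks = batch.chunk(n_gpus, dim=0)  # the CPU decomposes the problem
    results = []
    for i, chunk in enumerate(chunks):
        x = chunk.to(f"cuda:{i}", non_blocking=True)  # host -> GPU over PCIe
        results.append(x * x)                         # executed on GPU i
    # Gather back to host memory; direct GPU-to-GPU traffic, if any, would
    # travel over NVLink or PCIe depending on the board's wiring.
    return torch.cat([r.to("cpu") for r in results], dim=0)

if __name__ == "__main__":
    data = torch.arange(16, dtype=torch.float32)
    assert torch.equal(scatter_compute_gather(data), data * data)
```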
CPU+DSA
The TPU is the industry's first DSA-architecture chip. The figure above is the block diagram of TPU v1. It connects to the CPU through a PCIe Gen3 x16 bus: instructions are sent from the CPU into the TPU's instruction buffer, and the TPU's operation is controlled by the CPU. Furthermore, data interaction takes place between the two memories: the CPU initiates each transfer, and the TPU's DMA engine performs the actual data movement, as the sketch below illustrates.
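The control pattern just described (the CPU fills the instruction buffer, the CPU initiates transfers, and the device's DMA engine moves the data before the device executes) can be sketched in a few lines. The `Accelerator` class and its `SQUARE` opcode are purely hypothetical illustrations, not a real TPU interface:

```python
# A minimal sketch of CPU-controlled accelerator offload. Everything here is
# a hypothetical model for illustration, not an actual TPU API.
class Accelerator:
    def __init__(self):
        self.instruction_buffer = []  # filled by the host CPU over PCIe
        self.device_memory = {}

    def dma_in(self, name, host_data):
        # The host initiates the transfer; the device's DMA engine executes it.
        self.device_memory[name] = list(host_data)

    def run(self):
        # The device consumes its instruction buffer autonomously.
        for op, src, dst in self.instruction_buffer:
            if op == "SQUARE":
                self.device_memory[dst] = [v * v for v in self.device_memory[src]]
        self.instruction_buffer.clear()

    def dma_out(self, name):
        return self.device_memory[name]

# Host-side (CPU) control flow:
tpu = Accelerator()
tpu.dma_in("activations", [1, 2, 3])                     # CPU-initiated DMA in
tpu.instruction_buffer.append(("SQUARE", "activations", "result"))
tpu.run()                                                 # device executes
print(tpu.dma_out("result"))                              # [1, 4, 9]
```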
Super-Heterogeneous Topology
Challenges
In complex computing scenarios such as cloud computing, edge computing, and terminal supercomputing (such as autonomous driving), chip programmability is essential, at times valued even above performance. Were it not for the breakdown of Moore's Law, the data center would still be dominated by CPUs (even though the CPU's performance efficiency is the lowest).
Performance and programmability are the two factors that most affect whether a large chip can be deployed at scale. How to balance them is a perennial topic in large-chip design.
CPUs, GPUs, DPUs, ASICs, and other high-compute chips face common challenges, including:
1) The contradiction between single-engine performance and flexibility. The CPU has good flexibility but insufficient performance; the ASIC has the ultimate performance but poor flexibility.
2) Business differences and iteration. The main current approach to this problem is customization. With FPGAs, if the deployment scale is too small, cost and power consumption are too high; with custom silicon, scenarios may be fragmented, making it hard to deploy a large chip at scale or to reduce cost.
3) Macro computing power requires chips that support large-scale deployment: total computing power is proportional both to the computing power of each chip and to the scale of deployment. However, many performance-improvement schemes sacrifice programmable flexibility, making large-scale deployment difficult, which in turn limits performance growth. The most typical example is the current difficulty of deploying AI chips at scale.
4) Chips should be "universal" enough to be deployed in many scenarios in order to cut cost.
5) Ecosystem construction. Large chips require a software framework and ecosystem; the barrier to entry is high and long-term accumulation is required.
6) Convergence of computing platforms. Compatibility and fast iteration require a unified hardware platform and system stack.
Solutions
Super-heterogeneity can be seen as a new, very large system formed by the "organic" parallel combination of the CPU with other xPUs.
It is a new macro system composed of CPUs, GPUs, FPGAs, DSAs, ASICs, and various other acceleration engines.
The difference between super-heterogeneity and the SoC
For example, a hyper-heterogeneous processing unit (HPU) can be regarded as an SoC, yet it differs greatly from a traditional SoC. Without recognizing these differences, it is impossible to understand the nature of the HPU. The table below shows some typical differences.
Why Does Super-Heterogeneity Appear?
The Breakdown of CPU-Based Moore's Law
As systems become more complex, highly flexible processors are required; as performance problems grow more serious, custom accelerated processors are needed. This pair of contradictions has driven the variety of computing chip designs. The essential point is that a single processor cannot balance performance and flexibility.
The CPU's flexibility is excellent. As long as it meets the performance requirements, the CPU is the natural option for complex computing scenarios such as cloud computing and edge computing. However, constrained by its performance bottleneck and the ever-increasing demand for computing power, the CPU has gradually become a non-mainstream computing chip.
For CPU+xPU heterogeneous computing, since the main computing work is done by the xPU, the xPU determines both the performance and the flexibility of the system:
1) CPU+GPU: Although it delivers an order-of-magnitude performance improvement over the CPU while keeping sufficient flexibility, its computing efficiency is still far from ideal.
2) CPU+DSA: Because the DSA has low flexibility, it is not well suited to application-layer acceleration. A typical case is AI: at present, training and part of inference are still done mainly on CPU+GPU, and DSA-architecture AI chips have not yet been deployed at scale.
Chiplet Technology
Without chiplets, a CPU or xPU might integrate 50 cores; with chiplets, we can integrate 200 cores in a single package by combining four dies. But does this really maximize the value of chiplets? The answer is no.
Chiplets allow us to build super-large systems, an order of magnitude larger, at the level of a single chip. We can then exploit certain "features" of such large systems for further optimization. These characteristics are:
1) A complex system is composed of hierarchical, partitioned tasks.
2) Infrastructure-layer tasks are relatively fixed and well suited to DSA/ASIC engines, which can improve performance while still satisfying the tasks' basic flexibility requirements.
3) The non-accelerable parts of applications are suited to the CPU. Typically, the applications users actually care about account for only about 20% of the whole system, and the parts that cannot be accelerated usually account for less than 10%. To deliver the best user experience, these tasks are best placed on the CPU.
4) The accelerable parts of applications, whose algorithms differ and iterate, are suited to elastic acceleration engines such as GPUs and FPGAs, which provide the best achievable performance together with sufficient flexibility and programmability.
Under this division of labor:
1) From a macro perspective, 80% or even more than 90% of computation is done in DSAs, so the whole system approaches the ultimate performance of DSA/ASIC.
2) The non-accelerable part of the applications the user cares about accounts for less than 10% and still runs on the CPU. That is, what users see is still 100% CPU-level programmability. In other words, a super-heterogeneous architecture delivers extreme performance while retaining extremely high flexibility (see the Amdahl's law sketch below).
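The usual way to quantify this division of labor is Amdahl's law (our framing; the article itself does not cite it): if a fraction p of the work runs on engines that are s times faster, the overall speedup is 1 / ((1 - p) + p / s). The numbers below are assumed for illustration and show why shrinking the CPU-resident fraction matters so much:

```python
# Amdahl's law: overall speedup when a fraction p of the work is accelerated
# by a factor s, while the remaining (1 - p) stays on the CPU at speed 1.
def amdahl(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# Illustrative numbers only: 90% (then 99%) of computation offloaded to
# DSA/ASIC engines, each assumed to be 100x faster than the CPU for its task.
print(f"{amdahl(0.90, 100):.1f}x")  # ~9.2x: the 10% CPU residue dominates
print(f"{amdahl(0.99, 100):.1f}x")  # ~50.3x: pushing more work off the CPU
```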
New Concepts and Technologies
On the one hand, heterogeneous programming is difficult: only after years of hard work has NVIDIA made CUDA programming friendly enough for developers and built a healthy ecosystem around it. On the other hand, the difficulty of super-heterogeneity shows up not only in programming but also in the design and implementation of the processing engines, and even in the integration of software and hardware. So how can super-heterogeneity be mastered? Attend to the following aspects:
1) Performance and flexibility. From a system perspective, tasks migrate from the CPU to hardware acceleration; the question is how to choose the right processing engine for each task so as to achieve the best performance and flexibility at the same time.
2) Programmability and ease of use. Software now defines the hardware. We must consider how to reuse existing software assets and how to integrate into cloud services.
3) User needs. Think about the requirements themselves and how they differ across users. Giving a man a fish is worse than teaching him to fish: in short, provide users with a fully programmable hardware platform of extreme performance, so they can meet their own needs.
The integration of software and hardware provides systematic concepts, methods, technologies, and solutions to these problems, offering a feasible path toward mastering super-heterogeneity.
The Future: Hyper-Heterogeneous Chips
To maximize computing power, we must on the one hand keep improving performance and on the other hand preserve the chip's flexibility. With a super-heterogeneous approach, classifying tasks by category and adopting the optimal engine for each, we can achieve both.
In the future, only super-heterogeneous computing can deliver order-of-magnitude increases in computing power without sacrificing flexibility. It has great potential and can better support the development of the digital economy and society.
If you are interested in the semiconductor industry, please follow KUKE Electronics to learn more.