4x4 Systolic Array Matrix Multiplication, Scalable to 256x256
Venkata Guru Prasanth Mulleti
Electrical Engineer | Semiconductor Industry | ASIC Design | Digital Design | Physical Design | Digital Design Verification
The central component of the TPU is its systolic array, a grid of Multiply-Accumulate (MAC) units arranged in an N×N configuration, where N equals 256. The TPU uses a weight-stationary architecture: weights are loaded into the MAC array beforehand, and activations are then fed in from the activation storage buffer. Activations travel horizontally from left to right, while partial sums flow vertically from top to bottom. The output of the matrix multiplication is passed to the activation unit, which provides hardware support for common activation functions.
Systolic Array:
A 4x4 systolic array requires 16 MAC units, interconnected so that data moves through them in lockstep during matrix multiplication. The figure shows the flow of inputs and outputs through the array. When the control signal is 1, the weights are loaded from the weight memory. Once all weights are loaded and the control signal is set to 0, the activations from the activation memory are streamed through the array in systolic fashion. To produce the correct acc_out, the activations must be skewed: a10 must arrive one cycle after a00, a20 one cycle after a10, and a30 one cycle after a20, so that the partial sum (acc_out) from the row above is ready when the matching activation arrives. The outputs emerge in the same pipelined pattern: Y00 appears at the 4th cycle, Y01 at the 5th cycle, and so on.
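The input skew described above can be sketched in a few lines of Python. This is only an illustrative model, not the Verilog from the repository: the function names `skew_activations` and `output_edge` are hypothetical, and the sketch assumes that row r of input vector t enters the left edge at cycle t + r and that every MAC registers its outputs, so the result for column c and vector t is available after N + c + t clock edges.

```python
def skew_activations(vectors):
    """Return the left-edge inputs per cycle.

    vectors[t][r] is the activation for row r of input vector t.
    Row r of vector t is scheduled at cycle t + r; idle slots are filled with 0.
    """
    n = len(vectors[0])
    total_cycles = len(vectors) + n - 1
    schedule = []
    for cycle in range(total_cycles):
        column = []
        for row in range(n):
            t = cycle - row  # which input vector reaches this row now
            column.append(vectors[t][row] if 0 <= t < len(vectors) else 0)
        schedule.append(column)
    return schedule


def output_edge(n, col, t):
    """Clock edge after which column `col` presents the result for vector t.

    Assumes one register stage per MAC in both directions.
    """
    return n + col + t


# Two example input vectors for a 4x4 array (values are arbitrary).
for cycle, inputs in enumerate(skew_activations([[1, 2, 3, 4], [5, 6, 7, 8]])):
    print(cycle, inputs)
```

Under these assumptions, `output_edge(4, 0, 0)` is 4 and `output_edge(4, 0, 1)` is 5, which matches the 4th-cycle and 5th-cycle outputs mentioned above.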
I implemented a weight-stationary 4x4 systolic array in Verilog, scalable to 256x256, verified the design in ModelSim, and ran logic synthesis with Synopsys Design Compiler. Three aspects need careful attention when designing a systolic array: the design and functional correctness of the MAC unit, the array-level design and dataflow, and the correct interconnection of the MAC units within the array.
MAC (Multiply and Accumulate) unit:
As shown in the figure above, the MAC unit multiplies the data by the weight and adds the product to the acc_in value. Its inputs are acc_in (32 bits), data_in (8 bits), wt_path_in (8 bits), and the global controls (clock, control); its outputs are acc_out (32 bits), data_out (8 bits), and wt_path_out (8 bits). Because the design is weight-stationary, the MAC unit holds its weight while control is 0; when control is 1, the weights are loaded into the PEs.
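A minimal behavioral sketch of such a MAC unit in Python may help make the port list concrete. It is a software model under stated assumptions, not the repository's Verilog: the class name `Mac` is hypothetical, operands are treated as unsigned, the weight register is assumed to drive wt_path_out directly, and the accumulator and data registers are assumed to clear while control is 1.

```python
MASK8 = 0xFF
MASK32 = 0xFFFFFFFF


class Mac:
    """Behavioral model of one weight-stationary MAC unit (PE)."""

    def __init__(self):
        self.weight = 0    # stationary 8-bit weight
        self.acc_out = 0   # registered 32-bit partial sum
        self.data_out = 0  # registered 8-bit activation passed to the next PE

    @property
    def wt_path_out(self):
        # Assumption: the weight register itself feeds the next PE in the chain.
        return self.weight

    def clock(self, acc_in, data_in, wt_path_in, control):
        """Model one rising clock edge with the given sampled inputs."""
        if control == 1:
            # Load phase: take in a new weight and clear the datapath registers
            # (the clearing is an assumption of this sketch).
            self.weight = wt_path_in & MASK8
            self.acc_out = 0
            self.data_out = 0
        else:
            # Compute phase: the weight is held; accumulate and forward the data.
            self.acc_out = (acc_in + (data_in & MASK8) * self.weight) & MASK32
            self.data_out = data_in & MASK8


pe = Mac()
pe.clock(acc_in=0, data_in=0, wt_path_in=3, control=1)   # load weight 3
pe.clock(acc_in=10, data_in=4, wt_path_in=0, control=0)  # 10 + 4 * 3
print(pe.acc_out, pe.data_out, pe.weight)  # prints: 22 4 3
```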
MMU (Matrix Multiplication Unit):
The MMU (Matrix Multiplication Unit) is the top-level module and represents the systolic array itself. It instantiates the MAC (Multiply-Accumulate) units in a 2D grid, routes the inputs through them, and produces the accumulated outputs. Each MAC module performs a single multiply-accumulate operation: it multiplies the data by its weight, adds the incoming partial sum, and passes the data and the accumulated result onward, with control for weight loading and accumulator reset. The MMU orchestrates the interaction between these MAC modules so that, together, they perform matrix multiplication.
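To illustrate how the grid of MAC units produces a matrix product, here is a small cycle-by-cycle Python model of an N x N weight-stationary array. It is a sketch under assumptions rather than a translation of the actual MMU: the function name `simulate_mmu` is hypothetical, weights are preloaded directly instead of being shifted in, operands are unsigned, and each PE has one register stage for data and one for the partial sum.

```python
def simulate_mmu(weights, vectors):
    """Cycle-by-cycle model of an N x N weight-stationary systolic array.

    weights[r][c] is the weight held by the PE in row r, column c.
    vectors[t][r] is the activation fed to row r for input vector t.
    Returns {(c, t): value}, the result for vector t at the bottom of column c.
    """
    n = len(weights)
    data = [[0] * n for _ in range(n)]  # registered data_out of each PE
    acc = [[0] * n for _ in range(n)]   # registered acc_out of each PE
    results = {}
    total_edges = len(vectors) + 2 * n - 2
    for edge in range(1, total_edges + 1):
        cycle = edge - 1  # inputs sampled at this edge were driven in `cycle`
        new_data = [[0] * n for _ in range(n)]
        new_acc = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    # Left edge: skewed input, row r is delayed by r cycles.
                    t = cycle - r
                    data_in = vectors[t][r] if 0 <= t < len(vectors) else 0
                else:
                    data_in = data[r][c - 1]  # from the PE on the left
                acc_in = acc[r - 1][c] if r > 0 else 0  # from the PE above
                new_data[r][c] = data_in
                new_acc[r][c] = (acc_in + data_in * weights[r][c]) & 0xFFFFFFFF
        data, acc = new_data, new_acc
        # The bottom row of column c holds the result for vector t = edge - n - c.
        for c in range(n):
            t = edge - n - c
            if 0 <= t < len(vectors):
                results[(c, t)] = acc[n - 1][c]
    return results


W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
A = [[1, 0, 0, 0], [1, 1, 1, 1]]
out = simulate_mmu(W, A)
for (c, t), value in sorted(out.items()):
    # Compare each array output with a plain dot product.
    expected = sum(A[t][r] * W[r][c] for r in range(4))
    print(f"column {c}, vector {t}: {value} (reference {expected})")
```

The example values for `W` and `A` are arbitrary and are not the test values used in the testbench.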
Verilog code:
Please see my GitHub repository for the complete code for the scalable systolic array matrix multiplication. I experimented with two approaches. The first directly maps the Processing Elements (PEs) for a 4x4 matrix size and wires them explicitly. The second, implemented in SystemVerilog, scales to any NxN matrix size.
To verify the design, I wrote a testbench that performs the matrix multiplication; it is available in the GitHub repository. The simulation waveform shows the expected outputs, confirming that the design functions correctly.
Testbench Stimulus:
The weights and data are each driven as a packed array: the 32-bit value is split into four 8-bit fields, one for each processing element (PE). The test values for the weight and activation memories are as follows:
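The splitting of a 32-bit word into 8-bit fields can be sketched as follows. The helper names `unpack_lanes` and `pack_lanes` are hypothetical, and the byte order (least-significant byte goes to lane 0) is an assumption; the actual mapping in the testbench may differ.

```python
def unpack_lanes(word, lanes=4, width=8):
    """Split a packed word into `lanes` fields of `width` bits each.

    Assumption: lane 0 takes the least-significant field.
    """
    mask = (1 << width) - 1
    return [(word >> (width * i)) & mask for i in range(lanes)]


def pack_lanes(values, width=8):
    """Inverse of unpack_lanes: pack per-PE values into a single word."""
    mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (width * i)
    return word


print(unpack_lanes(0x04030201))           # prints: [1, 2, 3, 4]
print(hex(pack_lanes([1, 2, 3, 4])))      # prints: 0x4030201
```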
Outputs Verified:
Simulation waveform results:
The outputs can be observed at the bottom-row PEs of the array, namely {m33, m23, m13, m03}. At the marker, the first output Y00 appears as 32'h08 from m03. In the next cycle, Y10 = 32'h07 appears from m13 and Y01 = 32'h15 from m03, and so on.
Outputs of Logic synthesis:
After running logic synthesis on the Matrix Multiply Unit (MMU) design with the ASAP7 PDK, I generated timing, power, area, synthesis, and violation reports. The input to logic synthesis is the behavioral (RTL) Verilog description; the output is a structural Verilog netlist containing only standard-cell instantiations.
Slack: Slack is the margin between the time a signal is required and the time it actually arrives (for a setup check, required time minus arrival time). Positive slack means the signal arrives early enough to meet timing; negative slack indicates a timing violation. With the clock period set to 500, I got a positive slack of 0.84.
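For reference, the usual setup-slack relation can be written out as below. This is the general textbook form, not a figure taken from the synthesis report; the symbols for setup time and clock uncertainty are introduced here for illustration only.

```latex
\text{slack}_{\text{setup}} = t_{\text{required}} - t_{\text{arrival}},
\qquad
t_{\text{required}} \approx T_{\text{clk}} - t_{\text{setup}} - t_{\text{uncertainty}}
```

With a clock period of 500 and a reported slack of 0.84, the critical path therefore arrives 0.84 time units before it is required.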
Power:
Area: