RTL vs. Software Mentality in FPGA/ASIC Design: Latency From 161 Down to 2 Clock Cycles!
Yousef B. Bedoustani, PhD. Eng.
Principal FPGA & Hardware Engineer
Why are High-Level Synthesis (HLS) tools not sufficient for FPGA/ASIC design? This is a long-standing question. In other words, why does a building developed by an alien from the CPU software planet sometimes look tilted on the RTL planet? Why do HLS tools, which carry the nature of the CPU methodology, not reach the same performance on the RTL planet as RTL tools? The answer lies in two important factors:
- RTL knowledge or RTL Mentality of developers
- Methodology of High-Level Synthesis (HLS) compilers
The first sentence in RTL coding and FPGA/ASIC design courses is:
Forget All Your CPU Software Mentality!
We know that CPU software development follows a sequential mentality, which differs from the parallel mentality of RTL. This sequential mentality comes from the fact that CPU software developers must follow the sequential rules imposed by the sequential nature of the CPU.
By contrast, there is a high level of flexibility in RTL digital design for implementing an algorithm. On the one hand, an RTL implementation can be sequential logic, combinational logic, or a combination of the two. On the other hand, in RTL design one can have several different, innovative solutions for a single algorithm. Another property of design on the RTL planet is that one has full control and full observation of the design's operation. By contrast, on the CPU software planet, developers do not need this level of controllability and observability over the CPU's fixed operations. This is a major property that separates the RTL mentality and methodology from the CPU software mentality and methodology.
Obviously, the cost of this high level of flexibility is that RTL design is difficult, time-consuming to develop and debug, and therefore expensive. This is exactly why HLS tools are attractive on the RTL planet. However, if the HLS developer does not have RTL mentality and knowledge, the result will be poor.
This LinkedIn article tries to explain the CPU software mentality and the effects of the HLS methodology on RTL design using the following simple example:
Example: Design a core that compares N=32 single-precision floating-point numbers with a constant (e.g., 0.5) and counts the number of inputs that satisfy the condition.
- All inputs are available at the same time at the input of the core
- The goal is to achieve the lowest latency
- The design must meet the timing constraint: FPGA clock of 200 MHz
- Do not consider other parameters such as FPGA resource utilization
- Targets are Xilinx Kintex-7 and Intel Arria 10 FPGAs
Appendix A shows the Xilinx and Intel solutions including:
- HLS: Vivado HLS and Intel HLS implementations
- MATLAB/Simulink HDL Coder: Xilinx and Intel implementations
- Low-level RTL design: Xilinx System Generator for DSP (XSG) and Intel DSP Builder implementations, VHDL/Verilog
Latency results for the Xilinx and Intel FPGAs are summarized in Table 1 and Table 2. One of the reasons for the large difference in latency (e.g., 161 vs. 2 clock cycles) is that HLS tools do not offer enough flexibility in managing data flow and input/output streaming. It is nevertheless clear that RTL design can still reach the best latency, because in the RTL design methodology there is full flexibility over data flow and input/output streaming. This is one of the vital RTL properties over which HLS tools, so far, do not have enough control and flexibility.
Conclusions
- Some limitations of HLS tools come from the fact that certain HLS properties are inherited from the CPU software mentality and methodology
- A high-level, pure CPU software mentality does not satisfy all RTL design requirements. A software developer who intends to design for FPGA/ASIC must break out of the CPU software mentality they have grown into
- Low-level RTL design is expensive, time-consuming, and hard to debug. However, it can reach the best performance
- Combining low-level RTL design with high-level design (model-based tools or high-level programming such as HLS) is the key that guarantees performance while reducing development/debugging time and cost
- Model-based tools such as Xilinx System Generator (XSG), Intel (Altera) DSP Builder, and MATLAB/Simulink HDL Coder help improve the RTL implementation. However, they are not sufficient on their own; the developer's low-level, innovative RTL mentality is the key to achieving the required performance within an acceptable development time
The important question is: which mentality is driving the HLS tools?
Appendix A
1. HLS Tools
Xilinx Vivado HLS
A Xilinx Vivado HLS implementation of the algorithm could be as below:
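The original snippet is not reproduced here, so the following is only a minimal C++ sketch of what such a Vivado HLS kernel might look like. The function name, the array-style interface, and the assumption that the condition is "greater than 0.5" are illustrative choices, not the author's original code.

```cpp
// Hypothetical Vivado HLS kernel: count how many of the 32 inputs exceed 0.5.
// The name count_gt and the plain array interface are assumptions.
#define N 32

int count_gt(const float u[N]) {
    int s = 0;
    for (int i = 0; i < N; i++) {
        if (u[i] > 0.5f) {  // compare each input with the constant
            s++;            // count the inputs that satisfy the condition
        }
    }
    return s;
}
```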
Without any directives (i.e., with no RTL knowledge or RTL mentality applied), the latency is 161 clock cycles.
Using the "UNROLL factor=16 region" directive, the latency drops to 31 clock cycles. In this particular example the for loop contains an if condition followed by a set-bit counter (s++), so the tool cannot completely unroll the for loop. For an RTL developer, however, it is obvious that the comparison part of the for loop can be completely unrolled. This is a good example of the weakness of a pure CPU software mentality and methodology in RTL digital design.
Finally, with the ARRAY_PARTITION directive on the u input and the PIPELINE and UNROLL directives on the for loop, the latency can be decreased to 19 clock cycles.
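As a hedged illustration only, the directives described above could be expressed as pragmas roughly as follows; the exact pragma placement and factors are assumptions based on the text, not the original project files.

```cpp
// Hypothetical directive-annotated version of the same kernel.
// ARRAY_PARTITION exposes all 32 inputs at once; PIPELINE and UNROLL
// let the comparisons run in parallel instead of one per loop iteration.
#define N 32

int count_gt(const float u[N]) {
#pragma HLS ARRAY_PARTITION variable=u complete dim=1
    int s = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE
#pragma HLS UNROLL
        if (u[i] > 0.5f) {
            s++;
        }
    }
    return s;
}
```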
Even though the HLS tool was able to decrease the latency from 161 to 19 clock cycles, the RTL output of Vivado HLS is hard to track and is not clear (Figure 1). In other words, the developer does not necessarily know how the design has been implemented. However, correct operation (though not necessarily the best performance) is guaranteed by the tool. This property is inherited from the CPU software mentality and methodology: on the CPU software planet, developers do not necessarily need to know how their code maps to CPU micro-operations. On the other hand, do not forget the simplicity of the HLS implementation, about 10 lines of C++ code in this example.
Figure 1. Xilinx synthesis and Vivado RTL view for Vivado HLS implementation
Intel HLS
The Intel HLS solution could be as below:
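Again, the original snippet is not reproduced here; the following is only a sketch of how the same loop might look as an Intel HLS component. The component keyword marks the function for hardware generation; the function name, interface, and the assumption that the plain loop corresponds to the pipelined variant while "#pragma unroll" gives the unrolled variant are illustrative, not the author's original code.

```cpp
// Hypothetical Intel HLS component for the same counting kernel.
// Leaving the loop as-is lets the compiler pipeline it (the 25-cycle case below);
// enabling "#pragma unroll" would give the unrolled variant (the 47-cycle case).
#include "HLS/hls.h"

#define N 32

component int count_gt(float u[N]) {
    int s = 0;
    // #pragma unroll
    for (int i = 0; i < N; i++) {
        if (u[i] > 0.5f) {
            s++;
        }
    }
    return s;
}
```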
With the loop pipelined, the latency is 25 clock cycles at a maximum frequency of 240 MHz.
Figure 2. Intel HLS report for pipeline solution
With the loop unrolled, the latency is 47 clock cycles at a maximum FPGA clock of 240 MHz.
Figure 3. Intel HLS report for unroll solution
2. MATLAB/Simulink HDL Coder
A model-based design such as the MATLAB/Simulink HDL Coder implementation could be as shown in Figure 4. With some RTL knowledge and mentality it is clear that, in situations such as the current example, the combination of a loop and an if statement can be fully unrolled, as sketched below. The latency achieved by HDL Coder is 3 clock cycles, and it passes the timing constraint on both the Kintex-7 and the Arria 10 FPGAs.
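To make the "fully unrolled" idea concrete, here is a small behavioral C++ sketch of the structure that such a model-based flow maps to hardware: all 32 comparisons evaluated side by side, followed by a sum of the resulting bits. The function name and the split into two stages are assumptions for illustration, not HDL Coder output.

```cpp
// Behavioral sketch of the fully unrolled structure (not actual HDL Coder output):
// stage 1 - 32 independent comparators produce a vector of 0/1 bits;
// stage 2 - the bits are summed to give the final count.
#define N 32

int count_gt_unrolled(const float u[N]) {
    int bit[N];
    // In hardware all N comparators exist in parallel; this loop only models them.
    for (int i = 0; i < N; i++) {
        bit[i] = (u[i] > 0.5f) ? 1 : 0;
    }
    // The bit sum is what the adder network of Figure 5 implements.
    int s = 0;
    for (int i = 0; i < N; i++) {
        s += bit[i];
    }
    return s;
}
```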
Figure 4. MATLAB/Simulink HDL Coder implementation
Figure 5. RTL implementation of the addition of 32 Boolean inputs by MATLAB/Simulink HDL Coder
Figure 6. HDL Coder workflow for Xilinx and Intel FPGAs
Figure 7. Vivado RTL view of the Xilinx implementation of the MATLAB HDL Coder output
Figure 8. Intel Quartus RTL view of the MATLAB HDL Coder implementation
3. Low-Level RTL Mentality
Xilinx System Generator Implementation
Figure 9 and Figure 10 show two solutions for the XSG low-level RTL implementation. The figures visualize the idea; the same algorithm can easily be written directly in VHDL or Verilog.
As shown in Figure 9, ADD blocks are used in a chain; the whole ADD-block chain can be synthesized to complete in one clock cycle. In Figure 10, the Hamming weight (population count) method is used, and the total latency is 2 clock cycles.
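As a hedged illustration of the two structures, the behavioral C++ models below describe only the reduction of the 32 comparison bits, not the XSG block diagrams themselves: solution 1 sums the bits in a linear chain of adders, while solution 2 reduces them as a balanced tree, i.e., computes the Hamming weight of the 32-bit comparison vector. Function names are illustrative.

```cpp
// Solution 1: linear adder chain (one long combinational path of 32 adders).
int sum_chain(const int bit[32]) {
    int s = 0;
    for (int i = 0; i < 32; i++) {
        s += bit[i];  // each adder feeds the next one in the chain
    }
    return s;
}

// Solution 2: balanced adder tree / Hamming weight (log2(32) = 5 adder levels),
// giving a much shorter critical path and enabling the 2-cycle result.
int sum_tree(const int bit[32]) {
    int level[32];
    for (int i = 0; i < 32; i++) level[i] = bit[i];
    for (int width = 32; width > 1; width /= 2) {
        for (int i = 0; i < width / 2; i++) {
            level[i] = level[2 * i] + level[2 * i + 1];  // pairwise adds per level
        }
    }
    return level[0];
}
```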
Figure 9. Xilinx System Generator Implementation (solution 1)
Figure 10. Xilinx System Generator Implementation (solution 2)
The RTL view is shown in Figure 11, with a clear, balanced, and traceable data path. On the RTL planet one must be able to track, monitor, and control the data path; this is vital for analyzing the critical path and solving timing problems. This property comes from the nature of design on the RTL planet, and on the software planet developers neither have nor need it. It seems that this limitation, too, is inherited from the software methodology into the HLS methodology. The problem appears when a software mentality tries to implement an algorithm on the RTL planet using HLS tools: although correct operation is guaranteed by the HLS tools, the data path is not necessarily as traceable and analyzable as it is in the RTL design methodology.
Figure 11. Vivado RTL view of XSG solution 1
Intel DSP Builder implementation
Figure 12 shows the Intel DSP Builder implementation. The ADD and Compare blocks in Intel DSP Builder accept vector inputs.
Figure 12. Intel DSP Builder implementation
Figure 13. Quartus RTL view of the Intel DSP Builder solution
Comments

Sales Strategy & Business Management at AMD (3 years ago):
Interesting article. I would echo most of the comments here, but in particular the point that there is a place for both HLS and RTL design. For those functions in your design that truly need the highest performance, RTL is likely the right option; for others, where you can accept some performance drop in exchange for ease of use, HLS would be the method of choice. That said, to get the best out of HLS, I tend to agree that an understanding of the underlying architecture will let you get the best results via HLS too.
Quote: "We know that the CPU software developing is sequential mentality which is different by RTL parallel mentality. This sequential mentality comes from the fact that the CPU software developers should follow the sequential rules that are forced by CPU sequential nature." There is a common prejudice: software engineers always think sequentially when building their applications. That's absolutely wrong! Have you ever written multi-process or multi-threaded parallel software applications in C, and faced challenges like race conditions, deadlocks and process/thread synchronization? Have you ever written distributed and parallel software applications involving parallel computing nodes in different locations? These applications have been written ever since the first multi-socket supercomputers were available, and the libraries and methodologies have evolved to produce current technologies like PThreads, OpenMP, Message Passing Interface, etc. for current multithreaded, multicore, networked computing platforms. The programing model for current GPGPU NUMA platforms is very related to the programming model for the platforms I mentioned before. Also, lots of software engineers nowadays face the challenge of optimizing code to take advantage of several micro-architectural features present in modern CPUs that increase Instruction-Level Parallelism (pipelining, superscalar, speculative execution, etc.) Of course, designing parallel digital hardware systems is different to designing parallel software applications, since they live at **different. levels of abstraction**. However, well trained software engineers are able to not only know and exploit the underlying parallel hardware, but to learn the RTL design flow, languages and synthesis tools to design high-performance digital hardware systems.
Very simple example... you need to be able to think in parallel and spatially also
Embedded Software Designer - Signal Processing and Control - Gentec (4 years ago):
Couldn't you use a "foreach" block in Simulink? Would the end result be the same?