Accelerated Computing Series Part 3: Deep Learning Accelerator on FPGA & Linux Drivers
Neeraj Kumar, PhD
Founder Director - Accordyne | ex-CTO@iRFMW, Huawei, BertLabs | IISc
Another weekend, another tech blog! This time we’re going to delve deeper into one of FPGAs’ key strengths, hardware flexibility, by implementing a deep learning accelerator core in the fabric. As you know, GPUs have fixed hardware. The software (e.g., backed by CUDA libraries & the API for Nvidia) needs to be mapped to this hardware architecture to eke out the maximum hardware utilization efficiency and thus performance per watt. The same applies to FPGAs & the associated software stack, only now you have complete control over the hardware architecture too. In this part, we configure and implement AMD’s Deep Learning Processing Unit (DPU) IP core and run a few hands-on inference tasks on popular models such as ResNet.
Like many other vendors, AMD/Xilinx provides a range of reconfigurable hardware aimed at different market segments with different needs. For example, accelerator cards meant for datacentres are tuned towards higher throughput, serving multiple workloads. Latency is crucial there, so they use technologies like high bandwidth memory (HBM), consequently driving the cost up. Meanwhile, for edge computing scenarios, we need better efficiency and a lower power draw. AMD therefore provides different deep learning accelerator IP cores suited to these hardware architectures. The one we’re going to use goes by the name DPUCZDX8G. The name decodes as: DPU – C (convolution) – ZD (Zynq DDR) – X8 (model quantization from 32-bit to 8-bit via the DECENT tool) – G (design target: general purpose; the others being high throughput, low latency, and cost optimized)
There used to be a similar offering, DNNDK, several years back (I understand that DeePhi, the company behind it, was acquired by AMD/Xilinx). I chose to go with the DPUCZDX8G core for two reasons:
Note that DPUCZDX8G currently supports ZCU102, ZCU104, and Kria KV260 (good value proposition) out of the box, with SD card images available for the reference design. However, in my case, I’ll need to port the reference design to my board. This is possible because this DPU supports all Zynq UltraScale+ MPSoC devices.
Since this series is more about hands-on accelerated computing with Linux, I encourage you to look up the product guide (PG338) for details on this IP. I’ll just briefly go over it here, taking excerpts from that doc.
The DPUCZDX8G IP is a programmable microcoded compute engine for convolutional neural networks (CNNs). The top-level architecture is illustrated below:
It uses a specialized instruction set that allows for efficient implementation of several CNNs, such as VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, etc. The compiled microcode it executes is generated from a NN graph by the Vitis-AI compiler, as will be demonstrated in this blog. The IP fetches instructions from off-chip memory (DDR RAM), while the on-chip memory (Global Memory Pool) is used to buffer input activations, intermediate feature maps, and some metadata for high throughput & efficiency. The Application Processing Unit (APU) runs the program that services interrupts and coordinates data transfers. The processing elements (PEs) do the actual computational grunt work using multipliers, adders, and accumulators built from the DSP48E slices available in the PL, while BRAM (or UltraRAM) is used to store the weights of the network.
Quite obviously, the optimizations that take a big floating-point NN model down to the final network deployed on the core have to be performed beforehand, in the Vitis-AI software stack. This includes quantization, layer fusion, and other graph-level optimizations; we’ll run this flow concretely later in the ‘Build Matching Model’ section.
Generating HDL
Since the DPU IP is not part of the standard IP repositories, we’ll need to build and integrate it into a Vivado project from scratch. Because I will be using v2.5 of the Vitis-AI tools, download ‘DPUCZDX8G_v2.5.tar.gz’ from the link provided in the DPU TRD document for ZCU102, and extract it. Follow the steps below:
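With the archive extracted, the TRD’s Vivado project is generated from a Tcl script. Roughly (directory names are from my memory of the TRD layout, so adjust to your extraction path):

$ tar xzvf DPUCZDX8G_v2.5.tar.gz
$ cd DPUCZDX8G/prj/Vivado
$ vivado -source scripts/trd_prj.tcl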
You should see the project block design as shown below:
The DPU is wrapped inside a hierarchical block. As you can see, this block has no AXI-Lite or AXI-Stream interfaces. This brings us to the third design element in our ongoing structured analysis of accelerated computing components (control & data planes): the memory-mapped, full AXI interface (AXI-MM). Since it is memory mapped, the IP interfaces with the DDR via the PS, through the M_AXI_HPM0_LPD, S_AXI_HP*_FPD, and S_AXI_LPD ports.
Expanding further:
we see that the other hierarchical blocks take care of clock generation, interrupt vectoring, and aggregating the interconnects. The main block is the DPUCZDX8G DPU. Let’s check out its default configuration:
You’ll see that it’s configured for 1 DPU core with the B4096 architecture. Changing these settings directly here will break the block design; such changes should be made in the script, and the script rerun. Other parameters can be changed here though, such as UltraRAM Use per DPU, which I changed from its default of 0 to 50. This keeps the design from being implemented solely with BRAMs and eventually running out of them.
The choice of architecture decides the number of DSP slices you’ll end up using, but also the level of parallelism in the convolution operations. There’s a trade-off; you’ll need to try and see. You may barely fit the design and yet fail to meet timing. The default value of 3 for the number of DPU cores certainly did not fit for me with the B4096 architecture, so I changed it to 1. Refer to the DPU product guide for more details.
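As an aside, the Bxxxx figure is simply the peak number of ops per clock: pixel parallelism × input-channel parallelism × output-channel parallelism × 2 (a multiply and an accumulate count as two ops). Per the product guide’s parallelism table:

B4096: 8 × 16 × 16 × 2 = 4096 ops/cycle
B1024: 8 × 8 × 8 × 2 = 1024 ops/cycle (we’ll use this one later)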
Let’s look at the address map:
The DPU control register block is at 0x8F000000, accessed via a 32-bit-wide S_AXI interface. The details of the register map can be found in the product guide. DPU0_M_AXI_DATA* are the 128-bit-wide data ports, DPU0_M_AXI_INSTR is the 32-bit-wide MM interface for instruction fetch, and SFM_M_AXI is the 128-bit-wide MM interface for softmax data.
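Later, once Linux is up on the board, you can sanity-check that this control space is reachable by peeking at the base address with busybox devmem (register offsets within this space are in the product guide; this just reads the first 32-bit word):

root@xilinx-zcu106-20221:~# devmem 0x8F000000 32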
Generate the bitstream and export the hardware (.xsa file). Don’t forget to include the bitstream while exporting.
Following is the post-implementation resource utilization:
Building Linux
If you have been following this series, you should now be well acquainted with the Petalinux build flow. The only extra work this time is to refer to the ZCU102 BSP I mentioned earlier and make the necessary amendments for the ZCU106. Follow these steps:
$ petalinux-config --get-hw-description=<DPU_dir>/prj/Vivado/prj/<your_xsa_file>
(refer to <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/configs/config to make the necessary changes for ZCU106 in the config GUI that opens)
$ petalinux-config -c rootfs
(refer to <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/configs/rootfs_config for the necessary changes)
Copy and overwrite all subdirectories under <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/meta-user/ into your project. This applies the necessary driver patches to the kernel and installs all the Vitis-AI libraries and dependencies in the rootfs. It’ll also install the ‘resnet50’ application.
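From the Petalinux project root, that copy boils down to something like (path assuming the TRD layout above):

$ cp -rf <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/meta-user/* project-spec/meta-user/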
Finally,
$ petalinux-build
$ cd images/linux
$ petalinux-package --boot --fsbl zynqmp_fsbl.elf --u-boot u-boot.elf --pmufw pmufw.elf --fpga system.bit --force
Burn the SD card and boot the board.
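If you’d rather write a single flashable image, Petalinux can also generate one (if your version supports the --wic option; /dev/sdX below is a placeholder, so double-check your card’s device node before dd’ing):

$ petalinux-package --wic
$ sudo dd if=images/linux/petalinux-sdimage.wic of=/dev/sdX bs=4M status=progress && sync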
Run Resnet50
Once booted, run the following commands:
root@xilinx-zcu106-20221:~# cd app/
root@xilinx-zcu106-20221:~/app# cp model/resnet50.xmodel .
root@xilinx-zcu106-20221:~/app# env LD_LIBRARY_PATH=samples/lib samples/bin/resnet50 img/bellpeppe-994958.JPEG
score[945] = 0.992235 text: bell pepper,
score[941] = 0.00315807 text: acorn squash,
score[943] = 0.00191546 text: cucumber, cuke,
score[939] = 0.000904801 text: zucchini, courgette,
score[949] = 0.00054879 text: strawberry,
root@xilinx-zcu106-20221:~/app#
What happened here is that we ran a precompiled application binary ‘samples/bin/resnet50’ that passed a sample image ‘img/bellpeppe-994958.JPEG’ through a precompiled CNN model ‘resnet50.xmodel’ (pretrained, quantized, optimized, and deployed on the DPU accelerator core).
But when did we compile the model? We didn’t; we just used the one that comes by default with the reference design, and there are many other models. You may want to check out the Vitis-AI GitHub page and look for ‘model_zoo’. You’ll see several models, each with a corresponding model.yaml file listing download links to precompiled models for different hardware platforms.
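For instance, the ResNet-50 entry for ZCU102/ZCU104/KV260 points at an archive shaped roughly like the following (take the exact URL from model.yaml; the filename here is from memory and may differ):

$ wget "https://www.xilinx.com/bin/public/openDownload?filename=resnet50-zcu102_zcu104_kv260-r2.5.0.tar.gz" -O resnet50-r2.5.0.tar.gz
$ tar xzf resnet50-r2.5.0.tar.gz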
The trouble is these are compiled for the B4096 architecture. If you try to change your DPU architecture in Vivado and try to use these models against that IP, you’ll get a fingerprint mismatch error.
So, how do we build a custom DPU with a different architecture, and then quantize, compile, optimize & deploy a matching model on it? Let’s get to it.
Build Custom IP
Let’s change the DPU architecture to B1024 and place 3 DPU cores in the IP. Please make sure to run the script and build a fresh Vivado project for this configuration.
Notice that the number of data ports (channels) has now increased to cater to the larger number of DPU cores, and now the S_AXI_* ports on the PS end up being used.
Build the bitstream and export the hardware.
The resource utilization and power draw are as follows:
This design seems on par with the previous design, except URAM consumption has gone up and BRAM has gone down.
Build Matching Model
This procedure is a little too involved to cover fully here; please refer to the Vitis-AI User Guide. Essentially, we need to quantize the trained float model against a small calibration set and then compile it against our DPU’s arch.json, as sketched below:
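Here is a rough sketch of that flow for a TensorFlow ResNet-50, run inside the Vitis-AI docker container (node names and paths are illustrative; arch.json is the one generated alongside the DPU IP in your Vivado project, and it encodes the fingerprint):

$ vai_q_tensorflow quantize \
      --input_frozen_graph float/resnet50.pb \
      --input_nodes input --input_shapes ?,224,224,3 \
      --output_nodes resnet_v1_50/predictions/Reshape_1 \
      --input_fn input_fn.calib_input --calib_iter 100
$ vai_c_tensorflow \
      --frozen_pb quantize_results/quantize_eval_model.pb \
      --arch arch.json --output_dir compiled --net_name resnet50

Copy compiled/resnet50.xmodel over to the board and point the sample app at it.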
Run Resnet50 Again
root@xilinx-zcu106-20221:~/app# env LD_LIBRARY_PATH=samples/lib samples/bin/resnet50 img/bellpeppe-994958.JPEG
score[945] = 0.980568 text: bell pepper,
score[941] = 0.0179597 text: acorn squash,
score[943] = 0.000894162 text: cucumber, cuke,
score[939] = 0.000199515 text: zucchini, courgette,
score[940] = 5.71619e-05 text: spaghetti squash,
How do we know if our DPU matches the model? We can run the following command to check it out:
root@xilinx-zcu106-20221:~# xdputil query
{
    "DPU IP Spec":{
        "DPU Core Count":3,
        "IP version":"v4.0.0",
        "enable softmax":"True"
    },
    "VAI Version":{
        "libvart-runner.so":"Xilinx vart-runner Version: 2.5.0-c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8 2024-09-25-11:10:26 ",
        "libvitis_ai_library-dpu_task.so":"Xilinx vitis_ai_library dpu_task Version: 2.5.0-c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8 2022-06-15 07:33:00 [UTC] ",
        "libxir.so":"Xilinx xir Version: xir-c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8 2024-09-25-11:02:47",
        "target_factory":"target-factory.2.5.0 c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8"
    },
    "kernels":[
        {
            "DPU Arch":"DPUCZDX8G_ISA1_B1024",
            "DPU Frequency (MHz)":325,
            "cu_idx":0,
            "fingerprint":"0x101000016010402",
            "is_vivado_flow":true,
            "name":"DPU Core 0"
        },
        {
            "DPU Arch":"DPUCZDX8G_ISA1_B1024",
            "DPU Frequency (MHz)":325,
            "cu_idx":1,
            "fingerprint":"0x101000016010402",
            "is_vivado_flow":true,
            "name":"DPU Core 1"
        },
        {
            "DPU Arch":"DPUCZDX8G_ISA1_B1024",
            "DPU Frequency (MHz)":325,
            "cu_idx":2,
            "fingerprint":"0x101000016010402",
            "is_vivado_flow":true,
            "name":"DPU Core 2"
        }
    ]
}
Clearly there are 3 DPU cores, each with "fingerprint":"0x101000016010402", which matches our arch.json file.
Linux Driver
The DPU IP is a fairly complex core, and it’s impossible to cover its driver functionality in the limited space here. You should really be using the Vitis-AI software stack to interact with the DPU (via the Vitis-AI Runtime, VART). But the blog wouldn’t be complete without a brief look.
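For flavour, here’s a minimal sketch of what a VART client looks like (assumptions: built against the Vitis-AI 2.5 SDK/sysroot and linked with -lvart-runner -lxir, the same libraries we saw in the xdputil output; error handling and the TensorBuffer plumbing are omitted):

#include <iostream>
#include <memory>
#include <string>
#include <vart/runner.h>
#include <xir/graph/graph.hpp>
#include <xir/tensor/tensor.hpp>

int main(int argc, char* argv[]) {
    // Load the compiled model and locate the DPU subgraph.
    auto graph = xir::Graph::deserialize(argv[1]); // e.g. resnet50.xmodel
    const xir::Subgraph* dpu = nullptr;
    for (auto* s : graph->get_root_subgraph()->children_topological_sort()) {
        if (s->has_attr("device") && s->get_attr<std::string>("device") == "DPU") {
            dpu = s;
            break;
        }
    }
    if (!dpu) { std::cerr << "no DPU subgraph found\n"; return 1; }

    // Create a runner; this is the point where the stack ultimately
    // reaches the kernel driver underneath.
    auto runner = vart::Runner::create_runner(dpu, "run");
    for (auto* t : runner->get_input_tensors())
        std::cout << "input tensor: " << t->get_name() << "\n";

    // Actual inference: fill input TensorBuffers with preprocessed image
    // data, then:
    //     auto job = runner->execute_async(inputs, outputs);
    //     runner->wait(job.first, -1);
    return 0;
}

Under the hood, these calls funnel down to the kernel driver: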
The driver is located at drivers/misc/xlnx_dpu.c in the kernel source tree (v5.15, I believe) that Petalinux v22.1 pulls. As expected, it’s a platform driver (a misc device exposed at /dev/dpu):
static struct platform_driver xlnx_dpu_drv = {
    .probe = xlnx_dpu_probe,
    .remove = xlnx_dpu_remove,
    .driver = {
        .name = DRV_NAME,
        .of_match_table = dpu_of_match,
    },
};
that registers a bunch of ioctl commands as part of its file operations:
static const struct file_operations dev_fops = {
    .owner = THIS_MODULE,
    .unlocked_ioctl = xlnx_dpu_ioctl,
    .mmap = xlnx_dpu_mmap,
};
Beyond that, it’s IP-specific ioctl commands and file interactions.
That’s it, folks! Now that we have a working DPU core on the PL, the next step would be to create some useful applications, e.g., computer vision (face detection, object detection). We could have video streaming in from a webcam, passing through the DPU accelerator running CV functions on the stream in real time, with the output displayed on a monitor. Let’s attempt something like that in the next part. See ya!
Other parts in this series:
My previous related series: