Accelerated Computing Series Part 3: Deep Learning Accelerator on FPGA & Linux Drivers
Neeraj Kumar, PhD
Founder Director - Accordyne | ex-CTO@iRFMW, Huawei, BertLabs | IISc
Another weekend, another tech blog! This time we’re going to delve deeper into one of FPGAs’ key strengths, hardware flexibility, by implementing a deep learning accelerator core in the fabric. As you know, GPUs have fixed hardware. The software (e.g., backed by CUDA libraries & the API for Nvidia) needs to be mapped to this hardware architecture to eke out the maximum hardware utilization efficiency and thus performance per watt. The same applies to FPGAs & the associated software stack, only now you have complete control over the hardware architecture too. In this part, we configure and implement AMD’s Deep Learning Processing Unit (DPU) IP core and run a few hands-on inference tasks on popular models such as ResNet.
Like many other vendors, AMD/Xilinx provides a range of reconfigurable hardware aimed at different market segments with different needs. For example, accelerator cards meant for datacentres are tuned towards higher throughput, serving multiple workloads. Latency is crucial there, so they use technologies like high bandwidth memory (HBM), consequently driving the cost up. Meanwhile, for edge computing scenarios, we need better efficiency and a lower power draw. AMD therefore provides different deep learning accelerator IP cores suited to these hardware architectures. The one we’re going to use goes by the name DPUCZDX8G. The name decodes as: DPU – C (convolution) – ZD (Zynq DDR) – X8 (model quantization from 32-bit to 8-bit via the DECENT tool) – G (design target: general purpose; the others being high throughput, low latency, and cost optimized)
There used to be a similar offering, DNNDK, several years back (I understand that DeePhi, the company behind it, was acquired by AMD/Xilinx). I chose to go with the DPUCZDX8G core for two reasons:
Note that DPUCZDX8G currently supports ZCU102, ZCU104, and Kria KV260 (good value proposition) out of the box, with SD card images available for the reference design. However, in my case, I’ll need to port the reference design to my board. This is possible because this DPU supports all Zynq UltraScale+ MPSoC devices.
Since this series is more about hands-on accelerated computing with Linux, I encourage you to look up the product guide (PG338) for details on this IP. I’ll just briefly go over it here, taking excerpts from that doc.
The DPUCZDX8G IP is a programmable microcoded compute engine for convolutional neural networks (CNNs). The top-level architecture is illustrated below:
It uses a specialized instruction set that allows for efficient implementation of several CNNs, such as VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, etc. The compiled microcode it executes is generated from a NN graph by the Vitis-AI compiler, as will be demonstrated in this blog. The IP fetches instructions from off-chip memory (DDR RAM), while the on-chip memory (Global Memory Pool) is used to buffer input activations, intermediate feature maps, and some metadata for high throughput & efficiency. The Application Processing Unit (APU) runs the program that services interrupts and coordinates data transfers. The processing elements (PEs) do the actual computational grunt work using multipliers, adders, and accumulators built from the DSP48E slices available in the PL, while BRAM (or UltraRAM) is used to store the weights of the network.
Quite obviously, the optimizations that take a big floating-point NN model down to the final network deployed on the core have to be performed beforehand, in the Vitis-AI software stack. This includes quantization, layer fusion, and other graph-level optimizations; we’ll run this flow concretely later in the ‘Build Matching Model’ section.
Generating HDL
Since the DPU IP is not part of the standard IP repositories, we’ll need to build and integrate it into a Vivado project from scratch. Because I will be using v2.5 of the Vitis-AI tools, download ‘DPUCZDX8G_v2.5.tar.gz’ from the link provided in the DPU TRD document for ZCU102, and extract it. Follow the steps below:
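With the archive extracted, the TRD’s Vivado project is generated from a Tcl script. Roughly (directory names are from my memory of the TRD layout, so adjust to your extraction path):

$ tar xzvf DPUCZDX8G_v2.5.tar.gz
$ cd DPUCZDX8G/prj/Vivado
$ vivado -source scripts/trd_prj.tcl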
You should see the project block design as shown below:
The DPU is wrapped inside a hierarchical block. As you can see, this block has no AXI-Lite or AXI-Stream interfaces. This brings us to the third design element in our ongoing structured analysis of accelerated computing components (control & data planes): the memory-mapped, full AXI interface (AXI-MM). Since it is memory mapped, the IP interfaces with the DDR via the PS, through the M_AXI_HPM0_LPD, S_AXI_HP*_FPD, and S_AXI_LPD ports.
Expanding further:
we see that the other hierarchical blocks take care of clock generation, interrupt vectoring, and aggregating the interconnects. The main block is the DPUCZDX8G DPU. Let’s check out its default configuration:
You’ll see that it’s configured for 1 DPU core with the B4096 architecture. Changing these settings directly here will break the block design; such changes should be made in the script, and the script rerun. Other parameters can be changed here though, such as UltraRAM Use per DPU, which I changed from its default of 0 to 50. This keeps the design from being implemented solely with BRAMs and eventually running out of them.
The choice of architecture decides the number of DSP slices you’ll end up using, but also the level of parallelism in the convolution operations. There’s a trade-off; you’ll need to try and see. You may barely fit the design and yet fail to meet timing. The default value of 3 for the number of DPU cores certainly did not fit for me with the B4096 architecture, so I changed it to 1. Refer to the DPU product guide for more details.
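As an aside, the Bxxxx figure is simply the peak number of ops per clock: pixel parallelism × input-channel parallelism × output-channel parallelism × 2 (a multiply and an accumulate count as two ops). Per the product guide’s parallelism table:

B4096: 8 × 16 × 16 × 2 = 4096 ops/cycle
B1024: 8 × 8 × 8 × 2 = 1024 ops/cycle (we’ll use this one later)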
Let’s look at the address map:
The DPU control register block is at 0x8F000000, accessed via a 32-bit-wide S_AXI interface. The details of the register map can be found in the product guide. DPU0_M_AXI_DATA* are the 128-bit-wide data ports, DPU0_M_AXI_INSTR is the 32-bit-wide MM interface for instruction fetch, and SFM_M_AXI is the 128-bit-wide MM interface for softmax data.
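Later, once Linux is up on the board, you can sanity-check that this control space is reachable by peeking at the base address with busybox devmem (register offsets within this space are in the product guide; this just reads the first 32-bit word):

root@xilinx-zcu106-20221:~# devmem 0x8F000000 32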
Generate the bitstream and export the hardware (.xsa file). Don’t forget to include the bitstream while exporting.
Following is the post-implementation resource utilization:
Building Linux
If you have been following this series, you should now be well acquainted with the Petalinux build flow. The only extra work this time is to refer to the ZCU102 BSP I mentioned earlier and make the necessary amendments for the ZCU106. Follow these steps:
$ petalinux-config --get-hw-description=<DPU_dir>/prj/Vivado/prj/<your_xsa_file>
(refer to <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/configs/config to make the necessary changes for ZCU106 in the config GUI that opens)
$ petalinux-config -c rootfs
(refer to <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/configs/rootfs_config for the necessary changes)
Copy and overwrite all subdirectories under <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/meta-user/ into your project. This applies the necessary driver patches to the kernel and installs all the Vitis-AI libraries and dependencies in the rootfs. It’ll also install the ‘resnet50’ application.
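From the Petalinux project root, that copy boils down to something like (path assuming the TRD layout above):

$ cp -rf <DPU_dir>/prj/Vivado/xilinx-zcu102-bsp/project-spec/meta-user/* project-spec/meta-user/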
Finally,
$ petalinux-build
$ cd images/linux
$ petalinux-package --boot --fsbl zynqmp_fsbl.elf --u-boot u-boot.elf --pmufw pmufw.elf --fpga system.bit --force
Burn the SD card and boot the board.
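If you’d rather write a single flashable image, Petalinux can also generate one (if your version supports the --wic option; /dev/sdX below is a placeholder, so double-check your card’s device node before dd’ing):

$ petalinux-package --wic
$ sudo dd if=images/linux/petalinux-sdimage.wic of=/dev/sdX bs=4M status=progress && sync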
Run Resnet50
Once booted, run the following commands:
root@xilinx-zcu106-20221:~# cd app/
root@xilinx-zcu106-20221:~/app# cp model/resnet50.xmodel .
root@xilinx-zcu106-20221:~/app# env LD_LIBRARY_PATH=samples/lib samples/bin/resnet50 img/bellpeppe-994958.JPEG
score[945] = 0.992235 text: bell pepper,
score[941] = 0.00315807 text: acorn squash,
score[943] = 0.00191546 text: cucumber, cuke,
score[939] = 0.000904801 text: zucchini, courgette,
score[949] = 0.00054879 text: strawberry,
root@xilinx-zcu106-20221:~/app#
What happened here is that we ran a precompiled application binary ‘samples/bin/resnet50’ that passed a sample image ‘img/bellpeppe-994958.JPEG’ through a precompiled CNN model ‘resnet50.xmodel’ (pretrained, quantized, optimized, and deployed on the DPU accelerator core).
But when did we compile the model? We didn’t; we just used the one that comes by default with the reference design, and there are many other models. You may want to check out the Vitis-AI GitHub page and look for ‘model_zoo’. You’ll see several models, each with a corresponding model.yaml file listing download links to precompiled models for different hardware platforms.
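For instance, the ResNet-50 entry for ZCU102/ZCU104/KV260 points at an archive shaped roughly like the following (take the exact URL from model.yaml; the filename here is from memory and may differ):

$ wget "https://www.xilinx.com/bin/public/openDownload?filename=resnet50-zcu102_zcu104_kv260-r2.5.0.tar.gz" -O resnet50-r2.5.0.tar.gz
$ tar xzf resnet50-r2.5.0.tar.gz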
The trouble is these are compiled for the B4096 architecture. If you try to change your DPU architecture in Vivado and try to use these models against that IP, you’ll get a fingerprint mismatch error.
So, how do we build a custom DPU with a different architecture, and then quantize, compile, optimize & deploy a matching model on it? Let’s get to it.
Build Custom IP
Let’s change the DPU architecture to B1024 and place 3 DPU cores in the IP. Please make sure to run the script and build a fresh Vivado project for this configuration.
Notice that the number of data ports (channels) has now increased to cater to the larger number of DPU cores, and now the S_AXI_* ports on the PS end up being used.
Build the bitstream and export the hardware.
The resource utilization and power draw are as follows:
This design seems on par with the previous design, except URAM consumption has gone up and BRAM has gone down.
Build Matching Model
This procedure is a little too involved to cover fully here; please refer to the Vitis-AI User Guide. Essentially, we need to quantize the trained float model against a small calibration set and then compile it against our DPU’s arch.json, as sketched below:
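Here is a rough sketch of that flow for a TensorFlow ResNet-50, run inside the Vitis-AI docker container (node names and paths are illustrative; arch.json is the one generated alongside the DPU IP in your Vivado project, and it encodes the fingerprint):

$ vai_q_tensorflow quantize \
      --input_frozen_graph float/resnet50.pb \
      --input_nodes input --input_shapes ?,224,224,3 \
      --output_nodes resnet_v1_50/predictions/Reshape_1 \
      --input_fn input_fn.calib_input --calib_iter 100
$ vai_c_tensorflow \
      --frozen_pb quantize_results/quantize_eval_model.pb \
      --arch arch.json --output_dir compiled --net_name resnet50

Copy compiled/resnet50.xmodel over to the board and point the sample app at it.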
Run Resnet50 Again
root@xilinx-zcu106-20221:~/app# env LD_LIBRARY_PATH=samples/lib samples/bin/resnet50 img/bellpeppe-994958.JPEG
score[945] = 0.980568 text: bell pepper,
score[941] = 0.0179597 text: acorn squash,
score[943] = 0.000894162 text: cucumber, cuke,
score[939] = 0.000199515 text: zucchini, courgette,
score[940] = 5.71619e-05 text: spaghetti squash,
How do we know if our DPU matches the model? We can run the following command to check it out:
root@xilinx-zcu106-20221:~# xdputil query
{
    "DPU IP Spec":{
        "DPU Core Count":3,
        "IP version":"v4.0.0",
        "enable softmax":"True"
    },
    "VAI Version":{
        "libvart-runner.so":"Xilinx vart-runner Version: 2.5.0-c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8 2024-09-25-11:10:26 ",
        "libvitis_ai_library-dpu_task.so":"Xilinx vitis_ai_library dpu_task Version: 2.5.0-c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8 2022-06-15 07:33:00 [UTC] ",
        "libxir.so":"Xilinx xir Version: xir-c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8 2024-09-25-11:02:47",
        "target_factory":"target-factory.2.5.0 c26eae36f034d5a2f9b2a7bfe816b8c43311a4f8"
    },
    "kernels":[
        {
            "DPU Arch":"DPUCZDX8G_ISA1_B1024",
            "DPU Frequency (MHz)":325,
            "cu_idx":0,
            "fingerprint":"0x101000016010402",
            "is_vivado_flow":true,
            "name":"DPU Core 0"
        },
        {
            "DPU Arch":"DPUCZDX8G_ISA1_B1024",
            "DPU Frequency (MHz)":325,
            "cu_idx":1,
            "fingerprint":"0x101000016010402",
            "is_vivado_flow":true,
            "name":"DPU Core 1"
        },
        {
            "DPU Arch":"DPUCZDX8G_ISA1_B1024",
            "DPU Frequency (MHz)":325,
            "cu_idx":2,
            "fingerprint":"0x101000016010402",
            "is_vivado_flow":true,
            "name":"DPU Core 2"
        }
    ]
}
Clearly there are 3 DPU cores, each with "fingerprint":"0x101000016010402", which matches our arch.json file.
Linux Driver
The DPU IP is a fairly complex core, and it’s impossible to cover its driver functionality in the limited space here. You should really be using the Vitis-AI software stack to interact with the DPU (via the Vitis-AI Runtime, VART). But the blog wouldn’t be complete without a brief look.
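For flavour, here’s a minimal sketch of what a VART client looks like (assumptions: built against the Vitis-AI 2.5 SDK/sysroot and linked with -lvart-runner -lxir, the same libraries we saw in the xdputil output; error handling and the TensorBuffer plumbing are omitted):

#include <iostream>
#include <memory>
#include <string>
#include <vart/runner.h>
#include <xir/graph/graph.hpp>
#include <xir/tensor/tensor.hpp>

int main(int argc, char* argv[]) {
    // Load the compiled model and locate the DPU subgraph.
    auto graph = xir::Graph::deserialize(argv[1]); // e.g. resnet50.xmodel
    const xir::Subgraph* dpu = nullptr;
    for (auto* s : graph->get_root_subgraph()->children_topological_sort()) {
        if (s->has_attr("device") && s->get_attr<std::string>("device") == "DPU") {
            dpu = s;
            break;
        }
    }
    if (!dpu) { std::cerr << "no DPU subgraph found\n"; return 1; }

    // Create a runner; this is the point where the stack ultimately
    // reaches the kernel driver underneath.
    auto runner = vart::Runner::create_runner(dpu, "run");
    for (auto* t : runner->get_input_tensors())
        std::cout << "input tensor: " << t->get_name() << "\n";

    // Actual inference: fill input TensorBuffers with preprocessed image
    // data, then:
    //     auto job = runner->execute_async(inputs, outputs);
    //     runner->wait(job.first, -1);
    return 0;
}

Under the hood, these calls funnel down to the kernel driver: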
The driver is located at drivers/misc/xlnx_dpu.c in the kernel source tree (v5.15, I believe) that Petalinux v22.1 pulls. As expected, it’s a platform driver (a misc device exposed at /dev/dpu):
static struct platform_driver xlnx_dpu_drv = {
    .probe = xlnx_dpu_probe,
    .remove = xlnx_dpu_remove,
    .driver = {
        .name = DRV_NAME,
        .of_match_table = dpu_of_match,
    },
};
that registers a bunch of ioctl commands as part of its file operations:
static const struct file_operations dev_fops = {
    .owner = THIS_MODULE,
    .unlocked_ioctl = xlnx_dpu_ioctl,
    .mmap = xlnx_dpu_mmap,
};
Beyond that, it’s IP-specific ioctl commands and file interactions.
That’s it, folks! Now that we have a working DPU core on the PL, the next step would be to create some useful applications, e.g., computer vision (face detection, object detection). We could have video streaming in from a webcam, passing through the DPU accelerator running CV functions on the stream in real time, with the output displayed on a monitor. Let’s attempt something like that in the next part. See ya!
Other parts in this series:
My previous related series: