Accelerated Computing Series Part 5: FPGA over PCIe & Linux
Neeraj Kumar, PhD
Founder Director - Accordyne | ex-CTO@iRFMW, Huawei, BertLabs | IISc
Over the past few years, edge computing has gained tremendous interest for applications involving mobile platforms such as drones, both for warfare and civilian use. While these applications have always had stringent SWaP-C requirements, higher on-the-edge compute has become equally important. Computer-vision-based inference, detection, and tracking, along with the integration of several other sensors such as radar, lidar, and multispectral sensor fusion, have necessitated more powerful and efficient compute platforms at the edge. In this tech blog, we embark on a journey to bring together the enabling hardware and software towards that goal.
This blog describes interfacing an FPGA with a GPU-based single board computer (SBC). While the roughly 40 TOPS of the Nvidia Jetson Orin Nano can handle most CV-based tasks, the FPGA provides a low-latency data pre/post-processing path, a high-speed ADC data path for RF transceivers, and hardware DSP acceleration for software defined radio (SDR) and radar.
The FPGA interfaces with the SBC via PCIe, which stands for Peripheral Component Interconnect Express. Note that PCI (now mostly deprecated) is a parallel interface while PCIe is serial.
A typical PCIe architecture looks as follows:
There are three kinds of devices: Root Complex (RC), Bridge (BR), and Endpoint (EP). An RC only has a downstream port, an EP only has an upstream port, while a BR has both upstream and downstream ports.
At system boot-up, the RC probes bus 0 for any connected devices; this is called bus enumeration. The RC always has the identifier 0:0.0, where the first field is the bus number, the second the device number, and the third the function number.
lspci is a utility command (it comes with your Linux distro) that lists an overview of all the PCI/PCIe devices on your system. It’s part of the pciutils project. You can check out the source here. Following is the lspci output on my system:
$ lspci
00:00.0 Host bridge: Intel Corporation Device a700 (rev 01)
00:01.0 PCI bridge: Intel Corporation Device a70d (rev 01)
00:02.0 VGA compatible controller: Intel Corporation Device a780 (rev 04)
00:06.0 PCI bridge: Intel Corporation Device a74d (rev 01)
....
00:1f.3 Audio device: Intel Corporation Device 7a50 (rev 11)
00:1f.4 SMBus: Intel Corporation Device 7a23 (rev 11)
00:1f.5 Serial bus controller: Intel Corporation Device 7a24 (rev 11)
01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [RTX A2000] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 228e (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
04:00.0 Non-Volatile memory controller: Sandisk Corp Device 5017 (rev 01)
05:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8821CE 802.11ac PCIe Wireless Network Adapter
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
Take for example: 01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [RTX A2000] (rev a1). On this system, the graphics card is connected to bus 1, has device number 0, and function 0 for the graphics functionality. The same device also provides audio functionality (via DisplayPort), which has the identifier 01:00.1.
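The same identifiers show up under sysfs, one directory per function. Here’s a minimal Python sketch (just an illustration, not part of the reference design) that lists them and splits each into its fields; the leading 0000 is the PCI domain, which lspci omits by default:
import os

# Each entry under /sys/bus/pci/devices is named domain:bus:device.function,
# e.g. 0000:01:00.0 and 0000:01:00.1 for the graphics card above.
for bdf in sorted(os.listdir("/sys/bus/pci/devices")):
    domain, bus, devfn = bdf.split(":")
    device, function = devfn.split(".")
    print(f"bus {bus}, device {device}, function {function}  ({bdf})")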
Thus, the RC enumerates all the EPs on its bus, and recursively enumerates other buses and the EPs attached to them via the bridges. The tree structure illustrated above can also be visualized at the command line:
$ lspci -tv
-[0000:00]-+-00.0 Intel Corporation Device a700
+-01.0-[01]--+-00.0 NVIDIA Corporation GA106 [RTX A2000]
| \-00.1 NVIDIA Corporation Device 228e
+-02.0 Intel Corporation Device a780
+-06.0-[02]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller...
+-14.0 Intel Corporation Device 7a60
...
+-19.1 Intel Corporation Device 7a7d
+-1b.0-[03]--
+-1b.4-[04]----00.0 Sandisk Corp Device 5017
+-1c.0-[05]----00.0 Realtek Semiconductor Co., Ltd. RTL8821CE 802.11ac PCIe..
+-1c.2-[06]----00.0 Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
+-1f.0 Intel Corporation Device 7a04
+-1f.3 Intel Corporation Device 7a50
+-1f.4 Intel Corporation Device 7a23
\-1f.5 Intel Corporation Device 7a24
We can see that, for this system, there are six BRs attached to bus 0 that connect external peripherals like the dGPU, WLAN, and GbE controllers; the rest are all internal peripheral endpoints belonging to the Intel CPU.
PCIe Address Spaces
There are two kinds of address spaces available for PCIe devices: the configuration space, and the memory/IO space that the device exposes to the host through its BARs (more on BARs below).
Configuration space for PCI devices is 256 bytes, and for PCIe devices it’s 4KB. In either case the first 64 bytes constitute the PCI header. Following is the layout of the configuration space:
Part of this configuration space is the Header. Following depicts the header structure:
The first 20 bytes are the same for both EPs and BRs. The Header Type field is 0 for an EP and 1 for a BR or RC. For completeness, following is the Type 1 header:
For our purposes, we’ll be dealing with the Type 0 EP header.
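The kernel also exposes each function’s configuration space as a binary file in sysfs, so the header fields are easy to poke at directly. A minimal sketch (the BDF below is the graphics card from the lspci output; substitute your own device):
import struct

BDF = "0000:01:00.0"  # example device -- substitute your own

with open(f"/sys/bus/pci/devices/{BDF}/config", "rb") as f:
    cfg = f.read(64)  # the 64-byte PCI header is readable without root

vendor_id, device_id = struct.unpack_from("<HH", cfg, 0x00)
header_type = cfg[0x0E] & 0x7F  # bit 7 only flags a multi-function device
print(f"vendor 0x{vendor_id:04x}, device 0x{device_id:04x}, "
      f"header type {header_type} ({'Type 0: EP' if header_type == 0 else 'Type 1: BR/RC'})")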
Device Capabilities
At location 0x34 in the header, you’ll find the capabilities pointer. This points to the start address, within the configuration space, of a linked list of device capability structures; each structure points to the next, until a null pointer is reached. Since the PCIe configuration space is 4 KB long, devices can expose several capabilities. The list looks as follows:
Let’s print out the capabilities of the graphics card:
$ sudo lspci -s 01:0.0 -v
01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [RTX A2000] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 151d
Flags: bus master, fast devsel, latency 0, IRQ 199
Memory at 42000000 (32-bit, non-prefetchable) [size=16M]
Memory at 60000000 (64-bit, prefetchable) [size=256M]
Memory at 70000000 (64-bit, prefetchable) [size=32M]
I/O ports at 5000 [size=128]
Expansion ROM at 43000000 [virtual] [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
You can find a full list of capability IDs in this document. We’ll see the capabilities of our FPGA EP in just a bit.
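Under the hood, lspci is simply walking the linked list described above. Here’s a minimal sketch that does the same through the sysfs config file (run it as root, since only the first 64 bytes are readable otherwise; the BDF is again the graphics card):
BDF = "0000:01:00.0"  # example device

with open(f"/sys/bus/pci/devices/{BDF}/config", "rb") as f:
    cfg = f.read()

cap_off = cfg[0x34]                # capabilities pointer in the header
while cap_off:
    cap_id = cfg[cap_off]          # e.g. 0x01 = Power Management, 0x05 = MSI, 0x10 = PCI Express
    next_off = cfg[cap_off + 1]    # offset of the next structure, 0x00 terminates the list
    print(f"capability at offset 0x{cap_off:02x}: ID 0x{cap_id:02x}")
    cap_off = next_off
Note that this only walks the classic capabilities in the first 256 bytes; the extended capabilities you see above at offsets 0x100 and beyond (AER, Secondary PCI Express, etc.) form a separate list with 16-bit IDs that starts at offset 0x100.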
Base Address Registers (BARs)
BARs are the link between the IO/memory space and the configuration space. BARs serve two purposes: (1) during enumeration they advertise how much memory/IO space the device needs, via the size-probing mechanism we’ll walk through below; and (2) after enumeration they hold the base address the host has assigned, which is where the device’s memory/IO region gets mapped.
Following is the structure of a BAR. Note that a 64-bit address would take up two successive BARs.
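To make those bits concrete, here’s a sketch that reads BAR0 from the config space and decodes the type bits in its lower nibble (again an example; point it at any BDF):
import struct

BDF = "0000:01:00.0"  # example device

with open(f"/sys/bus/pci/devices/{BDF}/config", "rb") as f:
    cfg = f.read(64)

bar0 = struct.unpack_from("<I", cfg, 0x10)[0]      # BAR0 sits at offset 0x10
if bar0 & 0x1:                                     # bit 0: 1 = I/O space
    print(f"I/O BAR at 0x{bar0 & ~0x3:x}")
else:
    wide = ((bar0 >> 1) & 0x3) == 0x2              # bits [2:1]: 00 = 32-bit, 10 = 64-bit
    pref = bool(bar0 & 0x8)                        # bit 3: prefetchable
    base = bar0 & ~0xF                             # for a 64-bit BAR, the next BAR holds the upper 32 bits
    print(f"memory BAR, 64-bit={wide}, prefetchable={pref}, base 0x{base:x}")
Pointed at the graphics card above, this reports a 32-bit, non-prefetchable memory BAR at 0x42000000, matching the first Memory line in its lspci output.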
We now have a basic understanding of how to analyze the properties of a device connected over PCIe. Let’s attempt to connect the FPGA board to the host SBC.
Interfacing an Artix-7 FPGA M.2 card with the Nvidia Jetson Orin Nano
Following is my setup: an AMD Xilinx Artix-7 FPGA card in M.2 form factor connected to the Jetson over the M.2 NVMe slot.
And here’s the reference design:
The AMD Xilinx ‘DMA/Bridge Subsystem for PCI Express’ IP is highlighted. Let’s check out some of its configuration. The PCIe ID tab in the configuration window shows the default parameters for this reference design.
Many of these user-configurable values, along with those internal to the IP, are part of the EP header, so let’s just print that out:
We can see that, apart from the Vendor and Device IDs picked up by lspci, the device has four capabilities, among which the PCI Express capability is of interest. The link capabilities (LnkCap) show a 5 GT/s per-lane speed and an x4-capable link, which is correct because the device has a hard PCIe Gen2 x4 controller on board. Also, the current link status (LnkSta) shows that the full capability of the device is available. If I connect this card to the M.2 2230 slot on the Jetson, it drops to x2 width.
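These link parameters are also exposed through sysfs, which is handy for a quick scripted check after moving the card between slots (the BDF is a placeholder; use whatever your FPGA card enumerates as):
BDF = "0000:01:00.0"  # placeholder -- use the FPGA card's BDF on your system

def link_attr(name: str) -> str:
    with open(f"/sys/bus/pci/devices/{BDF}/{name}") as f:
        return f.read().strip()

# max_* correspond to LnkCap, current_* to LnkSta
print("capable:", link_attr("max_link_speed"), "x" + link_attr("max_link_width"))
print("running:", link_attr("current_link_speed"), "x" + link_attr("current_link_width"))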
Now let’s check the BARs. lspci reports two memory regions, 64-bit and prefetchable, with their address mappings. Going back to the xdma IP config, let’s check the BAR specification tab:
So how does the host CPU arrive at the BAR size from the value specified in this tab? We need to refer back to the BAR bits illustrated earlier. During enumeration, the CPU writes all 1s to the BAR and reads the value back; the size-determining bits come back as 0, while the lower nibble carries the type bits that configure the PCIe to AXI-Lite Master BAR as a 64-bit, prefetchable memory region. The CPU then clears this nibble, leaving:
0xFFFFFFFFFFFE0000
Inverting this (bitwise NOT) gives the address range 0x1FFFF; adding 1 then gives the size 0x20000, which is 128 KB.
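The same arithmetic in a few lines of Python, just to sanity-check the numbers:
def bar_size(readback: int, width: int = 64) -> int:
    """Region size implied by the value read back after writing all 1s to a BAR."""
    mask = (1 << width) - 1
    base = readback & ~0xF & mask    # clear the lower nibble (type bits)
    return ((~base) & mask) + 1      # invert, then add 1

print(hex(bar_size(0xFFFFFFFFFFFE0000)))   # -> 0x20000, i.e. 128 KB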
The various address offsets within these two interfaces from Vivado are as follows:
Now all you need is the xdma Linux driver provided by Xilinx to interface with the xdma IP appearing as an EP. The driver creates two char devices for DMA operations on the EP: host-to-card (h2c) and card-to-host (c2h). Any user-space reads/writes to these devices are translated into transfers to/from the DDR memory by the driver and the IP. It also creates an xdma_user char device that manages the AXI-Lite Master interface; any reads/writes to this end up touching the registers. This is useful if you want to control some LEDs through GPIOs, or read/write XADC temperature data.
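As a rough illustration, a DDR loopback through those char devices looks like this. The device names follow the xdma driver’s /dev/xdma0_* naming convention; the DDR offset of 0 is an assumption based on this reference design’s address map, so adjust it to whatever your Vivado address editor shows:
import os

DDR_OFFSET = 0x0            # assumed DDR base within the h2c/c2h address map
payload = os.urandom(4096)  # 4 KB of random test data

# Host to card: write the buffer into the FPGA's DDR
fd = os.open("/dev/xdma0_h2c_0", os.O_WRONLY)
os.pwrite(fd, payload, DDR_OFFSET)
os.close(fd)

# Card to host: read it back from the same offset
fd = os.open("/dev/xdma0_c2h_0", os.O_RDONLY)
readback = os.pread(fd, len(payload), DDR_OFFSET)
os.close(fd)

assert readback == payload, "DDR loopback mismatch"
print("DDR loopback OK")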
Here’s one benchmark I ran, writing 256 MiB of data into the DDR and reading it back:
$ sudo python test-ddr.py -s 256
Sent 256 MiB in 0.260s (1031.686MBPS)
Received 256 MiB in 0.402s (668.222MBPS)
DDR test passed
It’s not currently clear to me why the read bandwidth was consistently lower (roughly 65% of the write bandwidth). I also ran the test with the FPGA card connected to the M.2 slot with x2 lanes:
$ sudo python test-ddr.py
Sent 256 MiB in 0.387s (693.205MBPS)
Received 256 MiB in 0.481s (558.001MBPS)
DDR test passed
This time both are almost the same. I’ll see if this issue persists for other boards too.
That’s it for today. I’ve not gone into the details of the xdma driver as it’s a complex piece of code involving DMA intricacies. Moreover, I didn’t want to digress from the agenda of this blog: PCIe. We’ll come back to it in the future when we implement kernels on the FPGA accelerator, augmenting the capabilities of the host it is attached to. There are several edge computing applications that need low latency and benefit from the hardware parallelism this architecture provides: radar, camera preprocessing, video transcoding, DSP, etc. We’ll touch on some of these in the future.
See ya!
Other parts in this series:
My previous related series: