Accelerated Computing Series Part 2: Streaming Dataplane & Linux Drivers
Neeraj Kumar, PhD
AI Scientist | ex-CTO@iRFMW, Huawei, BertLabs | IISc | Reconf. Edge Compute (FPGAs/GPUs), RTOS/Linux | Sensor Fusion | Radar, SDR | Cognitive Systems
In this part of the series we explore the dataplane. In Part 1, we used an AXI-Lite interface (for the control plane) to wrap a simple counter IP whose MSBs drove a bunch of LEDs. There was no data going in or out of the IP, just control signals that set or reset the counter. However, several applications require a high-speed and/or high-throughput dataplane that may need monitoring and configuring (once, or on the fly). Such applications may employ a processing IP that consumes a datastream, say from an analog-to-digital converter (ADC), and sends the processed data elsewhere, maybe to a digital-to-analog converter (DAC). Examples include multi-rate digital signal processing (DSP), model-based controls (even RL), communication modems, jamming & spoofing for electronic countermeasure (ECM) systems in electronic warfare (EW), radar receivers, computer vision algorithms, and the list goes on.
We start by designing and implementing a framework that is universally applicable to the kinds of applications mentioned above. As we saw in Part 0, AXI-Lite is a watered-down AXI protocol suitable for low speeds and single-word transactions per request, such as register reads/writes. This is good for the control plane. AXI-Stream, on the other hand, supports unidirectional data transfers between peripherals and/or IPs without any addressing (the only handshake is the simple TVALID/TREADY pair). This means we need some kind of transport/adapter if we want to move this streaming data to/from memory, which does require addressing. The solution is a direct memory access (DMA) controller.
We begin by taking a reference design from one of Xilinx's wiki pages. The reference design targets the Zynq MPSoC platform. Although I have tested it on a ZCU106 board (XCZU7EV, MPSoC family), for the sake of keeping this article more accessible I'll port that design to our much-beloved and much cheaper Pynq-Z2 (Zynq 7000 family) board.
Generating HDL
Assuming you know how to create a Vivado project for this board and add standard IPs, create a block design as follows:
This simple block design has only one key IP, the AXI DMA controller. The other paraphernalia here is just for plumbing it to the Zynq PS block. You'll need to double-click the PS, then find and enable the 'S AXI HP0' interface as well as PL-PS interrupts. No need to touch other settings, as they were set by the board preset when the project was created with the Pynq-Z2 board.
The data path that I have highlighted in orange is a loopback. It begins at the output port M_AXIS_MM2S, a master AXI-Stream interface that sends data from the memory-map (MM) side to a downstream slave streaming (S) interface (sorry for the colonial terminology; there is no alternative for addressing these interfaces in Xilinx docs yet). On the other end is S_AXIS_S2MM, a slave AXI-Stream interface that receives the streaming data and pushes it to the MM side. As we will see in future posts, if there's a data-processing IP (a DUT, or HIL if you will), this loopback is where you would insert it. This is what constitutes the data plane. The throughput that can be achieved over it is configured by the control plane, the AXI-Lite interface (S_AXI_LITE) of the AXI DMAC. The following depicts the default IP configuration for the DMAC:
Please refer to the AXI DMA product guide for detailed information on these settings. The Scatter Gather engine loads buffer descriptors (source & destination buffer addresses) from memory (even block RAM), offloading DMA management from the CPU; use it to improve DMA transfer performance significantly. The 'Width of Buffer Length Register' sets how many bits are used to encode a transfer's length, and therefore caps the largest single transfer the DMAC can describe. This is critical to achieving good throughput, as we'll see later. The 'Max Burst Size' lets you choose the maximum number of data beats per burst: a higher number lets the DMAC retain the data bus longer, increasing throughput, but reduces bandwidth for other AXI masters on the bus.
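A quick rule of thumb (per my reading of the AXI DMA product guide; verify against your IP version): a length register W bits wide can describe at most 2^W − 1 bytes per transfer.

max transfer = 2^W − 1 bytes
W = 14 (the IP default): 2^14 − 1 = 16,383 bytes (~16 KB)
W = 26 (the IP maximum): 2^26 − 1 = 67,108,863 bytes (~64 MB)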
Further configuration for the actual DMA operation and descriptors will come at runtime from software via AXI-Lite.
That's it. Since we don't have any external ports connected this time, we can go ahead and generate the bitstream, then export the hardware as an .xsa file, following the steps described in Part 1.
Building Linux
We follow the same procedure to build Petalinux as described in Part 1:
$ petalinux-create -t project --template zynq --name ptlnx23.1_pynqz2
$ petalinux-config --get-hw-description=</path/to/xsa_file>
$ petalinux-config -c rootfs
Before we build the project, we need to add the driver, a Linux kernel module (LKM), and a userspace application that interacts with the hardware via the LKM. You can get the driver and the app from the Xilinx wiki for the reference design.
Essentially, the reference design expects you to create a device-tree node as follows:
dma_proxy {
	compatible = "xlnx,dma_proxy";
	dmas = <&axi_dma_0 0 &axi_dma_0 1>;
	dma-names = "dma_proxy_tx", "dma_proxy_rx";
	/* dma-coherent; */
};
In the dmas property, each entry is the DMAC's phandle followed by a channel index: 0 selects the MM2S (tx) channel and 1 the S2MM (rx) channel. Since we are using a Zynq 7000 family SoC instead of the MPSoC, I am using the HP0 port of the PS instead of the HPC0 port of the MPSoC in the reference design. Note that the C in HPC0 marks hardware cache-coherency, hence the 'dma-coherent' property in the reference design's device-tree node. I simply comment it out, as I am not using this feature; if I had to, I would need to use the ACP port instead of the HP0 port on my PS.
But the question is how to add this device-tree node, the driver, and the application to our Petalinux build process. This is not described in the reference design; you're expected to know it already. But don't worry, fortunately for you I'm here to help.
To create an LKM use the following command:
$ petalinux-create -t modules --name dma-proxy --enable
This creates a template module 'dma-proxy' in the '<petalinux-project>/project-spec/meta-user/recipes-modules' directory. Simply overwrite the contents of dma-proxy.c in the 'files' directory with the one provided on the wiki. Also, copy dma-proxy.h to that directory. Then update the dma-proxy.bb recipe file to include the header file, as sketched below.
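The only edit the recipe needs is listing the header in SRC_URI so BitBake copies it into the build. Here is a sketch of the relevant part of dma-proxy.bb, assuming the template that petalinux-create generated (template contents vary slightly across Petalinux versions):

SRC_URI = "file://Makefile \
           file://dma-proxy.c \
           file://dma-proxy.h \
           file://COPYING \
          "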
Similarly, to create the userspace application, use the following command:
$ petalinux-create -t apps --template c --name dma-proxy-test --enable
And follow the same steps as the LKM template modification.
To include the device-tree node in the final device tree, create a 'system-user-1.dtsi' in the '<petalinux-project>/project-spec/meta-user/recipes-bsp/device-tree/files' directory. The contents of this file will be our node:
/ {
	dma_proxy {
		compatible = "xlnx,dma_proxy";
		dmas = <&axi_dma_0 0 &axi_dma_0 1>;
		dma-names = "dma_proxy_tx", "dma_proxy_rx";
	};
};
You'll also find a system-user.dtsi file in that directory; modify its contents as follows:
/include/ "system-conf.dtsi"
/include/ "system-user-1.dtsi"
/ {
};
Finally, to update the Petalinux build process, append our new file source to the device-tree.bbappend file:
SRC_URI += " file://system-user-1.dtsi"
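For context, the complete device-tree.bbappend would then look roughly like this; the first two lines are my recollection of the Petalinux template and may differ by version:

FILESEXTRAPATHS:prepend := "${THISDIR}/files:"

SRC_URI:append = " file://system-user.dtsi"
SRC_URI += " file://system-user-1.dtsi"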
Build Petalinux:
$ petalinux-build
Followed by generating the sdcard images:
$ petalinux-package --boot --fsbl --fpga --u-boot --force
You can use the 'dtc' compiler to regenerate a human-readable system.dts from the system.dtb in the images folder.
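For example, assuming the default images/linux output directory of a Petalinux build (-I and -O select the input and output formats):

$ dtc -I dtb -O dts -o system.dts images/linux/system.dtb

In the regenerated system.dts you should be able to find the DMAC device node and our wrapper device node: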
amba_pl {
	#address-cells = <0x01>;
	#size-cells = <0x01>;
	compatible = "simple-bus";
	ranges;
	phandle = <0x37>;

	dma@40400000 {
		#dma-cells = <0x01>;
		clock-names = "s_axi_lite_aclk\0m_axi_sg_aclk\0m_axi_mm2s_aclk\0m_axi_s2mm_aclk";
		clocks = <0x01 0x0f 0x01 0x0f 0x01 0x0f 0x01 0x0f>;
		compatible = "xlnx,axi-dma-7.1\0xlnx,axi-dma-1.00.a";
		interrupt-names = "mm2s_introut\0s2mm_introut";
		interrupt-parent = <0x04>;
		interrupts = <0x00 0x1d 0x04 0x00 0x1e 0x04>;
		reg = <0x40400000 0x10000>;
		xlnx,addrwidth = <0x20>;
		xlnx,include-sg;
		xlnx,sg-length-width = <0x1a>;
		phandle = <0x13>;

		dma-channel@40400000 {
			compatible = "xlnx,axi-dma-mm2s-channel";
			dma-channels = <0x01>;
			interrupts = <0x00 0x1d 0x04>;
			xlnx,datawidth = <0x80>;
			xlnx,device-id = <0x00>;
		};

		dma-channel@40400030 {
			compatible = "xlnx,axi-dma-s2mm-channel";
			dma-channels = <0x01>;
			interrupts = <0x00 0x1e 0x04>;
			xlnx,datawidth = <0x80>;
			xlnx,device-id = <0x00>;
		};
	};
};

dma_proxy {
	compatible = "xlnx,dma_proxy";
	dmas = <0x13 0x00 0x13 0x01>;
	dma-names = "dma_proxy_tx\0dma_proxy_rx";
};
Note that dma@40400000 gives the register map of the DMAC, as can be verified in Vivado's Address Editor tab:
And the wrapper node dma_proxy refers to it via its phandle (0x13) in the 'dmas' property, assigning the name dma_proxy_tx to the MM2S channel (index 0) and dma_proxy_rx to the S2MM channel (index 1).
Prepare and flash the sdcard as in Part 1 and boot.
Loading the driver
This should be familiar to you if you have seen my embedded Linux device-driver series.
root@ptlnx23:~# modprobe dma-proxy
dma_proxy: loading out-of-tree module taints kernel.
dma_proxy module initialized
Device Tree Channel Count: 2
Creating channel dma_proxy_tx
Allocating memory, virtual address: df100000 physical address: 1f100000
Creating channel dma_proxy_rx
Allocating memory, virtual address: df600000 physical address: 1f600000
Essentially, loading the driver matches it against the device-tree node's compatible string:
static const struct of_device_id dma_proxy_of_ids[] = {
	{ .compatible = "xlnx,dma_proxy", },
	{}
};
It matches the dma_proxy node, and the probe function is then triggered to set up the device.
In the probe function, the driver queries the device tree to find the number of channels:
/* Figure out how many channels there are from the device tree based
 * on the number of strings in the dma-names property
 */
lp->channel_count = device_property_read_string_array(&pdev->dev,
						      "dma-names", NULL, 0);
if (lp->channel_count <= 0)
	return 0;

printk("Device Tree Channel Count: %d\r\n", lp->channel_count);
It finds two, the tx & rx channels, following which it creates two char devices: '/dev/dma_proxy_tx' and '/dev/dma_proxy_rx'.
/* Create the channels in the proxy. The direction does not matter
 * as the DMA channel has it inside it and uses it, other than this will not work
 * for cyclic mode.
 */
for (i = 0; i < lp->channel_count; i++) {
	printk("Creating channel %s\r\n", lp->names[i]);

	rc = create_channel(pdev, &lp->channels[i], lp->names[i], DMA_MEM_TO_DEV);
	if (rc)
		return rc;
	total_count++;
}
...
/* Create a DMA channel by getting a DMA channel from the DMA Engine and then setting
* up the channel as a character device to allow user space control.
*/
static int create_channel(struct platform_device *pdev, struct dma_proxy_channel *pchannel_p, char *name, u32 direction)
Finally, the driver registers a set of file operations, including ioctl commands, that the userspace app uses to interact with the char devices.
static struct file_operations dm_fops = {
	.owner = THIS_MODULE,
	.open = local_open,
	.release = release,
	.unlocked_ioctl = ioctl,
	.mmap = mmap
};
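To make the ioctl flow concrete, here is a minimal single-buffer TX sketch against /dev/dma_proxy_tx. It assumes the channel_buffer struct and the START_XFER/FINISH_XFER ioctl numbers from the wiki's dma-proxy.h; names, buffer counts, and sizes vary between versions of the example, so treat this as an illustration rather than drop-in code.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include "dma-proxy.h"   /* channel_buffer, START_XFER, FINISH_XFER */

int main(void)
{
	int buffer_id = 0;   /* index into the driver's buffer pool */

	int fd = open("/dev/dma_proxy_tx", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	/* Map the DMA buffer the driver allocated for this channel */
	struct channel_buffer *buf = mmap(NULL, sizeof(struct channel_buffer),
					  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) { perror("mmap"); return 1; }

	buf[buffer_id].length = 4096;                /* bytes to transfer */
	memset(buf[buffer_id].buffer, 0xA5, 4096);   /* test pattern */

	ioctl(fd, START_XFER, &buffer_id);           /* queue the transfer */
	ioctl(fd, FINISH_XFER, &buffer_id);          /* block until it completes */

	if (buf[buffer_id].status != PROXY_NO_ERROR)
		fprintf(stderr, "DMA transfer failed\n");

	munmap(buf, sizeof(struct channel_buffer));
	close(fd);
	return 0;
}

The bundled dma-proxy-test app does essentially this, but keeps several buffers in flight across the tx and rx threads discussed next.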
The userspace app interacts with these devices in a multi-threaded fashion. The application sets up the rx & tx threads (tx at lower priority than rx):
void setup_threads(int *num_transfers)
{
	...
	/* Set the transmit priority to the lowest */
	param.sched_priority = newprio;
	pthread_attr_setschedparam(&tattr_tx, &param);

	for (i = 0; i < RX_CHANNEL_COUNT; i++)
		pthread_create(&rx_channels[i].tid, NULL, rx_thread, (void *)&rx_channels[i]);

	for (i = 0; i < TX_CHANNEL_COUNT; i++)
		pthread_create(&tx_channels[i].tid, &tattr_tx, tx_thread, (void *)&tx_channels[i]);
}
initiates transfers, and captures performance stats:
main()
{
	...
	start_time = get_posix_clock_time_usec();
	setup_threads(&num_transfers);

	/* Do the minimum to know the transfers are done before getting the time for performance */
	for (i = 0; i < RX_CHANNEL_COUNT; i++)
		pthread_join(rx_channels[i].tid, NULL);

	/* Grab the end time and calculate the performance */
	end_time = get_posix_clock_time_usec();
	time_diff = end_time - start_time;
	...
}
Let’s run the application a few times with different settings:
root@ptlnx23:~# dma-proxy-test
DMA proxy test
Usage: dma-proxy-test <# of DMA transfers to perform> <# of bytes in each transfer in KB (< 1MB)> <optional verify, 0 or 1>
root@ptlnx23:~# dma-proxy-test 100 4 1
DMA proxy test
Verify = 1
Time: 5425 microseconds
Transfer size: 400 KB
Throughput: 75 MB / sec
DMA proxy test complete
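As a sanity check, the reported throughput is simply the total transfer size divided by the wall-clock time:

400 KB = 400 × 1024 bytes = 409,600 bytes
409,600 bytes / 5,425 µs ≈ 75.5 MB/s

which matches the reported 75 MB/sec.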
root@ptlnx23:~# dma-proxy-test 100 8 1
DMA proxy test
Verify = 1
Time: 6778 microseconds
Transfer size: 800 KB
Throughput: 120 MB / sec
DMA proxy test complete
root@ptlnx23:~# dma-proxy-test 100 16 1
DMA proxy test
Verify = 1
Time: 9625 microseconds
Transfer size: 1600 KB
Throughput: 170 MB / sec
DMA proxy test complete
root@ptlnx23:~# dma-proxy-test 100 32 1
DMA proxy test
Verify = 1
Time: 15152 microseconds
Transfer size: 3200 KB
Throughput: 216 MB / sec
DMA proxy test complete
root@ptlnx23:~# dma-proxy-test 100 64 1
DMA proxy test
Verify = 1
Time: 26116 microseconds
Transfer size: 6400 KB
Throughput: 250 MB / sec
DMA proxy test complete
root@ptlnx23:~# dma-proxy-test 100 128 1
DMA proxy test
Verify = 1
dmaengine_prep*() error
dmaengine_prep*() error
dmaengine_prep*() error
dmaengine_prep*() error
DMA timed out
DMA timed out
Proxy rx transfer error, # transfers 100, # completed 28, # in progress 32
Proxy tx transfer error
Proxy tx transfer error
We see that the application errors out when the transfer size is set to 128 KB. This issue is also reported in the wiki, and the proposed solution is to increase the 'Width of Buffer Length Register' for larger transfers.
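To see why 128 KB is the breaking point, recall that each transfer must be describable by the buffer length register (a back-of-the-envelope check, assuming the 2^W − 1 limit from the product guide):

128 KB = 131,072 bytes = 2^17 bytes
→ needs a length register at least 18 bits wide (2^18 − 1 = 262,143 bytes)

A width of 26, the IP maximum, shows up in the device tree as xlnx,sg-length-width = <0x1a> and allows transfers of up to ~64 MB. So, let's update our DMAC IP's configuration: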
Build everything and run again:
root@ptlnx23:~# modprobe dma-proxy
dma_proxy: loading out-of-tree module taints kernel.
dma_proxy module initialized
Device Tree Channel Count: 2
Creating channel dma_proxy_tx
Allocating memory, virtual address: df100000 physical address: 1f100000
Creating channel dma_proxy_rx
Allocating memory, virtual address: df600000 physical address: 1f600000
root@ptlnx23:~# dma-proxy-test 100 128 1
DMA proxy test
Verify = 1
Time: 31256 microseconds
Transfer size: 12800 KB
Throughput: 419 MB / sec
DMA proxy test complete
Great, this time it goes through, and the throughput is also increased quite significantly. Note that I have also changed the Max Burst Size and Data Widths in the updated config, so those contribute to the better performance as well.
Alright! That's it folks for this part. We saw how to set up and configure a basic dataplane infrastructure. Later we'll see how to insert DUT IPs that accept a stream, process the data, and output another stream. The DMAC in that case fetches data from memory, streams it to the DUT, receives the processed data stream, and pushes it back to memory. There may also be cases where we're only interested in memory-to-stream (DACs) or stream-to-memory (ADCs). Everything can be configured here.
See you soon!
Other parts in this series:
My previous related series: