Xilinx DMA PCIe tutorial-Part 3

In Part 1 of this tutorial I went over the basic issues related to DMA. In Part 2 I dove deeper and gave my two cents on the configuration of the XDMA PCIe core. In this Part 3 I will go over each of the blocks I designed and implemented on the way to a fully working DMA PCIe system. This is the last part of the three-part tutorial.

I'll post the block diagram from Part 2 here again:

[Figure: block diagram of the full design, as shown in Part 2]

Block 1 was explained in Part 2, so I'll skip directly to Block 2.

AXI to Native Block - Block 2

When starting to design such a project, one needs to define a register interface to the outside world. This is obviously based on some form of communication (USB, I2C or… PCIe!). I chose PCIe, since the whole project is based on it. What we need is a WR/RD command from the driver towards the user logic. A short scheme describes it best:

[Figure: WR/RD command path from the driver to the user logic]

The first thing we need to do is define an AXI to Native block. Vivado can generate this for us. Using the ‘Create and package new IP…’ option (via the Tools menu), the user can create the needed source files. This menu offers several options. We need to define the new IP as a Slave, and AXI4-Lite is enough. All other fields should be left unchanged, as shown below:

[Figures: ‘Create and package new IP…’ wizard settings for the AXI4-Lite slave]

After Vivado creates the source files, we will modify them: the master's write/read commands are internal to the generated code, and we want to bring them out as ports.

The generated block defines a number of registers named slv_regN, where N is the index of the register. In the screenshot above I defined 4 registers, so the code has slv_reg0 to slv_reg3. These registers are not used at all in my project. They are internal and only used for debug purposes (configured as RD/WR registers). In the code snippet below we can see that slv_reg_wren represents a write command from the driver towards the user logic.

[Image: generated code snippet showing slv_reg_wren]

As such, we want to route this WR command out of the block and use it as the driver's write strobe towards our user logic. It is as simple as this:

cpu_wr <= slv_reg_wren; --cpu_wr is an output port

At this point I want to stop and explain a few things before moving forward.

Actually, I could have defined slv_reg_wren itself as an output port and deleted the line above (remember I wrote that I do not use it?), but I wanted to make as few changes as possible to the Xilinx code. Since in VHDL we cannot read back the state of an output port (unless we're using VHDL-2008, which I did not use in my code), I preferred to leave slv_reg_wren as a signal and define a new output port called cpu_wr.

The same rule applies to all the other commands shown in the figure above. The Xilinx source code uses them as internal signals; we need to expose them as ports and use them in our user logic to pass values to and from it over the PCIe interface.

At the end of the Xilinx auto-generated code I simply added these lines (after declaring the output ports):

[Image: code snippet with the added output port assignments]
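
The original snippet survives only as an image, so here is a minimal sketch of what such added lines could look like. It is not the author's code. The input names follow the Vivado AXI4-Lite slave template as I remember it (slv_reg_wren, slv_reg_rden, axi_awaddr, axi_araddr, S_AXI_WDATA); of the outputs, only cpu_wr appears in the text, and cpu_rd, cpu_wr_addr, cpu_rd_addr and cpu_wr_data are hypothetical names. The wrapper entity exists only to make the sketch compile on its own: in the generated file these inputs are internal signals, not ports.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: the inputs stand in for signals that are internal to the
-- Vivado-generated AXI4-Lite slave. Output names other than cpu_wr are
-- hypothetical.
entity axi_native_tap is
  generic (
    ADDR_WIDTH : integer := 4;   -- assumption: 4 registers x 4 bytes
    DATA_WIDTH : integer := 32
  );
  port (
    slv_reg_wren : in  std_logic;  -- write strobe decoded by the template
    slv_reg_rden : in  std_logic;  -- read strobe decoded by the template
    axi_awaddr   : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    axi_araddr   : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    s_axi_wdata  : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    cpu_wr       : out std_logic;
    cpu_rd       : out std_logic;
    cpu_wr_addr  : out std_logic_vector(ADDR_WIDTH-1 downto 0);
    cpu_rd_addr  : out std_logic_vector(ADDR_WIDTH-1 downto 0);
    cpu_wr_data  : out std_logic_vector(DATA_WIDTH-1 downto 0)
  );
end entity axi_native_tap;

architecture rtl of axi_native_tap is
begin
  -- Plain concurrent assignments: the internal signals stay readable
  -- inside the block, and the new ports only mirror them.
  cpu_wr      <= slv_reg_wren;
  cpu_rd      <= slv_reg_rden;
  cpu_wr_addr <= axi_awaddr;
  cpu_rd_addr <= axi_araddr;
  cpu_wr_data <= s_axi_wdata;
end architecture rtl;
```

A read-data path back from the user logic would be needed as well; it is left out here because the text does not show how it was wired.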


All in all, we now have a block that lets us interact with our user logic via the PCIe driver!

Now we can move forward.

Config AXI Modules - Block 3

In Block 3 (“External Config AXI Modules”) we can see 2 sub-blocks. Together, these sub-blocks give the user a full native-to-AXI solution.

[Figure: Block 3 with its two sub-blocks, config_AXI_core and config_AXI_to_reg]

Starting with the config_AXI_core sub-block: it translates commands from the outside world into a full AXI interface, via the config_AXI_to_reg block.

Hard to follow? This is not so complicated…

There are 2 situations in which the AXI core can be configured during run-time of the design:

  • Inside your RTL code (part of a state machine, for example). This is what config_AXI_to_reg is used for.
  • Through a communication interface (UART, Ethernet, or PCIe in my case), when the user wants to read/write registers from outside. In that case the user must have some sort of interface block. This is exactly what config_AXI_core is used for.

Regarding config_AXI_to_reg: since my code is not AXI compatible (it is pure ‘native’ code), I need some sort of native-to-AXI translator. I could have written all my blocks with an AXI interface. The alternative, which I chose, is to translate the native signals of my source blocks to AXI in order to configure the various AXI blocks in the design.

A nicer solution would have been to use a MicroBlaze for these AXI configurations. I admit this is the more elegant solution. Still, I chose to work in VHDL all the way, and this includes the AXI configurations. I think it is ultimately simpler to have it all in RTL rather than split it between MicroBlaze and RTL blocks. The drawback is that all configurations must be carried out in RTL, which is a bit cumbersome.

Going back to the block diagram, we have various blocks with AXI interfaces. Each one needs its own native-to-AXI configuration block connected to it.

How do we define such a block?

We use the ‘Create and package new IP…’ option in Vivado (via the Tools menu), the same as we did for the CPU commands, except that now we define the block as a Master (rather than a Slave, as in the CPU command block).

[Figure: ‘Create and package new IP…’ wizard settings for the AXI master]

Once I created this block, I imported it into my BD as an RTL module. Let's look inside the generated source code. Looking at the ports, we first see this input port:

[Image: declaration of the input port that initiates an AXI transaction]

which the user drives to initiate an AXI transaction. Obviously, you'll need that.

Other important ports include all the AXI4-Lite ports. We'll need at least some of them for our configuration.

I want to focus on these 3 interesting signals, which belong to the internal AXI4 state machine:

[Image: the INIT_WRITE, INIT_READ and INIT_COMPARE state machine signals in the generated code]

Putting aside INIT_COMPARE, which is not needed (it is only used for comparison in the Xilinx example code), INIT_WRITE and INIT_READ are used to choose which action is needed (write or read).

Eventually, the user should expose the AXI4 interface signals as ports (at least those needed to run the Xilinx AXI4 state machines in the source code). In the figure below you can see what I chose to work with as the external AXI4 interface.

[Figure: the external AXI4 interface ports chosen for the block]
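
To make the native-to-AXI idea concrete, here is a minimal sketch of a single AXI4-Lite write driven by a native start pulse. It is written from scratch for illustration and is not the Xilinx-generated master nor the author's modified version. The native-side names (start, addr, wdata, done) are hypothetical, and AWPROT and BRESP are left out for brevity, so a real design would still have to drive and check them.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch: one AXI4-Lite write per 'start' pulse (hypothetical native side).
entity native_to_axil_write is
  port (
    clk           : in  std_logic;
    rst_n         : in  std_logic;                      -- active low, synchronous
    start         : in  std_logic;                      -- one-cycle request pulse
    addr          : in  std_logic_vector(31 downto 0);
    wdata         : in  std_logic_vector(31 downto 0);
    done          : out std_logic;                      -- one-cycle pulse on response
    m_axi_awaddr  : out std_logic_vector(31 downto 0);
    m_axi_awvalid : out std_logic;
    m_axi_awready : in  std_logic;
    m_axi_wdata   : out std_logic_vector(31 downto 0);
    m_axi_wstrb   : out std_logic_vector(3 downto 0);
    m_axi_wvalid  : out std_logic;
    m_axi_wready  : in  std_logic;
    m_axi_bvalid  : in  std_logic;
    m_axi_bready  : out std_logic
  );
end entity native_to_axil_write;

architecture rtl of native_to_axil_write is
  signal busy      : std_logic := '0';
  signal awvalid_r : std_logic := '0';
  signal wvalid_r  : std_logic := '0';
  signal bready_r  : std_logic := '0';
begin
  m_axi_awvalid <= awvalid_r;
  m_axi_wvalid  <= wvalid_r;
  m_axi_bready  <= bready_r;
  m_axi_wstrb   <= (others => '1');   -- always write all four bytes

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';                     -- default: done is a one-cycle pulse
      if rst_n = '0' then
        busy      <= '0';
        awvalid_r <= '0';
        wvalid_r  <= '0';
        bready_r  <= '0';
      elsif busy = '0' then
        if start = '1' then
          -- Latch the request and raise both valids together.
          m_axi_awaddr <= addr;
          m_axi_wdata  <= wdata;
          awvalid_r    <= '1';
          wvalid_r     <= '1';
          bready_r     <= '1';
          busy         <= '1';
        end if;
      else
        -- Each channel drops its valid once the slave has accepted it.
        if m_axi_awready = '1' then
          awvalid_r <= '0';
        end if;
        if m_axi_wready = '1' then
          wvalid_r <= '0';
        end if;
        -- The write response closes the transaction.
        if m_axi_bvalid = '1' then
          bready_r <= '0';
          busy     <= '0';
          done     <= '1';
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

A calling state machine would pulse start, then wait for done before issuing the next access. The read channel follows the same pattern with ARVALID/ARREADY and RVALID/RREADY.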

DMA Channel - Block 4

Our next block is a vital one. This is the DMA channel block. It holds all the sub-blocks related to the functional design of the DMA controller.

[Figure: DMA channel block with sub-blocks A to E]

This block configures the registers and controls the activity of the XDMA IP core. It has various sub-blocks, which I've labeled A-E.

RXTX_DMA_controller – Sub Block A

The controller source code should have several state machines: an XDMA config state machine (native to AXI, as I described earlier) to interact with the XDMA registers, a descriptor bypass state machine (recommended if you want higher bandwidth), and other state machines according to your logic design. These 2 are the most important ones.

Since I decided to control the descriptors on my own, I had to define external ports on my block and connect them to the dsc_bypass_h2c/c2h inputs of the XDMA.

Figures 6, 7 and 8 in PG195 illustrate how all this should be carried out.

[Figure: descriptor bypass figures from PG195]

The user needs to design the state machines according to these figures, plain and simple. Other than that, I did want to share a few hints regarding figure 8 (PG195). This figure illustrates several ways to implement descriptor bypass. You can pass the descriptors using any of the 3 options shown in the figure.

The simplest way, as I see it, is option A. Both option B and option C force you to set all signals in the same clock tick in which dsc_byp_load is sampled high (which is feasible only if all command signals are variables; otherwise they will only be set on the next clock). So option A is what I eventually did.

[Figure: figure 8 from PG195 with option A marked]
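
As an illustration of the timing concern above, here is a minimal sketch of a descriptor push in which the load strobe and all descriptor fields are registered in the same process, so they change on the same clock edge. It does not claim to reproduce any specific option of figure 8 and is not the author's state machine. The dsc_byp_* names and widths follow the XDMA descriptor bypass ports as I recall them from PG195 (check them against your IP version); the push_* side is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch: present one descriptor to an XDMA descriptor bypass port.
entity dsc_bypass_push is
  port (
    clk              : in  std_logic;
    rst_n            : in  std_logic;                      -- active low
    -- hypothetical request interface from the user state machine
    push             : in  std_logic;
    push_src         : in  std_logic_vector(63 downto 0);
    push_dst         : in  std_logic_vector(63 downto 0);
    push_len         : in  std_logic_vector(27 downto 0);
    push_ctl         : in  std_logic_vector(15 downto 0);
    push_ack         : out std_logic;                      -- request was taken
    -- towards the XDMA (names/widths assumed from PG195)
    dsc_byp_ready    : in  std_logic;
    dsc_byp_load     : out std_logic;
    dsc_byp_src_addr : out std_logic_vector(63 downto 0);
    dsc_byp_dst_addr : out std_logic_vector(63 downto 0);
    dsc_byp_len      : out std_logic_vector(27 downto 0);
    dsc_byp_ctl      : out std_logic_vector(15 downto 0)
  );
end entity dsc_bypass_push;

architecture rtl of dsc_bypass_push is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Defaults: load and ack are one-cycle pulses.
      dsc_byp_load <= '0';
      push_ack     <= '0';
      if rst_n = '1' and push = '1' and dsc_byp_ready = '1' then
        -- Fields and load are assigned in the same clock edge, so the
        -- XDMA sees them valid together one cycle later.
        dsc_byp_src_addr <= push_src;
        dsc_byp_dst_addr <= push_dst;
        dsc_byp_len      <= push_len;
        dsc_byp_ctl      <= push_ctl;
        dsc_byp_load     <= '1';
        push_ack         <= '1';
      end if;
    end if;
  end process;
end architecture rtl;
```

The caller is expected to hold push and its fields until push_ack pulses. Because ready is sampled one cycle before load appears, this relies on the channel still accepting a descriptor in that cycle, which should be verified against the PG195 timing diagrams.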

Furthermore, you could (or should) also define status ports in your block and connect them to the DMA status ports of the XDMA. These are convenient for your state machines because they are plain I/O ports rather than registers, which makes the design much simpler. Needless to say, the same information can also be read through the XDMA registers, but this is far less convenient: it involves going back and forth between your state machines and adds latency to the design (since you have to wait for the ‘AXI done’ signal before issuing another AXI write/read command).

Config AXI to reg – Sub Block B

This sub-block is exactly the same one I placed in Block 3 ("Config AXI Modules"). Remember the 2 situations I described in which the user needs to configure an AXI core?

Here I am referring to the first one:

Inside your RTL code (part of a state machine, for example). This is what config_AXI_to_reg is used for.

DDR AXI Master – Sub Block C

Since I wanted to support the DDR already installed on the KCU105 board, I added this block. It is connected to the DDR via the AXI_InterSmart_Connect block (Block 5, see below) and provides read/write access to the DDR.

This block is taken as a whole from the Silica GitHub source code (Designing-a-Custom-AXI-Master-using-BFMs). The idea to use it was given to me by Itamar Kahalani, Xilinx FAE. But why do I actually need it? I could use the Xilinx ‘Create and package new IP’ method to design a native-to-AXI block in this simple way:

[Figure: DDR path using the native-to-AXI block created with ‘Create and package new IP’]

The blue sub-block above is the same block that was covered extensively earlier. So what is the problem with this method, and why didn't I choose it?

The problem is that with this method we cannot write to or read from the DDR in bursts. Xilinx's ‘Create and package new IP’ does create an AXI interface the user can modify, but there's no way to use AXI burst mode to write to the DDR.

Using Silica's free code is a great solution to this problem. Silica published full-featured source code, including a PDF manual, simulation files and practically whatever the user needs in order to interface with any AXI block, including burst mode. I needed a Master, so I used the sources from this link; Silica also has equivalent sources for a Slave (Designing-a-Custom-AXI-Slave-using-BFMs). Now my path looks a bit different:

[Figure: DDR path using the Silica AXI master]

It may seem more complicated, but it is quite the contrary. You get many options and new features with the Silica source code, and this really levels up your design.

BRAM Logic block - Sub Block D

Since I chose to work with DMA descriptors in bypass mode, I needed an on-chip memory to store them. The simplest solution is BRAM. Bear in mind that if the number of descriptors is extremely high (meaning the driver mapped tons of non-contiguous pages to the DMA), you could overflow the BRAM, so consider writing the scatter-gather table into the DDR instead. In my case this did not happen, so I worked with BRAM.

[Figure: BRAM logic sub-block]

When working with BRAM, the user can choose between a native interface and AXI. Typing ‘BRAM’ in the IP search bar gives 3 options:

The upper two components are the ones we're interested in. The first is the AXI BRAM Controller and, as you guessed, it is used when you want to access the BRAM using the AXI protocol. Looking at the IP customization window:

[Figure: AXI BRAM Controller customization window]

You can see the memory depth is greyed out. This is a commonly raised question, explained nicely in AR 66103. Only after you assign the block an address range in the Address Editor will you be able to configure the memory depth, so keep that in mind. You can see I've chosen AXI4-Lite rather than AXI4. This was sufficient, as I did not need the full AXI4 protocol.

Moving on to the Block Memory Generator: in its configuration window there is a dropdown list that lets you choose between 2 modes, “BRAM Controller” and “Stand Alone”. The first relates to the AXI4 interface, the latter to the native one (in which case no AXI BRAM Controller is needed).

[Figures: Block Memory Generator configuration window in “BRAM Controller” and “Stand Alone” modes]

When working in “Stand Alone” mode (native), pay attention to the small checkbox next to the mode selection called “Generate address interface with 32 bits”.

This checkbox instructs the BRAM to use byte addressing. When unchecked, it reverts to word addressing. Remember that when it is ticked, the logic driving the read/write port must increment the address by 4 for each word, as this can be confusing.

[Figure: the “Generate address interface with 32 bits” checkbox]
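
The difference between the two addressing modes can be sketched as a small address generator. This is an illustration only, assuming a 32-bit data port (hence the step of 4 bytes); the entity, the BYTE_ADDRESSING generic and the next_word strobe are hypothetical names.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: address counter for a native BRAM port with 32-bit data.
entity bram_addr_gen is
  generic (
    -- true  : "Generate address interface with 32 bits" is ticked
    -- false : plain word addressing
    BYTE_ADDRESSING : boolean := true
  );
  port (
    clk       : in  std_logic;
    rst_n     : in  std_logic;                      -- active low, synchronous
    next_word : in  std_logic;                      -- advance to the next word
    addr      : out std_logic_vector(31 downto 0)
  );
end entity bram_addr_gen;

architecture rtl of bram_addr_gen is
  -- Byte addressing steps by the word size in bytes, word addressing by 1.
  function addr_step (byte_mode : boolean) return natural is
  begin
    if byte_mode then
      return 4;
    else
      return 1;
    end if;
  end function addr_step;

  constant STEP : natural := addr_step(BYTE_ADDRESSING);
  signal addr_r : unsigned(31 downto 0) := (others => '0');
begin
  addr <= std_logic_vector(addr_r);

  process (clk)
  begin
    if rising_edge(clk) then
      if rst_n = '0' then
        addr_r <= (others => '0');
      elsif next_word = '1' then
        addr_r <= addr_r + STEP;
      end if;
    end if;
  end process;
end architecture rtl;
```

With the generic set to true, consecutive words sit at addresses 0, 4, 8 and so on; with it set to false they sit at 0, 1, 2.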

Naturally, when working in “Stand Alone” mode, the memory depth is not greyed out and the user can set the BRAM size as needed.


AXI Subset Converter – Sub Block E

Using the Xilinx AXI Subset Converter is a must in this design. The DMA data keeps flowing without any way to stop it, so I needed some way to back-pressure it. In the picture below you can see the implementation. The tReady port is brought out of the DMA_channel block and is later used by my user logic to back-pressure the DMA data arriving over PCIe.

[Figure: AXI Subset Converter with the tReady port brought out of the DMA_channel block]
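
How the user logic might drive that tReady is not shown in the article. One common pattern, sketched here under the assumption that the stream is written into a FIFO with an almost-full flag, is to deassert tready before the FIFO fills and to write only on a completed handshake. The entity and the FIFO-side names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch: back-pressure an AXI4-Stream source from a FIFO's almost-full flag.
entity axis_backpressure is
  port (
    fifo_almost_full : in  std_logic;  -- hypothetical flag from the user FIFO
    s_axis_tvalid    : in  std_logic;  -- from the DMA stream
    s_axis_tready    : out std_logic;  -- back to the DMA stream
    fifo_wr_en       : out std_logic   -- write strobe towards the user FIFO
  );
end entity axis_backpressure;

architecture rtl of axis_backpressure is
  signal ready_i : std_logic;
begin
  -- Accept data only while the FIFO still has room.
  ready_i       <= not fifo_almost_full;
  s_axis_tready <= ready_i;

  -- A beat is transferred only when valid and ready are both high.
  fifo_wr_en    <= s_axis_tvalid and ready_i;
end architecture rtl;
```

Using almost-full rather than full leaves headroom for any beats already in flight when tready drops.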

AXI Inter-Smart connect - Block 5

Lastly, we must use an AXI interconnect block to connect all the blocks together. You can use AXI SmartConnect (SMC) or AXI Interconnect (IC). Both will do the job. Just bear in mind that:

  • SMC is better in terms of bandwidth, but costs more logic resources.
  • On the other hand, IC has arbitration priority, which is not supported in SMC.
  • Resource utilization for both is listed here: SMC, IC
  • A nice trick I learned from Itamar Kahalani, Xilinx FAE, is to combine the two. This is a great way to benefit from both IPs. In the figure below you can see how the two are connected.

The user should connect the bandwidth-hungry consumers to the SMC (M01_AXI is connected to the DDR in my case), while keeping the less “important” consumers on the IC. They will also get access to the DDR, but with lower performance than their neighbors on the SMC (note that the IC occupies 1 input of the SMC). Also note that both resets that go into the IC pass through a logic AND gate on the way to the SMC, so the reset is common to both IC and SMC.

[Figure: AXI Interconnect cascaded into AXI SmartConnect, with the resets combined through an AND gate]
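
In RTL terms, the reset combination described above is a single gate. The sketch below assumes the resets are active low (the usual aresetn convention), in which case an AND asserts the output whenever either input reset is asserted. The signal names are hypothetical; in the block design this would typically be a Utility Vector Logic block rather than hand-written code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch: merge two active-low resets into one for the SmartConnect.
entity reset_merge is
  port (
    ic_aresetn_0 : in  std_logic;  -- first reset feeding the IC (assumed active low)
    ic_aresetn_1 : in  std_logic;  -- second reset feeding the IC (assumed active low)
    smc_aresetn  : out std_logic   -- combined reset towards the SMC
  );
end entity reset_merge;

architecture rtl of reset_merge is
begin
  -- Low on either input drives the output low, i.e. into reset.
  smc_aresetn <= ic_aresetn_0 and ic_aresetn_1;
end architecture rtl;
```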

Register Interface and mon_fsm blocks - Block 6

The last 2 blocks gather all the user logic and DMA control registers in one place (mon_fsm = Monitor FSM block, used to hold all state machine debug registers in one place). This is done just to make life easier. It is not a must, and your design will surely run even when the external register interfaces are wandering all around it. It is the user's choice whether to place all external registers in one block or not. Still, from my experience, it is more convenient to look for a register in one known block than to remember which block you placed it in.

Both blocks are designed the same way. The AXI to Native block (Block 2) produces the native register interface signals (translated from AXI). These native signals, driven from the outside world over the PCIe interface, go into these two blocks.
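
A register block of this kind could look roughly like the sketch below: it decodes the native write/read strobes and address coming from the AXI to Native block into one writable control register and one read-only status register. The address map, the register names and every port except cpu_wr are assumptions made for illustration; the article does not show the author's register block.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch: native register file fed by the AXI to Native block.
entity reg_interface is
  port (
    clk         : in  std_logic;
    rst_n       : in  std_logic;                      -- active low, synchronous
    cpu_wr      : in  std_logic;                      -- write strobe
    cpu_rd      : in  std_logic;                      -- read strobe (hypothetical)
    cpu_addr    : in  std_logic_vector(7 downto 0);   -- byte address (hypothetical)
    cpu_wr_data : in  std_logic_vector(31 downto 0);
    cpu_rd_data : out std_logic_vector(31 downto 0);
    ctrl_reg    : out std_logic_vector(31 downto 0);  -- towards the user logic
    status_in   : in  std_logic_vector(31 downto 0)   -- from the user logic
  );
end entity reg_interface;

architecture rtl of reg_interface is
  -- Hypothetical address map.
  constant ADDR_CTRL   : std_logic_vector(7 downto 0) := x"00";
  constant ADDR_STATUS : std_logic_vector(7 downto 0) := x"04";

  signal ctrl_r : std_logic_vector(31 downto 0) := (others => '0');
begin
  ctrl_reg <= ctrl_r;

  process (clk)
  begin
    if rising_edge(clk) then
      if rst_n = '0' then
        ctrl_r      <= (others => '0');
        cpu_rd_data <= (others => '0');
      else
        -- Write path: only the control register is writable.
        if cpu_wr = '1' and cpu_addr = ADDR_CTRL then
          ctrl_r <= cpu_wr_data;
        end if;
        -- Read path: registered read data, valid one cycle after cpu_rd.
        if cpu_rd = '1' then
          case cpu_addr is
            when ADDR_CTRL   => cpu_rd_data <= ctrl_r;
            when ADDR_STATUS => cpu_rd_data <= status_in;
            when others      => cpu_rd_data <= (others => '0');
          end case;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

Keeping the whole address map in one case statement is what makes the "one known block" approach convenient: adding a register means adding one constant and one branch.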


Putting it all together

So, after everything is designed, connected, debugged and verified, the last step is to design a DMA test component. Such a component can be designed quite easily, and I explained it in Part 2 of this tutorial.

Below you can see the results I got using 4 channels in the H2C direction, after capturing the data and cycle counts and dividing one by the other (using the Cycle Count and Data Count registers as explained in Part 2). I calculated the total throughput to be 7.49 GB/s.

This is not entirely accurate, as the tests did not start at the same time (I had a problem with the Linux server and could not run them in 'pure' parallel). Still, I can say I ran it a few times and the average results were all above 7 GB/s. Well done, Xilinx!
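
The division described above can be written out explicitly. This is a sketch under assumptions: D is taken to be the number of bytes transferred on a channel (derived from the Data Count register), C the number of clock cycles from the Cycle Count register, and f_clk the frequency of the clock that the cycle counter runs on. Summing the four channels to get the total is also an assumption about how the 7.49 GB/s figure was obtained.

```latex
\[
  T_{\mathrm{ch}} = \frac{D}{C / f_{\mathrm{clk}}} = \frac{D \cdot f_{\mathrm{clk}}}{C},
  \qquad
  T_{\mathrm{total}} = \sum_{\mathrm{ch}=0}^{3} T_{\mathrm{ch}}
\]
```

Since the channels did not start at the same instant, the per-channel measurement windows only partly overlap, which is why the summed figure should be read as an estimate.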

[Figures: measured Cycle Count and Data Count results for the 4 H2C channels]

A few last words before I finish this tutorial. As mentioned, I think the XDMA is a real life saver when implementing DMA in a PCIe system. The manual is loaded with registers and other important material, and it seems pretty clear for the average and above-average user. There are a few drawbacks I covered earlier (such as the maximum of 4 channels), but overall, nice job!

Anh Tran

Engineer at BKASIC

2 years ago

Thank you for sharing with us such an informative and insightful article. By the way, Could you please explain how you predefine the source address (H2C) and destination address (C2H) for the descriptor bypass interface? Because I knew that the DMA engine works with physical address memory and every time on PC, I run the Xilinx driver tools (dma_to_device), I observed that these physical addresses are assigned randomly ( observed on kernel message log). So If you could share with us how you configure those source addresses (H2C) and destination addresses (C2H) for descriptor bypass mode in detail, that would be very helpful. Thank you so much.

Mikail Demirtaş

Senior Design and Verification Engineer

3 years ago

You have written a really useful article .Thanks.

Nice work! Thanks. Did you have a chance to look at RIFFA project? Which one is the best option for resource limited and performance needed systems?

Rahul Verma

Sr. Silicon Design Engineer

5 years ago

Very nice sir thank you very much
