High-Performance Ethernet Using Multiple Cores
A few years ago I had to solve a very challenging problem:
[Figure: FreeRTOS + LWIP running on the ARM core, a custom DSP algorithm processing packets on SHARC2, and a custom packet router running on SHARC1]
I needed to receive and transmit 1000 bytes of data (each way) every 10 µs for a DSP algorithm over UDP, while at the same time running a full TCP/IP stack (for ssh, DHCP, and various low-bandwidth TCP control sockets). The DSP took 10 µs to process each packet, so the work had to be pipelined: while packet 2 was being processed, packet 1 was being transmitted and packet 3 was being received.
T0 = rx packet 0
T1 = process packet 0, rx packet 1
T2 = tx packet 0, process packet 1, rx packet 2
T3 = tx packet 1, process packet 2, rx packet 3
T4 = tx packet 2, process packet 3, rx packet 4
........
where consecutive time steps are 10 µs apart.
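The schedule above is a three-stage pipeline in which each stage lags the previous one by one time slot. A minimal sketch of that bookkeeping follows. It assumes three buffers used round-robin so that no two stages share a buffer in the same slot; the buffer count and all names (`packet_in_stage`, `buffer_for_packet`, `NUM_BUFFERS`) are illustrative and not taken from the original project.

```c
#include <assert.h>
#include <stdio.h>

/* Stage order matches the schedule: rx first, then process, then tx. */
enum stage { STAGE_RX = 0, STAGE_PROCESS = 1, STAGE_TX = 2 };

#define NUM_BUFFERS 3  /* assumption: one buffer per pipeline stage */

/* Packet handled by stage `s` in time slot `t`, or -1 while the
 * pipeline is still filling (for example, nothing to tx at T0). */
int packet_in_stage(int t, enum stage s)
{
    int pkt = t - (int)s;   /* rx lags 0 slots, process 1, tx 2 */
    return pkt >= 0 ? pkt : -1;
}

/* Round-robin buffer assignment: packet n always lives in buffer
 * n % NUM_BUFFERS, so rx, process, and tx never collide in one slot. */
int buffer_for_packet(int pkt)
{
    assert(pkt >= 0);
    return pkt % NUM_BUFFERS;
}

/* Print the first `slots` rows of the schedule. */
void print_schedule(int slots)
{
    for (int t = 0; t < slots; t++) {
        printf("T%d: rx=%d process=%d tx=%d\n", t,
               packet_in_stage(t, STAGE_RX),
               packet_in_stage(t, STAGE_PROCESS),
               packet_in_stage(t, STAGE_TX));
    }
}
```

Calling `print_schedule(5)` lists the same T0 to T4 rows as the schedule, with -1 marking a stage that has nothing to do yet.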
This all had to run on an ADI SC589 SoC (one ARM core plus two SHARC cores).
https://www.analog.com/media/en/dsp-documentation/processor-manuals/SC58x-2158x-hrm.pdf
For the DSP to process the data in 10 µs, the data had to be in on-chip L2 SRAM (running at the same clock as the SHARC), and the DSP had to have all of its cycles dedicated to the algorithm. Having high-speed on-chip RAM was the key to the success of this project: it allowed the use of DMA without paying the penalty of non-cached memory accesses from the DSP. On the high-performance path, the non-cached DDR was only accessed by peripheral DMA (PDMA) or MDMA. Only the slow path, LWIP running on the ARM, took a performance hit, because its buffers were in the non-cached DDR regions.
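To put the 10 µs budget into numbers, here is a small back-of-the-envelope sketch. The payload size and slot time come from the text; the 450 MHz core clock used in the note after the code is only an example value, and the "two copies per slot" figure assumes the TX path mirrors the RX path with one MDMA copy in each direction. Function names are hypothetical.

```c
#include <stdint.h>

#define PAYLOAD_BYTES 1000u   /* from the text: 1000 bytes each way */
#define SLOT_NS       10000u  /* from the text: one packet every 10 us */

/* Payload rate in one direction, in Mbit/s (bits per ns * 1000). */
uint32_t payload_mbit_per_s(void)
{
    return (uint32_t)((uint64_t)PAYLOAD_BYTES * 8u * 1000u / SLOT_NS);
}

/* Aggregate MDMA copy rate in MB/s for `copies_per_slot` payload-sized
 * copies every slot (2 = one DDR->L2 copy plus one L2->DDR copy,
 * which is an assumption about the TX path). */
uint32_t mdma_mbyte_per_s(uint32_t copies_per_slot)
{
    return (uint32_t)((uint64_t)copies_per_slot * PAYLOAD_BYTES * 1000u / SLOT_NS);
}

/* Core cycles available per slot at a given core clock in MHz. */
uint32_t cycles_per_slot(uint32_t core_mhz)
{
    return (uint32_t)((uint64_t)core_mhz * SLOT_NS / 1000u);
}
```

`payload_mbit_per_s()` works out to 800 Mbit/s in each direction before any Ethernet, IP, or UDP header overhead, and `mdma_mbyte_per_s(2)` to 200 MB/s of copying. With an example clock of 450 MHz, `cycles_per_slot(450)` gives 4500 cycles per packet, or 4.5 cycles per payload byte, which is why the DSP could not afford stalls on non-cached memory.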
The Ethernet driver ran on SHARC1 and DMA'd the packets to and from DDR (in a region with the cache disabled). The packet router (also running on SHARC1) peeked at the destination port field of each packet; if the packet was destined for a specific port, it was copied to L2 SRAM using memory-to-memory DMA (MDMA). All other packets were processed directly in DDR by LWIP running on the ARM under FreeRTOS. On the fast path, only the peek into DDR incurred a non-cached access hit.
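The routing decision described above can be sketched as a small classifier that reads only the header bytes it needs. This is an illustration rather than the original router: it assumes untagged Ethernet II frames carrying IPv4, and the names `classify_frame`, `ROUTE_FAST_PATH`, and `ROUTE_SLOW_PATH` are hypothetical. Anything it does not recognise is left to the slow path.

```c
#include <stddef.h>
#include <stdint.h>

enum route { ROUTE_SLOW_PATH, ROUTE_FAST_PATH };

/* Read a big-endian (network order) 16-bit field. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decide where a received frame goes by peeking at its headers.
 * Only UDP-over-IPv4 frames addressed to `fast_port` take the fast
 * path; everything else is handed to the TCP/IP stack. */
enum route classify_frame(const uint8_t *frame, size_t len, uint16_t fast_port)
{
    const size_t eth_hdr = 14;              /* dst MAC, src MAC, EtherType */

    if (len < eth_hdr + 20)
        return ROUTE_SLOW_PATH;             /* too short for an IPv4 header */
    if (read_be16(frame + 12) != 0x0800)
        return ROUTE_SLOW_PATH;             /* EtherType is not IPv4 */

    const uint8_t *ip = frame + eth_hdr;
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4; /* IPv4 header length in bytes */

    if ((ip[0] >> 4) != 4 || ihl < 20)
        return ROUTE_SLOW_PATH;             /* malformed IPv4 header */
    if (len < eth_hdr + ihl + 8)
        return ROUTE_SLOW_PATH;             /* no room for a UDP header */
    if (ip[9] != 17)
        return ROUTE_SLOW_PATH;             /* protocol is not UDP */
    if ((read_be16(ip + 6) & 0x1FFF) != 0)
        return ROUTE_SLOW_PATH;             /* later fragment: no UDP header */

    const uint8_t *udp = ip + ihl;
    return read_be16(udp + 2) == fast_port  /* UDP destination port */
               ? ROUTE_FAST_PATH
               : ROUTE_SLOW_PATH;
}
```

For a standard 20-byte IPv4 header the function touches bytes 12 to 37 of the frame, so the cost of the non-cached peek stays small and fixed regardless of payload size.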
The most challenging aspects of this project were tuning the MDMA configuration for maximum performance, modifying the LWIP Ethernet driver layer, and signaling (not shown in the above image). All the signaling was done with the SC589 trigger routing unit (TRU), a hardware module that makes it easy to send signals between cores.
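The project itself used the hardware trigger unit, and its register programming is not reproduced here. Purely as a host-runnable model of the signaling pattern (one stage tells the next that a buffer is ready), the sketch below uses a counting doorbell built on C11 atomics. It is a software stand-in for reasoning about the flow, not a description of how the trigger unit works, and `struct doorbell`, `doorbell_ring`, and `doorbell_take` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One-directional doorbell between a single producer and a single
 * consumer. Counting signals (instead of a single flag) means two
 * rings in a row are not collapsed into one. */
struct doorbell {
    atomic_uint raised;   /* total signals raised by the producer */
    unsigned    handled;  /* total signals consumed (consumer-private) */
};

/* Producer side: announce that one more buffer is ready. */
void doorbell_ring(struct doorbell *d)
{
    atomic_fetch_add_explicit(&d->raised, 1u, memory_order_release);
}

/* Consumer side: returns true and consumes one signal if any is
 * pending, false otherwise. Never blocks. */
bool doorbell_take(struct doorbell *d)
{
    unsigned raised = atomic_load_explicit(&d->raised, memory_order_acquire);
    if (raised == d->handled)
        return false;
    d->handled++;
    return true;
}
```

In this model the RX stage would ring one doorbell when a packet has landed in L2 SRAM and the DSP stage would ring another when processing is done; on the real hardware those events were delivered by the trigger unit instead of shared memory.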
- https://www.freertos.org/
- https://savannah.nongnu.org/projects/lwip/
- https://www.analog.com/media/en/technical-documentation/application-notes/EE377v01.pdf
- https://www.analog.com/media/en/technical-documentation/application-notes/EE383v01.pdf