Performance techniques in packet processing applications
I got a chance to attend the OVS conference recently, where many of the talks focused on improving the performance of OVS. Previously, I had the great opportunity to lead talented people at Intoto/Freescale in the packet processing area. The OVS conference rekindled my memories of Intoto :-). I thought I would list some of the techniques we developed over time to improve the performance of networking/security applications.
Context/Background - Anatomy of packet processing applications
You can skip this section if you are in a hurry to get to the performance techniques.
Some examples of packet processing applications are L2 switching, L3 forwarding (unicast, multicast), firewall, NAT, SLB, DDoS mitigation, IDS/IPS, IPsec, MACsec, PPP, OpenFlow/OVS, DTLS, TCP, UDP, TLS, traffic policing and traffic shaping.
Many packet processing modules typically have this flow (set of stages):
Validate the packet: some examples are checksum validation, ensuring that the length provided in the header matches the length of the buffer that holds the packet, ensuring that any offsets mentioned in the header are within the packet buffer limits, special validations keeping SG (scatter-gather) buffers in mind, and of course any specific validations required by that packet processing module.
Extract values from the packet: the values of the fields needed for further processing are extracted at this stage.
Search for matching context: many packet processing modules maintain a context block (CB) for each session. The actions taken on a packet depend on the information stored in the CB. In some cases these context blocks act as caches of configured entries, and in other cases they are active context blocks that must exist as long as the session is active. For faster lookup, these context blocks are placed in data structures such as hash tables, LPM tries, etc.
Create matching context: when a matching context is not found, the configured policies are searched and a matching context is created and placed in the appropriate data structures (e.g., exact-match hash tables). For example, in the firewall/NAT/SLB cases, the ACL tables are searched; based on the matching ACL entry, a CB (a pair of flow entries plus a common information block) is created and populated, and the flows are then placed in the hash table.
Processing: the actual packet actions happen at this stage, based on the information present in the matching context block.
Output: the transformed packet is sent to the next module or to the outside world. Note that in a few cases the packet may be duplicated and sent to different modules, the packet may be dropped, or several input packets may be combined into a smaller number of packets.
Timers: many packet processing applications have associated timers. There are two types - inactivity timers and lifetime timers. Inactivity timers remove context blocks when there is no activity; lifetime timers remove context blocks once their lifetime is reached. A few examples: firewall/NAT/SLB all use inactivity timers, and IPsec uses lifetime timers.
Run-to-completion model: when multiple packet processing modules need to be applied to a packet, programmers prefer the run-to-completion model. In this model, once the packet is received (from hardware or some higher software module) on a core, all modules are run in sequence until the packet is sent out or queued. One such example is the combination of traffic policing, firewall/NAT, outbound IPsec and traffic shaping. The alternative is the pipeline model, where each module, after applying its actions to the packet, queues it to the next module. There are pros and cons to both approaches. The run-to-completion model avoids multiple enqueues/dequeues and the associated locking overheads, and avoids multiple context switches, and hence is the model developers prefer. But in some cases it may not use all cores efficiently. When different fields of the packet are used by the various modules, the chance of non-uniform core usage is high. Consider inbound IPsec, firewall/NAT and traffic shaping: the inbound IPsec SA CB granularity is "destination IP" plus "SPI", so one CB (tunnel) carries packets belonging to many firewall/NAT sessions, since firewall/NAT sessions are 5-tuple based. With run-to-completion, the core doing the inbound IPsec processing also runs the firewall/NAT module for all the inner packets of all those sessions; if the system has only a few tunnels compared to the number of cores, some cores sit unused. In the pipeline model, the inbound IPsec module, after decapsulating the packets, can place them on a set of queues for firewall/NAT to pick up and process. By running firewall/NAT in different software threads, there is a much better chance of uniform core usage, but with the overheads discussed above. So the decision is often made based on the number of cycles the packet processing requires: if it is high compared to the enqueue/dequeue/context-switching overheads, it makes sense to adopt the pipeline model.
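A minimal sketch of a run-to-completion loop, assuming a per-core RX queue; every type and function name here (rx_queue_dequeue, police, firewall_nat, ipsec_outbound, shape_and_transmit) is a hypothetical placeholder, not a real API:

```c
/*
 * Run-to-completion sketch: each core pulls a packet from its own RX queue
 * and runs every module in sequence before touching the next packet.
 * All types and function names are hypothetical placeholders.
 */
struct pkt;                 /* packet descriptor, defined elsewhere */
struct rx_queue;            /* per-core receive queue               */

extern struct pkt *rx_queue_dequeue(struct rx_queue *rxq);
extern int  police(struct pkt *p);          /* each stage returns 0 if it  */
extern int  firewall_nat(struct pkt *p);    /* consumed or dropped the     */
extern int  ipsec_outbound(struct pkt *p);  /* packet                      */
extern void shape_and_transmit(struct pkt *p);

void core_main_loop(struct rx_queue *rxq)
{
    for (;;) {
        struct pkt *p = rx_queue_dequeue(rxq);   /* core-local, no lock */
        if (!p)
            continue;

        if (!police(p) || !firewall_nat(p) || !ipsec_outbound(p))
            continue;                             /* dropped or queued   */

        shape_and_transmit(p);                    /* shaping + TX        */
    }
}
```

In a pipeline design, the call chain above would be cut at the module boundary and replaced by an enqueue to the next module's queue.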
Look-aside accelerators : Popular accelerators that are used by packet processing applications include
Look-aside accelerators are good for improving performance, and developers tend to use them. If they are not used in asynchronous mode, the performance benefits are limited. But using them in asynchronous mode brings maintenance challenges (discussed later under "Challenges with pointer references").
Developers have to be more careful when working with asynchronous accelerators.
Requirements
Performance considerations
Many performance challenges revolve around these.
Search
If the right data structure and hash algorithm (in the case of hash tables) are not used, search can become a bottleneck.
Locks
In many modules, the per-packet processing is not very heavy. Taking locks in the code makes the processing serial across multiple cores: the more threads one uses, the more core cycles go unused. In the case of spinlocks, core cycles are literally wasted. Moreover, the cycles spent locking and unlocking add to the packet processing overheads. As much as possible, locks should be avoided. Let us explore a few areas where data structure integrity must be maintained.
In a few cases, locks can be avoided altogether. Where they cannot be avoided, make sure to use the right operating-system locking primitives; for example, if your packet processing application runs in Linux user space, try using futexes. We also tried to use granular locks as much as possible - for example, manipulating hash bucket lists under per-bucket locks instead of one lock for the whole hash table.
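A minimal sketch of per-bucket locking, assuming a pthreads-based user-space application (pthread mutexes on Linux are built on futexes); the structure layout and names are illustrative:

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_BUCKETS 4096                /* power of two for cheap masking */

struct session {
    struct session *next;
    uint32_t hash;                      /* e.g. hash of the 5-tuple */
    /* ... per-session state ... */
};

struct hash_bucket {
    pthread_mutex_t lock;               /* protects only this bucket's list */
    struct session  *head;
};

static struct hash_bucket table[NUM_BUCKETS];

static void table_init(void)
{
    for (int i = 0; i < NUM_BUCKETS; i++)
        pthread_mutex_init(&table[i].lock, NULL);
}

/* Insertion takes only the bucket lock, so cores hashing to different
 * buckets never contend with each other. Lookup and removal follow the
 * same pattern. */
static void table_insert(struct session *s)
{
    struct hash_bucket *b = &table[s->hash & (NUM_BUCKETS - 1)];

    pthread_mutex_lock(&b->lock);
    s->next = b->head;
    b->head = s;
    pthread_mutex_unlock(&b->lock);
}
```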
Always try to use data structures that are RCU friendly. RCU read-side locks have practically no overhead (other than memory barriers), so packet processing performance does not suffer during search operations on these data structures.
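A minimal sketch of a lock-free read-side lookup, assuming the userspace RCU library (liburcu) and an illustrative bucket/session layout; process_packet() is a hypothetical per-session handler. Writers would publish new entries with rcu_assign_pointer() and wait for synchronize_rcu() before freeing removed ones:

```c
#include <urcu.h>               /* liburcu; each thread calls rcu_register_thread() first */
#include <stdint.h>
#include <stddef.h>

struct pkt;                     /* packet descriptor, defined elsewhere */

struct session {
    struct session *next;
    uint32_t hash;
    /* ... per-session state ... */
};

struct bucket {
    struct session *head;
};

extern struct bucket table[4096];
extern int process_packet(struct session *s, struct pkt *p);   /* hypothetical */

/* Readers take no lock at all; rcu_read_lock() is essentially free on the
 * fast path, so the search does not slow down packet processing. */
int lookup_and_process(uint32_t hash, struct pkt *p)
{
    struct session *s;
    int handled = 0;

    rcu_read_lock();
    for (s = rcu_dereference(table[hash & 4095].head);
         s != NULL;
         s = rcu_dereference(s->next)) {
        if (s->hash == hash) {
            handled = process_packet(s, p);
            break;
        }
    }
    rcu_read_unlock();          /* 's' must not be used after this point */
    return handled;
}
```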
As described in a later section (under packet reordering overheads), if one ensures that packets of a given session are processed by only one core at any time, one can avoid taking locks to protect the integrity of the session context variables.
On free queues: a typical technique we followed is to have core-specific and CB-specific free queues. Because of this, there is no need to take a lock while allocating and freeing buffers/context blocks. But there can be challenges with non-uniformity: packets are not processed uniformly by the cores, and some cores may get more packets than others, so dividing the context blocks across core-specific queues does not always work best. Hence, it is good to keep some minimum number of buffers in the core-specific queues and the rest in a global queue. During allocation, if there are no buffers/CBs in the core-specific queue, get one from the global queue (which requires a lock, and that is okay since this should happen infrequently). When a buffer/CB is freed, put it in the core-specific queue; if the number of entries exceeds the per-core limit, move the extra ones to the global queue. During allocation, if there are no buffers in either the core-specific queue or the global queue, fall back to dynamic allocation. We also found that some core-specific queues were underused or not used at all, so we created learning mechanisms to increase or decrease the per-core queue threshold based on observed usage. Since free queue management is required by almost all packet processing modules, we made it available as a library and hid all of the above complexity inside it.
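A minimal sketch of the per-core free list with a global fallback; the core count, threshold and names are illustrative, and in our library the per-core threshold was adjusted dynamically rather than fixed as here:

```c
#include <pthread.h>
#include <stdlib.h>

#define MAX_CORES      16
#define PER_CORE_LIMIT 256          /* fixed here; adaptive in the real library */

struct buf {
    struct buf *next;
    /* ... payload ... */
};

struct core_pool {
    struct buf *head;
    unsigned    count;
} __attribute__((aligned(64)));     /* keep each core's pool on its own cache line */

static struct core_pool core_pool[MAX_CORES];

static struct buf     *global_head;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fast path: lock-free, core-local allocation. Slow path: global pool under
 * a lock, then dynamic allocation as a last resort. */
struct buf *buf_alloc(unsigned core)
{
    struct core_pool *cp = &core_pool[core];
    struct buf *b = cp->head;

    if (b) {                                    /* common case: no lock at all */
        cp->head = b->next;
        cp->count--;
        return b;
    }

    pthread_mutex_lock(&global_lock);           /* infrequent: take from global pool */
    b = global_head;
    if (b)
        global_head = b->next;
    pthread_mutex_unlock(&global_lock);

    return b ? b : malloc(sizeof(*b));          /* last resort: dynamic allocation */
}

void buf_free(unsigned core, struct buf *b)
{
    struct core_pool *cp = &core_pool[core];

    if (cp->count < PER_CORE_LIMIT) {           /* keep it local if under threshold */
        b->next = cp->head;
        cp->head = b;
        cp->count++;
        return;
    }

    pthread_mutex_lock(&global_lock);           /* overflow goes to the global pool */
    b->next = global_head;
    global_head = b;
    pthread_mutex_unlock(&global_lock);
}
```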
Statistics counters updates
Many packet processing applications are expected to update multiple statistics (counters and byte counts) at various stages of processing. Since multiple cores act on the same statistics counters, normal practice is to use atomic operations, which are expensive. For example, an atomic increment loads the value into the cache, increments it, and stores it back. Since the value must be loaded into the cache, it may evict existing data to make space - possibly data that will be needed again soon. Note that cores always need data in the cache for any manipulation, and loading data into the cache from memory can take a significant number of cycles (50 to 100), during which the core cannot do useful work. Hence it is important to avoid using the cache for statistics; at a minimum, one should avoid dirtying cache lines that other cores use.
Techniques:
Use per-core statistics counters as much as possible, and consolidate them when presenting to users. With per-core statistics, only that core's cache line gets dirtied; other cores never load this core's counters, so their caches are not disturbed (see the sketch after this list).
Many modern CPUs provide a statistics accelerator (posted writes to memory) with special instructions (increment/add on posted writes). Since statistics variables don't need to be read during packet processing, this is the best option for these counters; use this facility as much as possible.
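A minimal sketch of per-core counters, padding each core's block to a cache line so one core's updates never dirty another core's cache line; the names and core count are illustrative:

```c
#include <stdint.h>

#define MAX_CORES 16

/* One counter block per core, aligned/padded to a cache line so that
 * updates on one core never invalidate another core's cache line
 * (no false sharing, no atomics on the fast path). */
struct pkt_stats {
    uint64_t rx_pkts;
    uint64_t tx_pkts;
    uint64_t drops;
} __attribute__((aligned(64)));

static struct pkt_stats stats[MAX_CORES];

/* Fast path: plain increment of this core's own counter. */
static inline void stat_drop(unsigned core)
{
    stats[core].drops++;
}

/* Slow path (management/CLI): consolidate across cores only when displaying. */
uint64_t total_drops(void)
{
    uint64_t sum = 0;

    for (unsigned i = 0; i < MAX_CORES; i++)
        sum += stats[i].drops;
    return sum;
}
```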
Caches - Load
It is always best to ensure that cores don't stall waiting for data to be loaded into the cache. Performance can be improved if data needed in the near future is loaded proactively. Many processors provide a way to prefetch data; the prefetch runs in the background, the intention being that by the time the data is required it is already in the cache, avoiding stall cycles. It helps improve performance, but one should not be too greedy.
Prefetching data too early may not be useful, as it may be evicted by intermediate processing. Moreover, in the process, you may evict other important data from the cache.
When multiple modules act on a packet, one useful technique is for the module currently processing the packet to warm the cache with the next module's context block information, as in the sketch below.
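A minimal sketch using the GCC/Clang __builtin_prefetch() intrinsic to warm the cache for the next packet and its context block while the current packet is still being processed; the structures are illustrative:

```c
struct session;                     /* context block, defined elsewhere */

struct pkt {
    void           *data;           /* packet bytes                          */
    struct session *session;        /* filled in by the classification stage */
};

/* Issue prefetches for the next packet before finishing the current one.
 * The loads happen in the background, so by the time the next packet (and
 * the context block the next module needs) is touched, it is already in
 * cache. */
static inline void warm_next(struct pkt *next_pkt)
{
    if (!next_pkt)
        return;
    __builtin_prefetch(next_pkt->data, 0, 3);          /* will be read    */
    if (next_pkt->session)
        __builtin_prefetch(next_pkt->session, 1, 3);   /* will be written */
}
```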
Staggered data across multiple cache lines
Packet processing applications read from and write to context blocks. If the data needed is scattered across the CB, reading it may require multiple cache line loads, wasting many core cycles. It is best if all the needed data is loaded into the cache at once. Hence it is important to keep related variables in the context block together, and to ensure that the blocks are cache line aligned.
Techniques: use the aligned() attribute when declaring structures and variables, and keep related variables together. It is also good to keep variables of the same type together so that you don't waste memory on padding.
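A minimal sketch of a cache-line-aligned context block with the per-packet ("hot") fields grouped at the front and same-sized fields kept together to avoid padding; the field names are illustrative:

```c
#include <stdint.h>

struct session_cb {
    /* --- hot: touched for every packet, fits in one cache line --- */
    uint64_t last_active;           /* for the inactivity timer */
    uint64_t pkt_count;
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    uint8_t  state;
    uint16_t flags;

    /* --- cold: configuration / management-plane data --- */
    uint64_t created_at;
    uint32_t policy_id;
    char     name[32];
} __attribute__((aligned(64)));     /* block starts on a cache line boundary */
```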
Branch misprediction
Processors have logic to predict branches (if-then-else) and speculatively execute instructions. On a misprediction, not only is the speculatively executed work wasted, but the processor also has to fetch the correct instructions, which can cost around 20 cycles. It is important to keep mispredictions to a minimum.
Technique: use the 'likely()' and 'unlikely()' compiler hints to help the compiler generate code that is friendly to the processor's branch predictor.
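The usual GCC/Clang definitions and a small usage sketch; pkt_len(), drop(), process() and MIN_HDR_LEN are hypothetical placeholders:

```c
/* Typical definitions on GCC/Clang (the Linux kernel defines them the same way). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define MIN_HDR_LEN 20

struct pkt;
extern unsigned pkt_len(const struct pkt *p);   /* hypothetical helpers */
extern int drop(struct pkt *p);
extern int process(struct pkt *p);

int handle_packet(struct pkt *p)
{
    /* Rare error paths are hinted as unlikely, so the compiler keeps the
     * common case on the straight-line (fall-through) path. */
    if (unlikely(p == NULL))
        return -1;
    if (unlikely(pkt_len(p) < MIN_HDR_LEN))
        return drop(p);

    return process(p);
}
```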
Packet verification and field extraction overheads
CRC and checksum validations are expensive. Also, depending on the processor's endianness (little or big endian), network-to-host byte order conversions and vice versa can be expensive. It is best to take advantage of NIC and other hardware features to reduce these overheads. Many NICs can do checksum verification, incoming packet distribution, packet length verification, field extraction in some cases, GRO and TSO. Take advantage of hardware that can place part of the packet in shared cache; some hardware can also warm the cache with the session context information. Use these features as much as possible to accelerate the first packet processing module.
Function overheads
Using C functions is important for modularity and easier maintenance. But functions add overhead, as cycles are spent saving the caller's context and restoring it when the function returns. Avoid too many small function calls in the packet processing path by using compiler facilities such as inline functions.
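A small sketch of a static inline helper: the compiler expands it at the call site, so the call/return and register save/restore overhead disappears while the code stays modular:

```c
#include <stdint.h>

/* Extract the IPv4 total-length field (bytes 2-3 of the header, in network
 * byte order) with no function-call overhead on the per-packet path. */
static inline uint16_t ipv4_total_len(const uint8_t *ip_hdr)
{
    return (uint16_t)((ip_hdr[2] << 8) | ip_hdr[3]);
}
```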
Overcautious programmers
Some programmers are overcautious and initialize temporary variables when they are declared. Since it is not known whether a variable will actually be used on a given path, it is better not to initialize it at declaration; rather, initialize it just before its first use. Many compilers are good at warning when a variable is used without being initialized, so this technique can be adopted safely.
For/while loops, memcmp, memset, strcmp, strcpy etc...
In the packet processing path, one should avoid for/while loops and any copies. If you are using them, ask yourself multiple times whether they can be avoided.
Associating cores for packet queues
Dedicating cores to a set of packet queues (like RSS queues) is good from a cache utilization perspective: all packets of a given session are processed by the same core, so any cached context blocks get reused for subsequent packets. But at the system level one may not get the best performance, as packets cannot be picked up by other free cores. One should weigh tight association of packet queues to cores against no association at all. Our experience is that a good number of organizations do not like tight association of cores to the incoming packet queues. Hence, any packet processing application developer should consider both cases while developing software.
Packet reordering overheads
When packets belonging to a session get processed by various cores, there is a big possibility that the order of packets at the output differs from the order in which they entered the module. Reordering packets at the output requires adding a sequence number at input time and then ordering packets by that sequence number. We found various challenges with packet reordering: packets may be dropped during processing, which complicates sequence number matching; packets may be generated by the module itself and have no sequence number; and reordering itself is an expensive operation. On top of that, if a session is processed by multiple cores, mutexes are needed to protect the integrity of the session's state variables. As discussed before, any lock in an SMP system adds more overhead and cycles.
A technique we followed is to ensure that packets of a given session are processed by only one core at any time. That way, there is no need to take locks to protect the integrity of the state variables, and packet reordering is avoided because a session's packets are never processed by multiple cores in parallel.
In this technique, when a packet arrives at the module and its session context block has been identified, the packet is queued on the context block if that session is already being processed. If not, the session is marked 'active' and the packet is processed. After handing the processed packet to the next module (or queuing it), any packets pending on that context block's queue are processed. When no packets remain, the session is marked 'idle' again.
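A minimal sketch of this scheme, using a short pthread spinlock only around the 'active' flag and the pending queue (never around the actual packet processing); the structure layout and do_process() are illustrative:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct pkt {
    struct pkt *next;
    /* ... packet data ... */
};

struct session {
    pthread_spinlock_t qlock;       /* protects only 'active' and the queue   */
    bool        active;             /* true while some core owns this session */
    struct pkt *pend_head;
    struct pkt *pend_tail;
    /* ... session state: touched only by the core that owns 'active' ... */
};

extern void do_process(struct session *s, struct pkt *p);   /* module work */

void session_input(struct session *s, struct pkt *p)
{
    pthread_spin_lock(&s->qlock);
    if (s->active) {                        /* another core owns the session: */
        p->next = NULL;                     /* just queue the packet          */
        if (s->pend_tail)
            s->pend_tail->next = p;
        else
            s->pend_head = p;
        s->pend_tail = p;
        pthread_spin_unlock(&s->qlock);
        return;
    }
    s->active = true;                       /* we own the session now */
    pthread_spin_unlock(&s->qlock);

    for (;;) {
        do_process(s, p);                   /* no locks needed on session state */

        pthread_spin_lock(&s->qlock);
        p = s->pend_head;                   /* drain anything queued meanwhile  */
        if (p) {
            s->pend_head = p->next;
            if (!s->pend_head)
                s->pend_tail = NULL;
            pthread_spin_unlock(&s->qlock);
            continue;
        }
        s->active = false;                  /* nothing pending: release session */
        pthread_spin_unlock(&s->qlock);
        return;
    }
}
```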
Code maintenance considerations
Developers like to get the best performance by saving every CPU cycle, but that should not come at the cost of easy maintenance and troubleshooting. We made this mistake early in our development and quickly realized that bug fixes and enhancements done by someone other than the original developer took far too long. Moreover, quality started to go down over time. That, essentially, made us look for optimization opportunities that do not complicate maintainability. Some of the defects we used to see:
For double frees, memory leaks, underwrites and overwrites: it is best practice to have the memory management library check for these. To check for double frees, verify that the block being freed is not already on the free list. To detect memory leaks, allocate an additional memory block to store the stack trace of the caller allocating the context block, and keep these stack-trace blocks on their own list; remove the stack-trace block from the list when the context block is freed. One can then find memory leaks and the offending module by walking the list of stack-trace blocks. For underwrites and overwrites, we always allocated additional memory - a pre-overhead and a post-overhead area - and filled them with magic words (0xDEADBEEF). When an entry is freed, we check that the magic words are still intact; if not, an error is flagged for developers to fix. Note that these techniques add cycles, so they should be controlled by a configuration variable - turn it 'ON' in the field to troubleshoot these tough problems.
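A minimal sketch of the underwrite/overwrite guard; the leak-tracking (stack-trace list) and double-free checks are omitted for brevity, and in the real library all of this sat behind a configuration flag:

```c
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>

#define GUARD_MAGIC 0xDEADBEEFu
#define GUARD_WORDS 4               /* guard words before and after the block */

static size_t round_up4(size_t n) { return (n + 3) & ~(size_t)3; }

/* Surround every block with known magic words (pre- and post-overhead) and
 * verify them on free to catch underwrites and overwrites. */
void *dbg_alloc(size_t size)
{
    size_t payload = round_up4(size);
    size_t total   = payload + 2 * GUARD_WORDS * sizeof(uint32_t);
    uint32_t *base = malloc(total);
    uint32_t *post;

    if (!base)
        return NULL;
    post = (uint32_t *)((char *)base + total) - GUARD_WORDS;
    for (int i = 0; i < GUARD_WORDS; i++) {
        base[i] = GUARD_MAGIC;              /* pre-overhead  */
        post[i] = GUARD_MAGIC;              /* post-overhead */
    }
    return base + GUARD_WORDS;              /* pointer handed to the caller */
}

void dbg_free(void *p, size_t size)
{
    size_t    payload = round_up4(size);
    size_t    total   = payload + 2 * GUARD_WORDS * sizeof(uint32_t);
    uint32_t *base    = (uint32_t *)p - GUARD_WORDS;
    uint32_t *post    = (uint32_t *)((char *)base + total) - GUARD_WORDS;

    for (int i = 0; i < GUARD_WORDS; i++) {
        assert(base[i] == GUARD_MAGIC);     /* fires on an underwrite */
        assert(post[i] == GUARD_MAGIC);     /* fires on an overwrite  */
    }
    free(base);
}
```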
Challenges with pointer references:
There are two types of interactions where pointers are passed: timers and asynchronous accelerators. In both cases, session context pointers are passed. The timer module invokes a callback function (to notify inactivity or lifetime expiry) with a reference to the session context block that was supplied when the timer was started. Asynchronous accelerators likewise invoke a module-provided callback (to report the result of the operation) with the reference to the session context block passed at command submission. Since these callbacks are invoked later, one needs to ensure that the context pointers are still valid when the callback runs. Initially, we adopted reference counts: every time the pointer is handed out, the context block's reference count is incremented, and it is decremented when the callback is called. The memory library was built so that a free operation does not succeed while the reference count is non-zero. But we found this technique error prone.
Finally, we went with index-based references. Instead of passing pointer references, an index to the context block and a magic number are passed to the timers and asynchronous accelerators. With this, there is no need to maintain a reference count per context block. All contexts are not only arranged in a data structure for faster search, but also kept in an array; each array element has a pointer to the context block and a magic number. In the callback, if the magic number does not match, the entry is assumed to be no longer valid and the callback is ignored. With this simple technique, we avoided many maintenance challenges.
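A minimal sketch of the index-plus-magic-number scheme; the slot table size, structure layout and callback names are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_SESSIONS 65536

struct session;                     /* context block, defined elsewhere */

/* Hand out an (index, magic) pair instead of a raw pointer. The magic value
 * changes every time the slot is reused, so a stale timer or accelerator
 * callback is detected and ignored rather than dereferencing freed memory. */
struct session_slot {
    struct session *cb;             /* NULL when the slot is free             */
    uint32_t        magic;          /* bumped on every allocation of the slot */
};

static struct session_slot slots[MAX_SESSIONS];

/* Called from the timer / accelerator completion path. */
static struct session *resolve_ref(uint32_t index, uint32_t magic)
{
    if (index >= MAX_SESSIONS)
        return NULL;
    if (slots[index].cb == NULL || slots[index].magic != magic)
        return NULL;                /* session was freed (and maybe reused) */
    return slots[index].cb;
}

/* Example timer callback: ignore the event if the reference went stale. */
void inactivity_timeout(uint32_t index, uint32_t magic)
{
    struct session *s = resolve_ref(index, magic);

    if (!s)
        return;                     /* stale reference: nothing to do */
    /* ... tear down the session, free the CB, bump slots[index].magic ... */
}
```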
Packet processing friendly Hardware:
See this (from Slide 19 to 33) https://docplayer.net/61475510-Qoriq-ls2085-multicore-processors-with-high-performance-datapath-and-network-peripheral-interfaces.html
At Freescale, we were part of the team that created the AIOP (Advanced IO Processor), a packet processing hardware engine. All our learnings went into it. Some of the salient features of AIOP:
We developed many packet processing applications on it, as shown in the link above: OpenFlow, stateful inspection firewall, L3 forwarding, SLB, IPsec, ROHC, PDCP, RLC, GTP-U and many more...
Summary
With increasing link speeds and very high east-west traffic growth, packet processing performance is still important, even with the new high-end multi-core processors. The techniques listed here are, I believe, still valid.