Data Plane Acceleration to Scale VNFs Better?
Today enterprises enjoy terabit and even petabit data speeds, yet network bottlenecks still commonly persist, both in clouds and on premises. This is despite algorithmic and software innovations that help runtime code run at its full speed and potential. A typical cloud infrastructure serves a variety of network nodes, faces heavy ingress and egress dependencies, and sometimes has to cope with adverse security implications at endpoint gateways. Customers at the receiving end, meanwhile, worry constantly about QoS and SLA violations caused by degraded IP flows, particularly for services judged by MOS (Mean Opinion Score), such as voice, video and chat.
The biggest question on decision makers' minds today is how to scale VNFs (Virtual Network Functions) better within budget. This article looks at how service providers and carriers justify data plane acceleration in their specific use cases. Because different workloads require different processing strategies, carriers have been keen to achieve scale at targeted price points. It is well known that the requirements for latency, security, encryption/decryption, flow processing, application (VNF) processing, network QoS guarantees and network monitoring all differ from one workload to the next. A “one-size-fits-all” approach has never worked. To support the networking requirements of today's 5G networks, network slicing has already been implemented in virtualized environments. But this is only a small part of the puzzle; the rest requires looking at solutions from the different perspectives outlined below.
Enough public data has accumulated over the last five years to accept that data plane acceleration is an established way to scale VNFs better, but the caveats are many. The effort requires a lot of planning, and many things have to be done right. Every high-speed video or multicast workload is demanding: it needs careful planning and perhaps even a few dry runs at the start. Such workloads can span one or more of the application, web and data tiers, and they add complexity as the number of instances and the density of VM clusters increase. The problem is aggravated further when the workload mix includes application-aware, security-sensitive, mission-critical payloads that are subject to compliance requirements. So what does a customer expect from the IT teams or cloud service providers who deploy and sustain such a mix of complex workloads in real time? Customers want deployments to be agnostic to the cloud topology, the hypervisor, the accelerators and so on. More than ever, though, they are focused on TCO, ROI, SLA compliance, PCRF (Policy and Charging Rules Function), provisioning and billing, as well as fast online data recovery, disaster recovery and protection against data breaches.
The Bottlenecks: Data-plane performance challenges in CloudNFV stem from key bottlenecks that stand in the way of making cloud NFV deployments cost-effective. Recent proofs of concept (PoCs) by many telecom customers, in use cases such as 5G, LTE, NB-IoT and IMS, have revealed significant scaling and business-resiliency challenges, along with the nagging problem of cost-effective dynamic provisioning. Some 5G trials using VNFs also ran into serious cost overruns. It is easy to forget the rule of thumb that packet throughput is limited by the slowest processing step in the path.
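To make the slowest-step rule concrete, here is a minimal Python sketch. It models a packet path as a chain of stages, each with a maximum rate in millions of packets per second (Mpps). The stage names and rates are hypothetical values chosen for illustration, not measurements. End-to-end throughput is simply the minimum stage rate, so accelerating any stage other than the current bottleneck buys nothing.

```python
# Minimal sketch: end-to-end throughput of a serial packet pipeline.
# Stage names and rates (Mpps) are hypothetical, for illustration only.


def pipeline_throughput(stages: dict[str, float]) -> tuple[str, float]:
    """Return (bottleneck stage, end-to-end rate) for a serial pipeline.

    Every packet passes through every stage in turn, so the chain can
    never forward packets faster than its slowest stage.
    """
    bottleneck = min(stages, key=stages.get)
    return bottleneck, stages[bottleneck]


if __name__ == "__main__":
    path = {
        "nic_rx": 30.0,          # hypothetical NIC receive rate
        "virtual_switch": 4.0,   # hypothetical software vSwitch rate
        "host_to_guest": 6.0,    # hypothetical host/guest channel rate
        "vnf": 8.0,              # hypothetical VNF processing rate
    }
    print(pipeline_throughput(path))  # ('virtual_switch', 4.0)

    # Speeding up the VNF alone does not help: the vSwitch is still the limit.
    path["vnf"] = 16.0
    print(pipeline_throughput(path))  # ('virtual_switch', 4.0)

    # Accelerating the vSwitch moves the bottleneck to the next-slowest stage.
    path["virtual_switch"] = 20.0
    print(pipeline_throughput(path))  # ('host_to_guest', 6.0)
```

This is why remedies a) to c) below have to be considered together: accelerating the vSwitch, the host-to-guest path or the VNF raises the ceiling only while that stage is the current bottleneck.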
Data-plane bottlenecks
1. The virtual switch running on the server platform: it must deliver sustained, aggregate high-bandwidth network traffic to the VNFs (Virtual Network Functions). Standard virtual switches such as Open vSwitch (OVS) do not deliver adequate performance or scalability.
2. The performance of the VNFs themselves: within an NFV deployment, VNFs such as a virtual evolved packet core (vEPC) or a virtual broadband remote access server (vBRAS) must deliver cost-performance comparable to that of equivalent physical implementations. Otherwise NFV deployments will not be cost-effective, and there will be no ROI justification for a transition to NFV.
The following steps cure some of these problems, but more needs to be done (see Additional Remedies below):
a) The switching performance of the virtual switch itself has to be accelerated.
b) The bandwidth of the communication between the host OS running the virtual switch and the guest OSs running the VNFs has to be improved.
c) The data plane performance of the VNFs within the VMs has to be improved, because standard OS networking stacks offer poor performance and limited scalability.
d) Better automation techniques must be used for container clusters and VM clusters.
Additional Remedies
1. Acceleration through SmartNICs or FPGA-based solutions: place each workload on the right processor type at the right location in the network. The functionality required by 5G network slicing is, by definition, network-centric, and these networking and security workloads are poor candidates for general-purpose architectures. Offloading tasks such as virtual switching, flow classification, filtering, intelligent load balancing, QoS and encryption/decryption is imperative, and these offloads should be considered holistically.
2. VirtIO: carriers and service providers have used VirtIO very successfully. As a standard paravirtualized I/O interface, it can be transparent to the application while providing a common management and orchestration layer over the network fabric for network slicing.
3. Virtual machines (VMs) can use accelerated packet I/O and guaranteed traffic isolation via hardware while maintaining standards-based vSwitch functionality.
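As a rough illustration of the per-packet work listed under remedy 1, the Python sketch below classifies packets by their 5-tuple, applies a filter, and load-balances flows across receive queues by hashing the flow key. This is a software model of the kind of logic a SmartNIC or FPGA would run in hardware, not any vendor's implementation. The filter rule (`BLOCKED_DST_PORTS`) and the helper names are hypothetical, and `zlib.crc32` stands in for the Toeplitz hash that NIC receive-side scaling (RSS) typically uses.

```python
# Minimal sketch of flow classification, filtering and flow-hash load
# balancing: per-packet work a SmartNIC or FPGA can offload from the CPU.
# zlib.crc32 is a stand-in for the Toeplitz hash commonly used by NIC RSS.
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowKey:
    """Classic 5-tuple identifying a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP


# Hypothetical filter rule: drop traffic to these destination ports.
BLOCKED_DST_PORTS = {23}


def classify(key: FlowKey, num_queues: int) -> Optional[int]:
    """Return the receive queue (core) for a packet, or None to drop it."""
    if key.dst_port in BLOCKED_DST_PORTS:
        return None  # filtered out before it ever reaches a VNF
    raw = f"{key.src_ip}|{key.dst_ip}|{key.src_port}|{key.dst_port}|{key.proto}"
    # Same 5-tuple -> same hash -> same queue, so packets of one flow stay
    # in order on one core while different flows spread across cores.
    return zlib.crc32(raw.encode()) % num_queues


if __name__ == "__main__":
    web = FlowKey("10.0.0.1", "10.0.0.2", 40000, 443, 6)
    telnet = FlowKey("10.0.0.1", "10.0.0.2", 40001, 23, 6)
    print(classify(web, num_queues=4))     # some queue in 0..3, stable per flow
    print(classify(telnet, num_queues=4))  # None (filtered)
```

Hashing the whole flow key, rather than spraying packets round-robin, keeps each flow's packets in order while still spreading many flows across cores. Doing this on the NIC frees the host CPU for the VNFs themselves.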
Recent Industry Initiatives: The industry has developed an open API for VNF acceleration, along with VNF testing and API compliance work (see https://www.slideshare.net/Open-NFP/data-plane-and-vnf-acceleration-mini-summit).