New reflections on telecom clouds
The reflected clouds in these jars could symbolize "on prem" (photo courtesy my mother-in-law Willeke Straus)


Now that we telecom engineers have built a first generation of on-prem data centers running 3GPP Network Functions (NFs), in other words Telco Clouds, it is time to reflect on the lessons learned and, of course, to sketch a path forward that avoids the already known pitfalls.

On the plus side, we demonstrated that we could strike the right balance between code running on generic x86 processors and a Linux OS, and offloading telecom-specific tasks to Intel QAT, DPDK, smarter Network Interface Cards, Hardware Security Modules, leaf switches, 100GE user plane appliances and so on.
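For the curious, here is a quick, hypothetical sketch of the kind of host check that sits behind that balance: it reads the standard Linux sysfs paths to confirm that hugepages (for DPDK) and SR-IOV virtual functions (for the smarter NICs) are actually available. The interface name is an assumption, not a recommendation.

```python
#!/usr/bin/env python3
"""Quick sysfs check for the host features the offload path depends on
(hugepages for DPDK, SR-IOV virtual functions on the NIC).
The interface name is an assumption; the paths are the standard Linux
sysfs locations."""
import glob
import os

IFACE = "ens1f0"   # assumed name of the NIC carrying the user plane


def hugepages() -> dict:
    """Return the number of reserved hugepages per page size."""
    sizes = {}
    for d in glob.glob("/sys/kernel/mm/hugepages/hugepages-*"):
        size = os.path.basename(d).removeprefix("hugepages-")
        with open(f"{d}/nr_hugepages") as f:
            sizes[size] = int(f.read())
    return sizes


def sriov_vfs(iface: str) -> tuple[int, int]:
    """Return (configured VFs, maximum VFs) for the given interface."""
    base = f"/sys/class/net/{iface}/device"

    def read(name: str) -> int:
        try:
            with open(f"{base}/{name}") as f:
                return int(f.read())
        except FileNotFoundError:
            return 0  # not an SR-IOV capable physical function

    return read("sriov_numvfs"), read("sriov_totalvfs")


if __name__ == "__main__":
    print("hugepages reserved per size:", hugepages())
    num, total = sriov_vfs(IFACE)
    print(f"{IFACE}: {num}/{total} SR-IOV VFs configured")
```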

In on-prem deployment projects we have seen that most of the effort goes into installing the Infrastructure as a Service (#IaaS) and K8s Container-as-a-Service (#CaaS) platforms, as these tasks reveal deficiencies in the underlying hardware, e.g. packet loss in a NIC driver or insufficient write IOPS to persistent storage. The pre-qualification of various hardware models for IaaS/CaaS is a daunting task, as telecom applications will become increasingly demanding over time and older hardware generations will need to coexist with the very latest: we cannot assume homogeneity in the long run.
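To make that pre-qualification repeatable we script small per-node sanity checks. Below is a minimal, illustrative sketch of a synchronous write-IOPS check; the scratch directory and the acceptance threshold are assumptions for illustration, not vendor requirements.

```python
#!/usr/bin/env python3
"""Rough write-IOPS sanity check for a candidate worker node.
Illustrative sketch: the mount point and the pass/fail threshold are
assumptions, not values from any vendor specification."""
import os
import time

TARGET_DIR = "/var/lib/prequal"   # assumed scratch dir on the storage under test
BLOCK_SIZE = 4096                 # 4 KiB writes, a common worst case (etcd-style I/O)
NUM_WRITES = 2000
MIN_IOPS = 500                    # illustrative acceptance threshold


def measure_sync_write_iops() -> float:
    os.makedirs(TARGET_DIR, exist_ok=True)
    path = os.path.join(TARGET_DIR, "prequal.tmp")
    block = os.urandom(BLOCK_SIZE)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.monotonic()
    try:
        for _ in range(NUM_WRITES):
            os.write(fd, block)
            os.fsync(fd)          # force each write to stable storage
    finally:
        os.close(fd)
        os.remove(path)
    elapsed = time.monotonic() - start
    return NUM_WRITES / elapsed


if __name__ == "__main__":
    iops = measure_sync_write_iops()
    verdict = "PASS" if iops >= MIN_IOPS else "FAIL"
    print(f"synchronous 4K write IOPS: {iops:.0f} ({verdict}, threshold {MIN_IOPS})")
```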

Climbing up the stack, another troubling choice is whether to

  • upload the NF artifacts (software images, Helm charts, customized values) from laptops using the CaaS GUI, or to
  • develop Network Service Descriptors (TOSCA NSDs) for a Life Cycle Manager (LCM) that fetches those artifacts from a Git or Harbor repository at the Customer, fed by a CI/CD pipeline, before submitting them to the CaaS API (see the sketch below): we could consider this the "best practice" in the long term, but it slows down proof-of-concepts, rapid lab setups, friendly user trials etc.
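As a rough illustration of that second flow, here is a minimal Python sketch that pulls a chart from a Harbor OCI registry and submits it to the CaaS API via the helm CLI. The registry, chart, release and values-file names are placeholders; a real LCM would derive them from the TOSCA NSD rather than from hard-coded constants.

```python
#!/usr/bin/env python3
"""Sketch of an LCM step that pulls NF artifacts from a Harbor OCI
registry and submits them to the CaaS (Kubernetes) API via Helm.
All names below are placeholders for illustration only."""
import subprocess

REGISTRY = "oci://harbor.example.com/nf-charts"   # assumed Harbor project
CHART = "amf"                                     # assumed NF chart name
VERSION = "1.2.3"                                 # assumed chart version
RELEASE = "amf-lab1"
NAMESPACE = "nf-amf"
VALUES_FILE = "amf-lab1-values.yaml"              # site-specific values from Git


def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Fetch the chart from the repository fed by the CI/CD pipeline...
    run(["helm", "pull", f"{REGISTRY}/{CHART}", "--version", VERSION])
    # ...then submit it to the CaaS API with the customer-specific values.
    run([
        "helm", "upgrade", "--install", RELEASE, f"{CHART}-{VERSION}.tgz",
        "--namespace", NAMESPACE, "--create-namespace",
        "-f", VALUES_FILE,
    ])
```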

We learned that Helm charts can be either very complex or fairly simple, depending on whether those values (the variables that make each NF instance unique) are

  • used to set object values in the Helm charts (= locally "Interpreted"), or just
  • uploaded ("patched") as K8s Custom Resource (CR) values into the CaaS platform, where a NF-specific, vendor provided Controller reacts to those modified CR values. A K8s Operator is best defined as an application-specific Controller reacting to CR changes, so we could call this model "Operated" and it has the benefit that the automation logic (Ansible scripts etc. within the Controller) comes from the NF designers themselves, rather than from deployment engineers having to deploy dozens of NFs.

Finally, when the so-called "day 0 / day 1" configuration is not clearly delimited by Low Level Design (LLD) documents, the line is blurry between the

  • vendor-prepared values in the LCM ("day 0"), that are strictly necessary to boot the NF but could include slightly more (e.g. a license, a PKI certificate, etc)
  • vendor-prepared config in the Configuration Manager (CM), i.e. a "Golden Config" which the NF obtains from its Element Manager (EM) as soon as it boots - essential config to deliver the network service (basic EPS/PDU sessions) - we could agree to call this "day 1"
  • vendor-applied NF config through the EM GUI or NF CLI (possibly cross-launched from the EM GUI) - we tend to call this "day 2 config" and it provides all you need to deliver advanced services in production
  • operator-modified NF config once the project goes live into production : "day 3"?

LLDs are a great practice but tend to get out of sync with reality when teams are working late hours to set up a (pre-)production network on time.

Before any NF LCM can occur, the K8s Cluster must be prepared with NF-specific host tunings that optimize the performance and security of the NF. When these tunings are not coordinated between the NF product houses (SW development teams), and for third-party NFs, they tend to rip our beautiful uniform IaaS apart into IaaS Host Aggregates (K8s Clusters, aka "fragments") that are each only suitable for one type of NF, de facto re-introducing the very hardware dependency we were combating in the first place...
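As an illustration of how that fragmentation materializes in practice, the sketch below dedicates a tuned node to a single NF type with a label and a NoSchedule taint. The label and taint keys are assumptions, not a standard.

```python
#!/usr/bin/env python3
"""Sketch of how NF-specific host tunings end up fragmenting the cloud:
a node tuned for one NF gets a dedicated label plus a taint, so only that
NF's pods land there. Label and taint keys are illustrative assumptions."""
from kubernetes import client, config

NODE = "worker-07"    # assumed node name
NF_TYPE = "upf"       # the only NF this tuning profile suits


def dedicate_node_to_nf() -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    body = {
        "metadata": {"labels": {"telco.example.com/nf-type": NF_TYPE}},
        "spec": {
            # NoSchedule taint: pods without a matching toleration
            # (i.e. anything other than this NF) are kept off the node.
            "taints": [{
                "key": "telco.example.com/dedicated",
                "value": NF_TYPE,
                "effect": "NoSchedule",
            }]
        },
    }
    v1.patch_node(NODE, body)


if __name__ == "__main__":
    dedicate_node_to_nf()
    print(f"{NODE} is now a '{NF_TYPE}'-only fragment of the cluster.")
```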

If we succeed in preserving this uniformity, we preserve the K8s Scheduler's ability to place pods anywhere in the data center, based on observed load metrics. This will be extremely useful for rack expansions, energy savings at night and optimized resource utilization. Overlay networks (for IPVLAN and SR-IOV VLANs, BGP, BFD, etc.) will have to be either pre-configured or dynamically reconfigured as a pod moves, a process we have come to call Adaptive Cloud Networking (#ACN) and an essential stepping stone towards CI/CD and FinOps.
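Here is a toy sketch of the ACN idea, under the assumption that NF pods carry a common label: watch where the scheduler places them and trigger overlay reconfiguration for the node that received the pod. The label selector and the reconfigure_overlay() hook are placeholders; a real implementation would talk to the fabric controller or the leaf switches.

```python
#!/usr/bin/env python3
"""Toy sketch of Adaptive Cloud Networking (#ACN): watch where the
scheduler places NF pods and trigger overlay (VLAN/BGP/BFD) reconfiguration
for the node that received each pod. The label and the hook are placeholders."""
from kubernetes import client, config, watch

LABEL_SELECTOR = "app.kubernetes.io/part-of=nf"   # assumed label on NF pods


def reconfigure_overlay(node: str, pod: str) -> None:
    # Placeholder: here we would push the IPVLAN/SR-IOV VLANs, BGP and BFD
    # sessions needed by this pod towards the leaf switches serving 'node'.
    print(f"[ACN] ensuring overlay networks for pod {pod} on node {node}")


def main() -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    w = watch.Watch()
    for event in w.stream(v1.list_pod_for_all_namespaces, label_selector=LABEL_SELECTOR):
        pod = event["object"]
        # Once the pod is bound to a node, (re)configure the overlay there.
        if event["type"] in ("ADDED", "MODIFIED") and pod.spec.node_name:
            reconfigure_overlay(pod.spec.node_name, pod.metadata.name)


if __name__ == "__main__":
    main()
```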

Under the impact of increasing hardware ordering delays, the on-prem telco cloud model is about to evolve in 3 directions, 3 Deployment Models:

  • Software-as-a-Service (#SaaS), in which the NF vendor selects and provides CaaS/PaaS PoPs around the world under an OPExified price plan and performs the NF LCM, but not (!) the NF configuration and EM in general, which are still performed by the Customer. NF vendors need to avoid cloud costs that are directly linear with busy-hour network traffic; a possible solution could be Equinix Metal plus Anthos, with ACN of course!
  • hyperscaler-provided on-prem hardware (AWS Outposts, GDC Edge), on which the CaaS/PaaS components are installed from the public cloud (AWS VPC and GCP, respectively) and usage is reported back to the public cloud (events, counters); it is unclear how these hyperscalers plan to do ACN beyond their very limited number of servers (GDC: max 6 servers!! we got the kit in our labs last month)
  • System Integrators (SIs) providing pre-qualified on-prem combinations of popular K8s CaaS and PaaS distributions (RHOCP, TAP) with a large catalogue of hardware; an Advanced Cluster Manager should allow central installation and upgrades; ACN is not on the SIs' radar yet (I guess they assume there will forever be too many fragments)

Only the latter 2 models are compatible with traditional CAPEX-mode software pricing, where the Customer purchases a perpetual (or at least annual) SW license, plus an annual subscription fee for updates/upgrades.

All 3 models are container-on-bare-metal deployments (aka "Cloud Native B", CN-B), which are under the scrutiny of national Regulators, who are rightfully concerned about security, e.g. container breakouts from privileged pods (which we combat with dockremap, i.e. user namespace remapping) or memory exhaustion attacks (which we mitigate with Linux cgroup v2 memory limits). We have new global labs to demonstrate, negotiate and convince the Regulators, and last month we were selected to lead an EU Horizon project in this space.
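As a small illustration of the second mitigation, the sketch below can be run inside a container to verify that a cgroup v2 memory limit is actually in place. It assumes the common layout where the unified hierarchy is visible at /sys/fs/cgroup from within the pod.

```python
#!/usr/bin/env python3
"""Minimal check, run inside a container, that a cgroup v2 memory limit is
actually enforced (our mitigation against memory-exhaustion attacks).
Assumes the unified cgroup hierarchy is mounted at /sys/fs/cgroup."""
CGROUP_DIR = "/sys/fs/cgroup"


def read(name: str) -> str:
    with open(f"{CGROUP_DIR}/{name}") as f:
        return f.read().strip()


if __name__ == "__main__":
    limit = read("memory.max")        # the literal string "max" means no limit
    current = read("memory.current")  # current memory usage in bytes
    if limit == "max":
        print("WARNING: no memory limit set on this container")
    else:
        pct = 100 * int(current) / int(limit)
        print(f"memory usage: {int(current)} / {int(limit)} bytes ({pct:.1f}% of limit)")
```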

None of the 3 models relieves the NF vendor from having to pre-qualify the dependencies, performance and security of the PaaS/CaaS/HW combination, although in the third model it happens "just once" and economies of scale result from it.

None of the 3 models relieves the Operator (CSP, Enterprise) from managing the solution in terms of Fault, Performance, (day 2) Configuration, Subscriber provisioning, KPI reporting, etc.

Public Clouds (AWS Regions, Local Zones, GDC Hosted) could one day compete with these on-prem models, although we believe they will initially be complementary. The essential difference is that the pricing of compute/storage becomes linear with network traffic. We have heard interest from our CSPs and nationwide Enterprises (Transport, Utilities, Public Safety) in trying these out by building a Disaster Recovery site, as pricing appears to be prohibitive for 365-day operation.

24x7 Managed Services are available on top of any business model, where the NF vendor

  • performs NF LCM (upgrades, expansions, FinOps)
  • designs and maintains all NF configurations, reporting to the CSP's product managers
  • troubleshoots end user cases
  • reports on end-to-end KPIs and alarms for the entire solution, for example per Enterprise or per Slice
  • has a 24x7 Operations team that tends to be dedicated to each CSP/MNO

We could agree to call this model #NFaaS, Network-Function-as-a-Service. It will be quite essential to avoid the looming confusion with SaaS, on-prem deployment models or public cloud. NFaaS is compatible with (orthogonal to) any deployment model and any NF vendor.

Finally, the NF vendor may add subscriber provisioning, charging, peering, SIP trunking, interconnection, roaming, SIM supply and other services (although then we would probably have to call it #MNE / #MVNE, Mobile Network Enabler / Mobile Virtual Network Enabler, rather than NFaaS or SaaS).

Voilà, I hope you found this overview of models & challenges useful. In this complex world we will all need to agree on terminology; nothing's cast in stone yet, so feel free to agree/disagree/expand below.

Soon we will be launching a moderated discussion channel on these topics, with polls to gauge market demand for the various models and to discuss the security essentials; so stay tuned, we will soon be able to interact much more on this.

Jeroen van Bemmel

Unlocking Potential Through Technology, Innovation, and Creative Collaboration

2y

As Danielle Royston likes to point out, the public cloud deployment model isn't about porting traditional on-prem VNFs - then you hit the economic boundaries, as you state. Instead, consider fully public cloud native functions built as serverless entities that get triggered only when needed, using native databases/storage features, etc. It requires breaking down the software into more fine-grained entities than VMs or containers. If you do it like that, the economics can make sense.

Ahmed Zaidi

5G Lead Solution Architect. "The future was made by those who could take a leap of faith."

2y

Thanks Thierry Van de Velde for these very interesting notes. Indeed, the CN-B model will be the one each CSP should focus on despite all the security challenges. Regarding NFaaS, the NF vendor should also focus on preconfiguring the overlay networks and be flexible enough to use MPLS inter-AS options A and B. Thank you Thierry Van de Velde
