CPU + IPU: Why Multi-Cluster Orchestration Becomes Super Important

There has been quite a bit of buzz about DPUs (Data Processing Units) over the last few months. Personally, I don't like calling them DPUs: every processing unit processes data, so the name says nothing special about them. Since the intention of DPUs is mainly to offload infrastructure processing away from CPUs, I prefer the term "IPU" (Infrastructure Processing Unit), which Intel uses.

IPUs are expected to do much more than SmartNICs do. With SmartNICs, some processing is offloaded from the CPU to the NIC; taking networking as an example, one can offload K8s root network namespace processing, or even the entire TCP/IP stack. Still, both tenant applications and infrastructure-provider applications continue to run on the host. With an IPU, entire infrastructure applications run on the IPU, leaving the CPU for tenant applications.

I have written three posts on SmartNICs for cloud native networking before. You can find them here (https://www.dhirubhai.net/pulse/cloud-native-networking-offloads-edge-computing-why-addepalli), here (https://www.dhirubhai.net/pulse/cloud-native-networking-offloads-edge-computing-why-addepalli-1c) and here (https://www.dhirubhai.net/pulse/cloud-native-networking-offloads-edge-computing-why-addepalli-2c).

This post focuses only on CPU+IPU in the context of cloud native and Kubernetes. IPUs might also be used as SmartNICs in addition to their IPU functionality, but this post covers only the IPU aspect.

Before going further, it is important to understand the reasons behind IPUs. These are a few things I gathered from AWS Nitro, VMware's Project Monterey and a few industry articles; what follows is my interpretation of that literature.

What are infrastructure applications in a cloud native/K8s environment? And why would one want to move these infrastructure applications to the IPU instead of running them on the main CPU? These are two questions I am sure many of you have.

In cloud native environments such as the Edge, infrastructure applications cover almost everything other than compute: storage, security, networking, service mesh and connectivity.

  • Hyper-converged storage - Local storage across multiple compute nodes is treated as one big logical store that provides persistent volumes for tenant applications. I guess the intention is to attach the disks to the IPU instead of the host CPU and let the IPU take care of all storage-related processing; essentially, the IPU becomes the storage processor.
  • PKCS#11-based HSM - Protects the certificate private keys of tenant applications, making an HSM instance available to the requesting tenant POD.
  • Secrets vault - Protects the secrets/passwords of tenant applications, making a vault instance available to the requesting tenant POD (see the sketch after this list).
  • Virtual switching & TCP/IP networking - Takes care of the networking needs of tenant applications.
  • Accelerators - Help speed up tenant applications by providing virtual accelerator instances to the requesting tenant PODs.
  • Service mesh data plane - Envoy proxy functionality, making a proxy instance available to the requesting tenant PODs.
  • Connectivity - Private 5G, public 5G UPF, 5G RAN and SD-WAN CNFs to enable local connectivity and WAN connectivity.
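
From a tenant's point of view, consuming one of these services should simply look like talking to a network endpoint that happens to be backed by the IPU. The snippet below is a minimal sketch of that idea for the secrets vault case, assuming a HashiCorp-Vault-compatible API; the endpoint URL, token file location and secret path are hypothetical names chosen for illustration, not anything defined by an IPU vendor or by K8s.

```python
# Hypothetical sketch: a tenant POD reading a secret from a vault instance
# that, in this architecture, would be hosted on the IPU. The URL, token
# file and secret path below are made-up examples.
import hvac  # HashiCorp Vault client library


def read_tenant_secret(path: str) -> dict:
    # Token assumed to be injected into the tenant POD by the platform.
    with open("/var/run/secrets/vault-token") as f:
        token = f.read().strip()

    # The vault endpoint exposed by the infrastructure (IPU) side.
    client = hvac.Client(url="http://ipu-vault.infra.svc:8200", token=token)

    # Read a KV v2 secret and return just the key/value payload.
    resp = client.secrets.kv.v2.read_secret_version(path=path)
    return resp["data"]["data"]


if __name__ == "__main__":
    print(read_tenant_secret("tenant-app/db-credentials"))
```

The point of the sketch is that nothing in the tenant code needs to know whether the vault runs on the host or on the IPU; only the endpoint behind the service name changes.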

[Figure: pictorial representation of the infrastructure applications listed above]


What is the motivation for this? Though I am not yet convinced by some of these motivations, let me list them here anyway.

Security isolation: Since infrastructure applications are not mixed with tenant applications on the same processor and operating system, there is better isolation and higher confidence that infrastructure applications are safe.

Performance isolation: With IPUs taking care of the infrastructure components, there is good performance isolation, which ensures that tenant workloads continue to get their promised SLAs even if the infrastructure components are overloaded.

Control & management: Gives infrastructure providers better control over upgrades and life cycle management, with minimal interference with tenant workloads.

Composability: Disaggregating infrastructure from tenant workloads makes it possible to compose accelerator, HSM and vault instances from other nodes when the local IPU does not have free resources.

IPU/DPU systems are going to be complex. They are like servers within servers: they run a normal OS such as Linux, infrastructure applications can come up at any time, and hence they require the same kind of resource management as the host. The thinking, therefore, is that IPUs/DPUs run their own K8s cluster. So far, an Edge has had one K8s cluster for both infrastructure and tenant workloads; now an Edge will have two clusters, one for tenant applications and another for infrastructure applications. Multi-cluster orchestrators become even more important as the number of clusters grows. And since some coordination is needed at times between the workloads that expose the infrastructure and the tenant workloads that consume it, multi-cluster orchestration becomes super important.
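
Because an edge now hosts a pair of clusters, even simple tooling has to become context-aware. Below is a minimal sketch using the Kubernetes Python client, assuming the edge's kubeconfig has two contexts named "edge-tenant" and "edge-infra" (hypothetical names, one per cluster); it just lists pods in both clusters to show that any orchestrator or operator workflow now has to reason about two clusters per edge rather than one.

```python
# Minimal sketch: enumerate pods in the tenant cluster and in the IPU
# (infrastructure) cluster of the same edge. Context names are hypothetical.
from kubernetes import client, config


def list_pods(context: str, namespace: str = "default") -> list[str]:
    # Build an API client for the named kubeconfig context.
    api_client = config.new_client_from_config(context=context)
    v1 = client.CoreV1Api(api_client=api_client)
    return [p.metadata.name for p in v1.list_namespaced_pod(namespace).items]


if __name__ == "__main__":
    for ctx in ("edge-tenant", "edge-infra"):
        print(f"{ctx}: {list_pods(ctx)}")
```

A real multi-cluster orchestrator such as EMCO would of course do far more (placement, dependency ordering between infrastructure and tenant workloads), but this pairing of clusters per edge is exactly what makes that coordination necessary.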


How does networking work? Both tenant applications and infrastructure applications may need to share the same POD subnet and cluster IP subnet so that they can talk to each other. Because of this, the belief is that the entire networking stack needs to be part of the IPU, with transparent access to networking provided at the socket level for tenant applications. In this case, we believe the following component changes may be required at the tenant K8s level: a socket proxy (so that tenant applications work as-is) and a new container runtime (to create the network namespace on the IPU for each tenant POD). CNIs, kube-proxy and other networking-related infrastructure elements may be brought up on the IPU directly (possibly via the IPU's own K8s).
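
To make the socket-level idea a bit more concrete, here is a deliberately simplified sketch of a socket proxy: a local listener that relays bytes to a forwarding endpoint which, in this architecture, would live on the IPU. This is only an illustration of the "tenant applications work as-is" goal; the listening port and the IPU-side address are made-up values, and a real implementation would more likely intercept traffic at the syscall/library level rather than run a user-space relay.

```python
# Illustrative socket proxy sketch: accept local connections and relay the
# byte stream to a (hypothetical) forwarder on the IPU side.
import socket
import threading

LISTEN_ADDR = ("127.0.0.1", 8080)      # where the tenant app connects (example)
IPU_ENDPOINT = ("169.254.10.1", 9000)  # hypothetical IPU-side forwarder


def pump(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes in one direction until the sending side closes.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()


def handle(conn: socket.socket) -> None:
    # For each tenant connection, open a matching connection towards the IPU
    # and pump bytes in both directions.
    upstream = socket.create_connection(IPU_ENDPOINT)
    threading.Thread(target=pump, args=(conn, upstream), daemon=True).start()
    pump(upstream, conn)


def main() -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(LISTEN_ADDR)
    srv.listen()
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()


if __name__ == "__main__":
    main()
```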

Since the IPU is expected to expose accelerators, it needs to run some management-related workloads, and hence it needs its own K8s and networking for the life cycle management of these management and housekeeping workloads. Since the Ethernet ports may be shared between the host and IPU K8s clusters, there needs to be a way to switch packets between the two. The following picture shows one architectural choice that depicts the above.

[Figure: one architectural choice for the above]

In summary, IPUs for cloud native deployments are at a very nascent stage in the industry. It is not yet clear whether some of the motivations are compelling enough to justify this complexity. Some say that AWS Nitro is an IPU and is successful, and hence believe that cloud native deployments (Edge providers) will also require this architecture. That remains to be seen. In the EMCO community, we have just started to talk about it, and these are initial thoughts.

A nice and succinct description of the need for DPUs (IPUs) from Ihab Tarazi is here: https://www.delltechnologies.com/en-us/blog/entering-the-next-frontier-smartnic-data-processing-unit/. To me, the new point from this article is this: "For locations where physical space is a premium, like network edge locations and cell towers, DPUs can be installed within existing servers to deploy a variety of needed physical and virtual functions again without requiring additional hardware."

Comment from Viswa K. (3 years ago):
Srini, is the IPU architecturally different from a CPU? Or is it something like redundant CPU-hosted servers dedicated to infra needs vs. dedicated guest needs (tenants)?
