Embracing Application-Centric Infrastructure in the Cloud 2

AWS CDK for EKS: Falling Short in Real-World, Multi-Account Kubernetes Deployments

AWS Cloud Development Kit (CDK) aims to simplify cloud infrastructure provisioning using familiar programming languages. While its EKS module promises to streamline Kubernetes cluster creation and management, a closer look reveals significant shortcomings, especially when considering practical, multi-account EKS deployments. This article will delve into these limitations, arguing that AWS CDK's current EKS implementation, particularly the Cluster.addManifest function, is not truly useful for organizations adopting a shared, multi-account EKS strategy.

The Illusion of Simplicity: Cluster.addManifest and its Account Boundaries

The Cluster.addManifest(id: string, ...manifest: Record<string, any>[]): KubernetesManifest function in CDK appears to offer a straightforward way to deploy Kubernetes manifests to an EKS cluster. However, this simplicity is deceptive when considering real-world scenarios where EKS clusters are designed to be shared across multiple AWS accounts.

In practice, a central EKS cluster is often shared by various teams or applications residing in separate AWS accounts. This multi-account approach is crucial for security, isolation, and cost management. However, Cluster.addManifest operates under the implicit assumption of a single account and region deployment.
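To ground the discussion, here is the single-account usage the API is built for. This is a minimal sketch, assuming aws-cdk-lib v2, the @aws-cdk/lambda-layer-kubectl-v29 package, and an illustrative Kubernetes version:

```ts
import { App, Stack, StackProps } from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';
import { KubectlV29Layer } from '@aws-cdk/lambda-layer-kubectl-v29';
import { Construct } from 'constructs';

class SameAccountEksStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Cluster and manifest live in the same CDK app, and therefore in the
    // same account/region: exactly the case addManifest is designed for.
    const cluster = new eks.Cluster(this, 'SharedCluster', {
      version: eks.KubernetesVersion.V1_29,
      kubectlLayer: new KubectlV29Layer(this, 'KubectlLayer'),
    });

    // Behind the scenes, CDK provisions a kubectl handler (a Lambda) in
    // this same account to apply the manifest to the cluster.
    cluster.addManifest('TeamANamespace', {
      apiVersion: 'v1',
      kind: 'Namespace',
      metadata: { name: 'team-a' },
    });
  }
}

new SameAccountEksStack(new App(), 'SameAccountEksStack');
```

Everything works here because the kubectl handler, the cluster, and the manifest all share one account. The trouble starts the moment the cluster belongs to someone else.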

Evidence of this Limitation:

  • Implicit Same-Account Assumption in CDK Design: AWS CDK's core constructs and IAM role management are inherently designed for deployments within a single AWS account. While the CDK documentation for KubernetesManifest does not explicitly forbid cross-account deployments, its examples and underlying mechanisms are geared towards single-account use cases.
  • Cross-Account IAM Complexity: Deploying manifests to an EKS cluster in a different account necessitates complex cross-account IAM role configurations. As highlighted in this Stack Overflow discussion on cross-account resource access in CDK, CDK's reliance on CloudFormation creates inherent challenges in managing resources across accounts. Cluster.addManifest does not automatically handle the necessary cross-account IAM role assumption, making it cumbersome to use in shared EKS environments (the sketch after this list shows how much of that wiring falls on the user).
  • AWS Best Practices Advocate Multi-Account EKS: AWS itself recommends a multi-account strategy for EKS, as outlined in their official documentation on Multi Account Strategy for Amazon EKS. This document details how to share VPC subnets and leverage IAM Roles for Service Accounts (IRSA) for secure cross-account access. The stark contrast between these best practices and the limitations of Cluster.addManifest underscores the tool's inadequacy for real-world EKS deployments.
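What a cross-account deployment actually demands is easiest to see in code. The following is a hedged sketch, not a recipe: the cluster name, account ID, and role ARN are hypothetical, and every piece of trust wiring described in the comments must be built by hand before addManifest can succeed:

```ts
import { App, Stack } from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';
import { Construct } from 'constructs';

class CrossAccountManifestStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cluster lives in account 111111111111; this stack deploys from a
    // different account. CDK creates NONE of the trust wiring below:
    //  - the kubectl role must already exist in the cluster's account,
    //  - it must trust the deploying account,
    //  - and it must already be mapped into the cluster's aws-auth config.
    const cluster = eks.Cluster.fromClusterAttributes(this, 'CentralCluster', {
      clusterName: 'central-eks', // hypothetical
      kubectlRoleArn: 'arn:aws:iam::111111111111:role/central-eks-kubectl', // hypothetical
    });

    cluster.addManifest('TeamANamespace', {
      apiVersion: 'v1',
      kind: 'Namespace',
      metadata: { name: 'team-a' },
    });
  }
}

new CrossAccountManifestStack(new App(), 'CrossAccountManifestStack');
```

Even this "working" shape assumes the kubectl provider configuration (layers, VPC placement, security groups) has been sorted out separately; none of it is generated for you.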

Ignoring the Network Foundation: A House Without Proper Plumbing

A truly practical EKS solution, especially in multi-account setups, hinges on a robust network foundation. This typically involves:

  • Transit Gateways: To establish secure and scalable connectivity between VPCs across different accounts.
  • VPC Sharing: To allow multiple accounts to share a central VPC and its subnets, often hosting the EKS cluster.
  • Private Subnets: For enhanced security, ensuring that manifest deployments and application workloads operate within private network segments.

However, AWS CDK's EKS implementation, including blueprints like aws-quickstart/cdk-eks-blueprints, often overlooks or simplifies this critical network layer. While these tools may automate EKS cluster creation and even VPC provisioning, they frequently fall short of providing comprehensive, automated solutions for setting up Transit Gateways or VPC sharing as an integral part of the EKS deployment process.

In real-world EKS architectures, the network layer is not an afterthought; it is the foundation upon which a secure, scalable, and multi-account Kubernetes environment is built. CDK's focus on simplifying cluster creation while abstracting away network complexities leads to solutions that are ill-equipped for production-grade, shared EKS deployments.
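To illustrate the gap, here is roughly what a team must hand-wire today. This is a sketch using the raw CloudFormation (L1) resources, since aws-cdk-lib ships no stable high-level construct for Transit Gateway:

```ts
import { Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

class NetworkFoundationStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const vpc = new ec2.Vpc(this, 'HubVpc', {
      ipAddresses: ec2.IpAddresses.cidr('10.0.0.0/16'),
    });

    // Teams drop to L1 resources because no stable L2 exists for TGW.
    const tgw = new ec2.CfnTransitGateway(this, 'Tgw', {
      description: 'Hub for multi-account EKS networking',
      autoAcceptSharedAttachments: 'enable', // accept attachments from RAM-shared accounts
    });

    new ec2.CfnTransitGatewayAttachment(this, 'HubAttachment', {
      transitGatewayId: tgw.ref,
      vpcId: vpc.vpcId,
      subnetIds: vpc.selectSubnets({
        subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
      }).subnetIds,
    });

    // Still missing: RAM shares to other accounts, TGW route tables, and
    // per-VPC routes back to the TGW. All manual, and none of it EKS-aware.
  }
}
```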

Token Resolution Failures: CDK's Promise Undermined

CDK's strength lies in its use of tokens – placeholders that are resolved during deployment, allowing for dynamic configurations and resource references. However, Cluster.addManifest fails to properly resolve these tokens, further hindering its practicality.

CDK tokens are designed to be resolved within the scope of a single CDK application and CloudFormation stack. When attempting to use a token from a resource within a manifest deployed to a cluster using Cluster.addManifest, token resolution often breaks down. CDK's default token resolution mechanisms are simply not designed to traverse account boundaries.

This limitation forces users to abandon CDK's elegant token-based approach and resort to manually passing concrete values – such as VPC IDs, subnet IDs, and security group IDs – as context parameters or environment variables to their CDK applications. This manual value passing is not only less elegant but also introduces more opportunities for errors and reduces the overall benefits of using CDK in the first place.
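A typical workaround looks like the following sketch, where the context keys, IDs, and availability zones are all hypothetical values the user must look up and supply by hand:

```ts
import { App, Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Tokens cannot cross the account boundary, so concrete IDs are injected
// from outside the app, e.g.:
//   cdk deploy -c sharedVpcId=vpc-0abc... -c appSgId=sg-0def...
const app = new App();

class WorkloadStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    // Plain strings, not tokens: they must be looked up and passed by hand.
    const vpcId: string = this.node.tryGetContext('sharedVpcId');
    const sgId: string = this.node.tryGetContext('appSgId');

    const vpc = ec2.Vpc.fromVpcAttributes(this, 'SharedVpc', {
      vpcId,
      availabilityZones: ['us-east-1a', 'us-east-1b'], // assumed AZs
    });
    const sg = ec2.SecurityGroup.fromSecurityGroupId(this, 'AppSg', sgId);

    // ...workload resources get wired with these hard-coded values, losing
    // the cross-reference safety that CDK tokens normally provide.
  }
}

new WorkloadStack(app, 'WorkloadStack');
```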



The Solution Begins with Networking

Prerequisite: Embracing Application-Centric Infrastructure in the Cloud 1

Environment: all tightly coupled logical resources grouped as a bounded context, i.e. a self-contained vertical slice with every resource needed to deliver a business capability, regardless of physical location or resource type.

Enver: an Environment Version. Different envers are logically and functionally consistent but carry different configuration values. An enver deploys and rolls back as a single unit.

Networking as a service: the network team owns all VPC-related resources (shown in red), managing them through code and a shared library. The networking account runs multiple networking envers; each contains and shares an IPAM, a VPC with a transit gateway, and a NAT gateway with the workload accounts. When workload envers deploy VPCs, they receive a CIDR range from the shared IPAM and use the shared NAT and internal naming system. Each VPC can attach to only one transit gateway, so VPCs (and the resources inside them) attached to the same TGW are connected, while VPCs attached to different TGWs remain physically disconnected (see the workload VPC sketch after this list).

RDS as a service: the DB team owns the database clusters that host databases for other envers, which define/own/control their own DB/schema/role/user resources.

EKS as a service: the k8s team owns the EKS clusters that host container orchestration for other envers, which define/own/control their own k8s manifests and service-account-to-IAM-role mappings.

The same pattern applies to EC2, MSK, OpenSearch, ECS, ElastiCache, Redshift, PrivateLink, and more.
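As an example of the networking-enver contract, a workload enver's VPC might draw its CIDR from the shared IPAM pool. This is a minimal sketch; the pool ID is hypothetical, and the pool itself would be shared to the workload account via AWS RAM:

```ts
import { Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

class WorkloadVpcStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    new ec2.Vpc(this, 'WorkloadVpc', {
      // CIDR is allocated by the networking enver's IPAM pool rather than
      // hard-coded, which is how IP conflicts across envers are avoided.
      ipAddresses: ec2.IpAddresses.awsIpamAllocation({
        ipv4IpamPoolId: 'ipam-pool-0123456789abcdef0', // hypothetical, RAM-shared
        ipv4NetmaskLength: 24,                          // one /24 slice per enver VPC
        defaultSubnetIpv4NetmaskLength: 26,             // carve /26 subnets from it
      }),
      subnetConfiguration: [
        { name: 'private', subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
      ],
      natGateways: 0, // egress flows through the shared NAT behind the TGW
    });
  }
}
```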


In this diagram:

  • The Networking AWS account runs two isolated envers, NT Enver LE and NT Enver Prod. Taking NT Enver Prod as an example:

1) One transit gateway connecting multiple VPCs across multiple accounts (same region);

2) One NAT gateway sharing internet access with all connected VPCs;

3) One IPAM and CIDR pool for all connected VPCs' subnets, avoiding IP conflicts;

4) Not shown: subnets, routing, security groups, DNS, hosted zones, AWS Organizations, admin delegation, certificates, ...

  • Central VPC as a proxy for deploying resources across VPCs:

1) Lambdas that deploy k8s manifests to EKS clusters in different envers;

2) Lambdas that deploy DB/schema/role/user resources to RDS clusters in different envers.

  • Central VPC as a proxy/hub connecting resources across VPCs: pods in EKS and tasks in ECS connect to databases in different RDS clusters from different envers.

Enver 1: logically declares/controls all resources shown in green, including k8s manifests and database-related resources (DB, schema, roles, ...), deploying or rolling back as a single transaction.

  • The manifest is deployed to EKS Enver 1's cluster; the pod assumes IAM Role 1 (through its service account and OIDC) to access DynamoDB (see the ServiceAccount sketch after this list).
  • The database, schema, and role/user are deployed to RDS Enver Prod; the ECS task inside assumes IAM Role 2 to access the DB through the TGW.
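The IRSA piece of Enver 1 can be pictured as a plain ServiceAccount manifest. A sketch, with hypothetical names and role ARN; the annotation key is the standard IRSA one:

```ts
// ServiceAccount manifest Enver 1 declares and sends to the EKS enver.
const enver1ServiceAccount = {
  apiVersion: 'v1',
  kind: 'ServiceAccount',
  metadata: {
    name: 'enver1-app',   // hypothetical
    namespace: 'enver1',  // hypothetical
    annotations: {
      // Maps this service account to IAM Role 1 via the cluster's OIDC
      // provider; pods mounting this SA receive the role's credentials.
      'eks.amazonaws.com/role-arn':
        'arn:aws:iam::222222222222:role/enver1-dynamodb-access', // hypothetical
    },
  },
};
```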

Enver 2: likewise logically declares/controls all resources shown in purple. After its manifest is deployed to EKS Enver 1's cluster, the pod will:

  1. Assume IAM Role A to access the DB through the transit gateway (no VPC needed in Enver 2!)
  2. Assume IAM Role B to access an S3 bucket for files.

The platform takes care of the deployments, so apps and services focus purely on business logic and function:

  • The k8s manifests declared in Enver 1 and Enver 2 are sent to the EKS cluster through the Central Account's Lambda function in VPC-Prod (see the handler sketch below).
  • The DB schema/role/user resources declared in Enver 1 and Enver 2 are sent to the RDS cluster through the Central Account's Lambda function in VPC-Prod.
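The deployer Lambda itself can be sketched as follows. Assumptions: the '@kubernetes/client-node' package is bundled with the function, and buildEksKubeconfig() is a hypothetical helper that assumes the target cluster's kubectl role and exchanges it for an EKS auth token:

```ts
import * as k8s from '@kubernetes/client-node';

// Hypothetical helper: sts:AssumeRole into the cluster account, then build
// a kubeconfig with an EKS token for the named cluster.
declare function buildEksKubeconfig(clusterName: string): Promise<k8s.KubeConfig>;

interface DeployEvent {
  clusterName: string;                // which EKS enver to target
  manifests: k8s.KubernetesObject[];  // objects declared by the app enver
}

export const handler = async (event: DeployEvent): Promise<void> => {
  const kubeConfig = await buildEksKubeconfig(event.clusterName);
  const client = k8s.KubernetesObjectApi.makeApiClient(kubeConfig);

  for (const spec of event.manifests) {
    try {
      // Create-or-update: read first, patch if it exists, create if not.
      await client.read(spec as any);
      await client.patch(spec);
    } catch {
      await client.create(spec);
    }
  }
};
```

Because this function runs inside the Central VPC (connected to every enver's cluster through the TGW), app envers never need their own network path to the clusters they deploy to.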



<detailed code and deployments are on the way>
