Internet Egress Security Architecture for AWS Workloads | Part 1 - Regional Hubs
Some Background
Over the past 18 months I have been working with organizations and the Zscaler team to help deliver security for public cloud workloads on platforms such as AWS, Azure, and GCP. As many of you know and have been doing for years, it's possible to send your internet-bound traffic to the Zscaler Internet Access (ZIA) platform using a variety of methods: GRE tunnels, IPsec tunnels (including SD-WAN integrations), Client Connector, and even PAC files. This flexibility is one of the many things that make the Zscaler platform so amazing.
Enter public cloud. AWS. Azure. GCP. You can just configure some GRE or IPsec tunnels and forward internet-bound workload traffic to ZIA easily, right? Well, not really :) Some cloud providers don't support GRE tunnels, and some of the native VPN/IPsec capabilities do not support the resiliency/HA many organizations require. There are third-party solutions, such as deploying virtual cloud routers and then setting up IPsec tunnels to ZIA. This can work, but in most cases we are seeing it is not scalable, not just from a throughput perspective but operationally.
Zscaler for Workloads offers a component called Cloud Connectors. These are Zscaler purpose-built gateways that can be deployed into public cloud platforms to forward traffic to both the Zscaler Internet Access (ZIA) and Zscaler Private Access (ZPA) platforms. Cloud Connectors run as EC2 instances/VMs, integrate with the cloud providers' native load balancers, scale horizontally, and are deployed with IaC tools such as Terraform and CloudFormation. Cloud Connectors securely forward traffic to Zscaler using DTLS/TLS tunnels, which many customers will find familiar because it is the same underlying tunneling technology Zscaler uses with Client Connector. If you want to learn more, please visit https://www.zscaler.com/solutions/infrastructure-modernization/cloud-connectivity
What is a workload, anyway? Any service or machine that communicates on the network and typically does not have a user logged into it. This can be EC2 instances, RDS instances, EKS (container) nodes, Lambda functions, etc.
Cloud Connectors on AWS
Let's quickly cover the Cloud Connector component to provide more familiarity and context for the rest of the article. In this example, we have decided to create a Zscaler VPC in the AWS account that holds the regional Transit Gateway. Zscaler can automate the creation of the VPC using Terraform, but many organizations use existing code or processes for the underlying network. Zscaler generally recommends a minimum of two AZs and deploying the components into private subnets, because no inbound connectivity from the internet is needed. In this case, we have:
Once the underlying network is in place, we deploy using Terraform or CloudFormation. In this example, the default Zscaler TF/CFT templates will deploy a Lambda macro, one Cloud Connector per subnet/AZ (m5.large), a GWLB service, a target group containing the Cloud Connector service ENIs, and a GWLB VPC Endpoint in the same subnets as the Cloud Connectors.
Upon successful enrollment with Zscaler, each Cloud Connector will, by default, discover and establish two outbound tunnels (DTLS, TLS, or unencrypted) to the closest optimal Zscaler Service Edges. Each Cloud Connector has an active tunnel for forwarding workload traffic to Zscaler, while the secondary/backup tunnel remains in standby mode.
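To make the active/standby behavior concrete, here is a minimal sketch of the failover logic in Python. This is purely illustrative and not Zscaler's actual implementation; the tunnel names are made up.

```python
# Illustrative sketch (not Zscaler's implementation): a Cloud Connector
# prefers its primary tunnel and promotes the standby only when the
# primary is unhealthy.
from dataclasses import dataclass


@dataclass
class Tunnel:
    service_edge: str  # hypothetical Service Edge name
    healthy: bool = True


def active_tunnel(tunnels):
    """Return the first healthy tunnel in preference order:
    primary first, standby only if the primary is down."""
    for t in tunnels:
        if t.healthy:
            return t
    return None  # all tunnels down: traffic cannot be forwarded


primary = Tunnel("edge-primary")
standby = Tunnel("edge-standby")

assert active_tunnel([primary, standby]) is primary  # normal operation
primary.healthy = False                              # primary tunnel fails
assert active_tunnel([primary, standby]) is standby  # standby promoted
```

The same preference-ordered selection idea reappears later at the GWLB layer, where traffic shifts between Cloud Connectors across AZs.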
Please note this is the default configuration and behavior. Organizations can use forwarding rules to send traffic to different destinations and Service Edges. This means that when a Cloud Connector processes traffic, it may have multiple active tunnels established to forward traffic to the different destinations defined by those rules.
For more information please read https://help.zscaler.com/cloud-branch-connector/networking-flows-cloud-connector
That should be a good enough introduction to what Cloud Connectors are and how they work in AWS for the purposes of this article. Let's move on to the design and topology decisions!
AWS Topologies
Many organizations have massive AWS footprints. Surprisingly, the complexities I see are not tied to the quantity or types of workloads running in AWS: EC2 instances, RDS instances, Lambda functions, EKS nodes, etc. Operationalizing granular security policies does take time, but it is not at the forefront of most conversations.
So what is? Cloud network topology and design decisions. There are many considerations to account for, and many are not Zscaler-specific. The right design will optimize costs, reduce operational overhead, and not compromise security. I hope to share some insight from my experience designing and deploying this with many organizations.
Let me reiterate one more thing: there is no single answer or design that works for every deployment, but I am starting this blog series with what I have seen to be the most common on AWS: Regional Security Hubs using Transit Gateway (TGW). In a nutshell, this is a hub-and-spoke model in which the workload spoke VPCs are connected to a regional TGW, routing can be centralized, and security products can be deployed into a "security", "inspection", or "egress" VPC to service all the spoke VPCs. There are so many configuration options that covering them all would require a 50-page document, so I will keep to the point and not discuss every nuance. I'll refer to these hubs as being per-region, but it is possible to have multiple hubs/TGWs per region. Large organizations may separate workloads across environments: Test/Dev, QA, Production. So just keep in mind your exact design will vary.
There are many benefits of using TGW, and you can learn more by going to https://docs.aws.amazon.com/prescriptive-guidance/latest/integrate-third-party-services/architecture-3.html and https://aws.amazon.com/transit-gateway/features/ or just use your favorite search engine :)
Is this model right for us?
Let's make it simple. When you are wondering if this is the best or possible topology when it comes to Zscaler, ask yourself these 3 questions:
If you answered yes to at least one of the questions, this might be the best option for you as an enterprise standard. Does that mean you will be "all or nothing"? Nope. Thanks to AWS innovation, the AWS Gateway Load Balancer (GWLB) enables security services and vendors like Zscaler to use what is called a Distributed GWLB Endpoint model. This means you can have centralized regional security VPCs with Zscaler Cloud Connectors and still secure isolated VPCs that are not peered or connected to the security VPC via TGW! If this concept is new to you, don't worry, I will explain it a bit more if you keep reading...
What about other design options? We'll cover that in Part 2, but to give you a sneak peek... it's a fully decentralized model where each VPC is isolated without any peering or TGW connectivity. For this article we are only talking about the centralized hub model!
Architecture Overview
Let's take a look at this high-level design example, where the organization has decided to deploy Zscaler Cloud Connectors into regional hubs because most of the workload spoke VPCs are already connected via Transit Gateway. However, the organization also has a few isolated VPCs that do not require access to private resources or applications but do need internet egress protection. Instead of deploying Cloud Connectors directly into each isolated VPC, the organization simply connects to the GWLB Service fronting the Zscaler Cloud Connectors by deploying a GWLB VPC Endpoint into the isolated VPC. This hybrid model provides the benefits of:
The diagram depicts a hybrid approach, but there are some variables I want to call out, because a single diagram can't account for every possibility:
Brief note: many customers raise questions about traffic they do not want to send to Zscaler. The implementation details vary by use case, but with routing, forwarding rules, and other configuration it is possible to send all or only some traffic to Cloud Connectors and/or Zscaler. We will not cover this topic in this article, but it's an important consideration that Zscaler is aware of!
Spoke VPCs to Zscaler via TGW
In many cases, most of the VPCs will be connected to the Transit Gateway where the Cloud Connectors are deployed. As we zoom into this portion of the network diagram, we can see the spoke will route all traffic destined outside of its own VPC to the TGW. If the TGW route table is configured to send the default route to the Zscaler VPC attachment, the subnet route tables in the Zscaler VPC will in turn default-route to the GWLB VPC Endpoints fronting the Cloud Connectors.
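The hop chain above can be traced with a simplified longest-prefix-match lookup, the same rule AWS route tables apply. The CIDRs and resource IDs below are made up for illustration only.

```python
# Illustrative sketch (all IDs/CIDRs are hypothetical): tracing
# Spoke VPC -> TGW -> Zscaler VPC -> GWLB endpoint with a simplified
# longest-prefix-match route lookup, as AWS route tables behave.
import ipaddress


def lookup(route_table, dst):
    """Return the target of the most specific matching route."""
    dst = ipaddress.ip_address(dst)
    matches = [(net, target) for net, target in route_table.items()
               if dst in ipaddress.ip_network(net)]
    return max(matches, key=lambda m: ipaddress.ip_network(m[0]).prefixlen)[1]


spoke_rt   = {"10.1.0.0/16": "local", "0.0.0.0/0": "tgw-0abc"}
tgw_rt     = {"0.0.0.0/0": "zscaler-vpc-attachment"}
zscaler_rt = {"10.0.0.0/8": "tgw-0abc", "0.0.0.0/0": "vpce-gwlb"}

# Internet-bound flow from a spoke workload traverses all three hops:
assert lookup(spoke_rt, "93.184.216.34") == "tgw-0abc"
assert lookup(tgw_rt, "93.184.216.34") == "zscaler-vpc-attachment"
assert lookup(zscaler_rt, "93.184.216.34") == "vpce-gwlb"
# Intra-VPC traffic matches the more specific local route and never leaves:
assert lookup(spoke_rt, "10.1.5.9") == "local"
```

Note how the default route does all the work: the spoke needs no knowledge of Zscaler, only a 0.0.0.0/0 route toward the TGW.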
The Cloud Connectors then forward the traffic appropriately, such as over the established DTLS tunnels to the Zscaler Service Edges, as depicted in the diagram below. When GWLB cross-zone load balancing is enabled (which is our recommendation), GWLB can send traffic to Cloud Connectors across all AZs instead of only the workload's source AZ. This is important from an HA/resilience perspective: if the Cloud Connector(s) in AZ1 are unable to tunnel traffic to Zscaler, the healthy Cloud Connector(s) in AZ2 can forward that traffic without interruption. It is also important to note that if a primary tunnel from a Cloud Connector fails to connect to Zscaler, the secondary tunnel will be marked active and used to forward traffic (as depicted in red below).
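The cross-zone behavior can be sketched in a few lines of Python. This is a conceptual model only, not GWLB's actual target-selection code; the target names are invented.

```python
# Conceptual sketch (not actual GWLB internals): with cross-zone load
# balancing enabled, any healthy Cloud Connector target is eligible;
# with it disabled, only healthy targets in the traffic's source AZ are.
def eligible_targets(targets, source_az, cross_zone):
    healthy = [t for t in targets if t["healthy"]]
    if cross_zone:
        return healthy
    return [t for t in healthy if t["az"] == source_az]


targets = [
    {"id": "cc-az1", "az": "az1", "healthy": False},  # AZ1 connector down
    {"id": "cc-az2", "az": "az2", "healthy": True},
]

# Cross-zone disabled: AZ1 traffic is left with no healthy target.
assert eligible_targets(targets, "az1", cross_zone=False) == []
# Cross-zone enabled: AZ1 traffic fails over to the AZ2 connector.
assert [t["id"] for t in eligible_targets(targets, "az1", cross_zone=True)] == ["cc-az2"]
```

This is exactly the failure scenario described above: with cross-zone enabled, losing every connector in one AZ does not black-hole that AZ's workloads.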
Now we have traffic routing from a spoke VPC to Zscaler, so the workloads are protected. Although I am simplifying for the purposes of this article, you would then associate the other workload spoke VPCs in the region with the TGW route table pointing to the Zscaler VPC to protect them as well. It is mostly "rinse and repeat" at this point for all VPCs in the region, then the next region, etc.
Isolated VPCs to Zscaler with Distributed Endpoints
Last but not least, what about those isolated VPCs in the same region that have no peering or TGW connectivity to the Zscaler VPC? This is where we zoom into this portion of the diagram, and the connectivity looks almost identical to the TGW case from a diagram perspective. A minor but critical detail is that instead of a TGW attachment, we have simply deployed a GWLB VPC Endpoint that connects to the existing GWLB Service fronting the Cloud Connectors (from the TGW diagram). This connection uses AWS PrivateLink to stay on the AWS backbone/network, but allows the same connectivity out to the internet with Zscaler protection!
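From the isolated VPC's point of view, the routing collapses to a single table: the default route points straight at the local GWLB VPC Endpoint, with no TGW hop at all. A minimal sketch, with a made-up endpoint ID and CIDR:

```python
# Illustrative only (endpoint ID and CIDR are hypothetical): in the
# distributed endpoint model, the isolated VPC's route table sends
# everything non-local directly to the GWLB VPC Endpoint.
import ipaddress

isolated_rt = {
    "10.9.0.0/16": "local",         # the isolated VPC's own CIDR
    "0.0.0.0/0":   "vpce-0gwlb123", # GWLB endpoint -> Cloud Connectors
}


def next_hop(route_table, dst):
    """Longest-prefix-match lookup, as an AWS route table would do."""
    dst = ipaddress.ip_address(dst)
    best = max((ipaddress.ip_network(n) for n in route_table
                if dst in ipaddress.ip_network(n)),
               key=lambda n: n.prefixlen)
    return route_table[str(best)]


assert next_hop(isolated_rt, "8.8.8.8") == "vpce-0gwlb123"  # egress via Zscaler
assert next_hop(isolated_rt, "10.9.3.3") == "local"         # stays in-VPC
```

Compare this with the three-table chain in the TGW case earlier: same destination, one fewer routing layer to operate.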
So in the above diagram you'll notice the architecture/topology is still centralized, but the AWS GWLB Service enables connectivity to Zscaler without VPC-to-VPC connectivity! From a routing perspective, the differences in this method are:
Are there any advantages or disadvantages to the Distributed Endpoint model instead of just attaching them to TGW?
What's Next
In Part 2 of this series, we will cover a fully decentralized AWS model where you deploy Cloud Connectors into each workload VPC with direct secure internet access. Don't worry, later articles will cover Azure, GCP, and then ZPA-specific use cases too. I plan to publish a new part every few weeks!
Now, you might have some questions. Please don't hesitate to reach out to your Zscaler Customer Success Manager or Account Team and ask for a Workload Communications Discovery Workshop. Nothing beats some diagrams, digital or in-person white boarding, and talking through all the details of the design.
Don't forget to sign up for a self-paced, hands-on learning lab: https://cms.zscaler.com/workload-communications-workshop