Demystifying PaaS security (part 1)
As a sequel to this article, today I propose to engage in a series of architecture security articles about PaaS in Public Clouds, compute PaaS specifically. The security of compute PaaS is not well documented, so many questions are asked and many concerns are raised by the IT security community. Its key foundations are simple, though, and can be explained without breaching any non-disclosure agreement.
This first installment looks at tenant isolation.
Feeling at home within shared compute services
At first glance, the problem looks simple enough: you want to isolate each tenant's binaries (or code, if interpreted) into separate VMs. To put it another way: you let the host (the hypervisor) do all the fencing around the guests (customer VMs), their executables and their I/O.
Even if it seems the most straightforward way to fall back to a known risk management situation (we've all been doing this on premises for over a decade now), it's worth noting that until recently, not all mainstream public cloud providers followed this path. But so far, and to the best of my knowledge, the alternative approaches have fallen short, and VM isolation is about to become the de facto standard.
So let's take this for granted. Before looking at the engineering issues and caveats it implies, we should first consider the kinds of compute capabilities that are delivered on top of guests. For best portability and execution times, containers are the obvious answer. The fact that containers (or pools of containers, all single tenant) are started in dedicated guests somewhat defeats the prospect of lightning fast execution times, but here Cloud providers leverage the full power of their workhorse backends to pre-provision as many (or as few) instances as necessary to deliver a good customer experience at an affordable price.
A side note: in the Windows version of Docker, Microsoft came up with an elegant solution to wrap all this up into a single concept: running containers in Hyper-V isolation. If Hyper-V is enabled on your PC, or if you rent a VM in the Cloud with nested virtualization support, you can easily test it yourself.
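Here is a minimal sketch of what that test looks like, assuming a Windows host with Docker installed; the image tag is only an example and may need to match your host build:

```
# Hyper-V isolation: the container runs inside its own lightweight utility VM
docker run --rm -it --isolation=hyperv mcr.microsoft.com/windows/nanoserver:1809 cmd

# Default process isolation on Windows Server: the container shares the host kernel
docker run --rm -it --isolation=process mcr.microsoft.com/windows/nanoserver:1809 cmd
```

The first form gives the same kind of fencing public PaaS backends rely on: a kernel boundary per tenant, not just a namespace boundary.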
Now let’s see how well this plays out in practice for various customer compute needs:
- Event-driven and/or single compute tasks: if you read my mind, then yes, I am thinking of the wide and fuzzy range of autonomous tasks, from functions to stateless long-running jobs. Here the guest-plus-container recipe fits pretty nicely because it is easy to implement at scale and easy to schedule and monitor (a quick deployment sketch follows this list). Incidentally, that's the family of compute services that appeared first on the PaaS marketplace, just after guest-only recipes like Beanstalk. We will see in a later installment that other big factors come into play to explain this timeline, however.
- Microservices: with a few exceptions that usually bring maybe 10% of the value of an API, even the most basic microservice is made of several compute and stateful components that need to communicate over some network layer. Here the simple model described above starts to fall apart: to sustain microservices, Cloud providers need some kind of single-tenant container machinery able to manage topology, communication and state awareness.
- Applications: providers need to manage containers in a single-tenant way as above, but with many more components layered into tiers. Add to that the fact that not all application parts scale at the same pace or under the same service level contracts.
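To illustrate how thin that first category is from the customer's point of view, here is a hedged sketch of publishing a single function; the function name, runtime and role ARN are placeholders, and the provider decides where (and in which pre-provisioned guest) the code actually runs:

```
# Package the handler and hand it over; no guest or container is ever visible to the customer
zip fn.zip handler.py
aws lambda create-function \
  --function-name my-fn \
  --runtime python3.6 \
  --handler handler.main \
  --zip-file fileb://fn.zip \
  --role arn:aws:iam::123456789012:role/my-lambda-role
```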
Just as containers were not designed to run in multi-tenant guests, off-the-shelf container managers like Swarm or Kubernetes were not designed to run in multi-tenant public platforms.
In Swarm or Kubernetes, there's an insuperable amount of entanglement between execution nodes and master nodes, and/or a dangerous proximity between controller containers and tenant containers. Customers wishing to run full-fledged APIs or applications in public Cloud PaaS have long had to resort to clusters run in dedicated guests, a less than ideal situation for anyone who comes with a cloud-native experience in mind. That's one of the key reasons why AKS, ACS, ECS and EKS appeared on the PaaS marketplace before more cloud-native compute offers.
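For the record, here is a hedged sketch of that dedicated-guest workaround; resource names, node counts and VM sizes are placeholders:

```
# AKS: the worker nodes are VMs dedicated to this customer
az aks create --resource-group my-rg --name my-cluster --node-count 3 --node-vm-size Standard_DS2_v2

# EKS equivalent with eksctl: the worker nodes are EC2 instances owned by the customer
eksctl create cluster --name my-cluster --nodes 3
```

Isolation is achieved, but the customer is back to sizing, patching and paying for node pools, which is hardly the serverless experience they came for.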
Solution design for secure serverless computing
Let's see how Cloud providers have overcome this for serverless.
- Amazon’s way: what is fascinating about AWS engineers is their almost supernatural gift for making the right design choices from the start. I suspect Werner Vogels has a lot to do with it, although I cannot prove it... I was among the previewers of ECS when it came out not so long ago. At the time, I must admit that I didn’t like the way Amazon scheduled my containers: I found the ECS admission control, based on AWS-provided jobs and state machines, both cumbersome and not cloud friendly. But this is the direct consequence of the fact that in ECS, execution nodes and controller nodes are clearly separated; of course there is a tight relationship between a set of execution nodes and the set of its controller nodes, which is maybe why it has taken time to establish a secure delineation between the two in Fargate (see the sketch after this list). What I considered an awkward design pattern turned out to pave the way for genuinely efficient serverless container management in AWS!
- Azure’s ways: as often in Microsoft’s post-Ballmer universe, there is not one but several ways to explore and reach a given goal. The ways are not competing against each other; rather, they stimulate each other like twin R&D projects. The management of compute services is no exception: if you are an Azure customer, you may use App Services or Service Fabric clusters (I leave grid computing and batch aside). From a bird’s-eye view, both offer more or less the same set of compute capabilities: functions, containers, binaries, stateful or stateless. The difference gets even thinner as time goes by. But for serverless computing, as of October 2018, only one seems to stand out: Service Fabric Mesh. This service takes the best of two things: the strong and ever-growing management capabilities of Service Fabric on the one hand, and VM isolation on the other. My guess is that the operating system behind Service Fabric Mesh is Windows and that containers are isolated in Hyper-V mode, but the actual implementation could be very different. Since Service Fabric supports Windows and Linux alike (another post-Ballmer consequence), it might make more sense for Azure to preserve operational excellence by running a mix of both flavors as part of the global Service Fabric Mesh offer. Time will tell.
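As announced in the Amazon bullet, here is a hedged sketch of what the Fargate experience looks like from the customer's side; the cluster, task definition, subnet and security group identifiers are placeholders, and the controller and execution nodes never appear in the customer's account:

```
# Launch a container task on provider-managed capacity: no EC2 instance to create or secure
aws ecs run-task \
  --cluster my-cluster \
  --task-definition my-task:1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0abc1234],securityGroups=[sg-0abc1234],assignPublicIp=DISABLED}"
```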
By building their own container management solutions, Cloud providers have been able to decouple multi-tenant activities (all concentrated into provider-managed controllers) from mono-tenant ones. Customer workloads are isolated into guests, thereby reducing the co-residency risk to a VM escape risk. The residual risk is not voided but standardized; it's up to each customer to either accept it or find additional risk-reduction measures.
In the next installment, we will look at other engineering challenges providers have had to meet in order to deliver secure compute PaaS, some of which remain unsolved.