Whose Fault is it? Achieving resiliency in the cloud
A fault, in traditional engineering terms, is an incorrect step or process that leads to a failure. In the world of software this often results in an application or service outage, which is why we strive to design and implement systems to be fault tolerant. Such systems are a combination of the infrastructure and the software platforms and applications running on it. Resiliency is a shared responsibility between the infrastructure (physical or virtual) and the applications it hosts. In the past, with our legacy mainframe and distributed systems, the primary responsibility for resiliency in this shared model rested on the infrastructure. In the cloud this paradigm has shifted significantly, and applications have assumed much of that burden. This means that applications have to be much smarter: they must be designed from the ground up to be resilient and self-healing. The cloud service providers (CSPs) support application resiliency by exposing infrastructure services such as load balancing and auto-scaling, but it is up to architects and developers to use them. Along those lines, I like to think of application resiliency in the cloud across three dimensions: architecture, deployment and observability.
Architecture
In the cloud we run our applications on commodity, multi-tenant, ephemeral infrastructure. Failures of services, hardware and networks are statistically inevitable, so we need to embrace this by designing our applications to anticipate those failures and recover from them quickly. The Twelve-Factor App principles are a good place to start with this kind of design. All of the principles are important, but for resiliency a few deserve particular focus: isolation of dependencies, concurrency, disposability, backing services, process scaling and port binding. Essentially, what this adds up to is an application that is decoupled from the underlying infrastructure and operating system, accesses everything it needs via web services, is stateless within a web session context, can scale its components independently, and can recover from failures with fast startups and controlled shutdowns. This also requires the implementation of standard design patterns such as circuit breakers to handle graceful degradation of functionality, service discovery to ensure location transparency (e.g. you don't bind to a service at a static IP address), and dependency injection to configure your application dynamically for each environment. Together these allow an application to handle failures of dependent services or infrastructure and operate properly in an elastic cloud environment.
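To make the circuit breaker pattern concrete, here is a minimal sketch in Python (not from any particular library; the class name and parameters are illustrative). After a configurable number of consecutive failures it "opens" and short-circuits calls to a fallback until a cooldown elapses, which is the graceful-degradation behavior described above:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    then returns a fallback until a cooldown period elapses."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, fn, *args, fallback=None, **kwargs):
        # While open, short-circuit to the fallback until the cooldown passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback
            self.opened_at = None  # half-open: allow one trial call through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # any success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
```

In practice you would layer on timeouts, per-dependency breakers and metrics (production libraries such as resilience4j or Polly do this), but the core idea is just this state machine wrapped around a remote call.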
Deployment
Design is important, but implementation matters just as much. You can design your application to be resilient and recover from failure, but if your deployment topology is not executed properly it will make little difference. To ensure resiliency, an application must be deployed into a well-designed landing zone that takes advantage of the appropriate cloud infrastructure services. This means that applications are deployed with instances in multiple availability zones (i.e. physically separate cloud data centers), leverage load balancers to distribute traffic across those instances, and are configured with auto-scaling groups so they can elastically scale up or down to meet demand. Depending on the required availability and recovery time objective (RTO) for an application, this may also require deployment to multiple geographic regions using a global traffic manager (e.g. AWS Route 53, Azure Traffic Manager) with policies for priority (active/passive) or geographic affinity (active/active). In an everything-as-code model, the configuration for a deployment should be captured as a parameterized ops-code template (e.g. CloudFormation, ARM, Terraform) that is stored in version control as part of the application code base. This applies equally to deployments into container runtime environments, with YAML manifests and Helm charts for Kubernetes or manifest files for Cloud Foundry.
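As a sketch of what deployment configuration as code can look like, here is a minimal, hypothetical Kubernetes Deployment manifest (the service name and image are placeholders) that runs multiple replicas and asks the scheduler to spread them across availability zones, mirroring the multi-AZ guidance above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api            # hypothetical service name
spec:
  replicas: 3                 # multiple instances for redundancy
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      topologySpreadConstraints:   # spread pods across availability zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: orders-api
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.0.0  # placeholder image
```

A real landing zone would add resource requests, probes and an autoscaler (e.g. a HorizontalPodAutoscaler) on top of this, all versioned alongside the application code.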
Observability
One of the keys to successful operations for an application in the cloud is observability. If you can't see what is going on, you can't act on it. This means that applications and the hosts they run on need to stream telemetry out to a logging platform endpoint where it can be aggregated and indexed. This can be accomplished through native cloud capabilities such as CloudWatch or vendor products such as Datadog, New Relic, Splunk or Sysdig. The advantage of the vendor products is a generally rich set of dashboarding and analytical tools, but this comes, of course, at a cost. The point here is that the application needs to be decoupled from the log aggregation mechanism (i.e. the app just writes to stdout/stderr). Application operators should also not expect to have interactive access (i.e. ssh into a host), as that should be restricted in production, and by the time you log onto a troubled ephemeral host it is probably gone anyway. It is important that applications log enough information to raise alerts and diagnose issues quickly. There are graduated maturity levels in observability, but initially it is good enough to match patterns in the logs and raise alerts from them consistently. Over time, application teams can progress to capabilities that apply artificial intelligence to spot anomalies and eliminate noise, or potentially do predictive analytics to identify impending issues before they actually occur.
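To illustrate decoupling an app from the log aggregation mechanism, here is a small Python sketch (the logger name and message are made up) that writes one structured JSON record per line to stdout and leaves shipping, aggregation and indexing entirely to the platform:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line,
    which log forwarders can parse and index without custom patterns."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# The app only knows about stdout; the platform owns everything downstream.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("payments").warning("upstream latency high")
```

Because the format is structured, pattern matching and alerting in the aggregation layer become simple field queries rather than brittle regexes.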
Closing thoughts
There is a lot to think about when running an application on a cloud platform, but the most important thing to remember is that it is a shared responsibility model. The CSPs publish service level availability metrics, but these really serve to tell you that the services will fail at some point and that your application needs to be ready to handle it. As an example, AWS says in its service level agreement for EC2 that it will use "commercially reasonable efforts" to make services available for each AWS region with a monthly uptime percentage of at least 99.99% (i.e. four nines). That sounds good, and it is, but it means you should be prepared for roughly 4.38 minutes of downtime in a region per month, or 52.6 minutes per year. That doesn't sound like much, but if services are timing out and you have escalating, cascading failures spreading to multiple tiers in your system, even a few seconds can feel like an eternity. The point is that you need to be ready for this with the right architecture, deployment topology and observability for your applications in the cloud. Once you have these things in place, a good way to validate resiliency is with chaos testing (in pre-production environments, anyway). It takes time, engineering discipline and experience to get it right, but if you do, then you won't be asking "whose fault is it?" when something goes wrong.
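The downtime figures above are a simple back-of-the-envelope calculation worth being able to reproduce; here is a small Python sketch (function name is illustrative) using an average-length month and year:

```python
def allowed_downtime_minutes(availability_pct, period_minutes):
    """Downtime budget implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_minutes


MINUTES_PER_MONTH = 30.44 * 24 * 60   # average Gregorian month
MINUTES_PER_YEAR = 365.25 * 24 * 60   # average year, including leap days

monthly = allowed_downtime_minutes(99.99, MINUTES_PER_MONTH)  # ~4.38 minutes
yearly = allowed_downtime_minutes(99.99, MINUTES_PER_YEAR)    # ~52.6 minutes
print(f"99.99% allows {monthly:.2f} min/month, {yearly:.1f} min/year of downtime")
```

Running the same arithmetic at 99.9% (three nines) yields about 43.8 minutes per month, which shows how much each additional nine tightens the budget your architecture has to absorb.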