Whose Fault is it? Achieving resiliency in the cloud
A fault, in traditional engineering terms, is an incorrect step or process that leads to a failure. In the world of software this often results in an application or service outage, which is why we strive to design and implement systems to be fault tolerant. Such systems are a combination of the infrastructure and the software platforms and applications running on it. Resiliency is a shared responsibility between the infrastructure (physical or virtual) and the applications it hosts. In the past, with our legacy mainframe and distributed systems, the primary responsibility for resiliency in this shared model rested on the infrastructure. In the cloud this paradigm has shifted significantly, and applications have assumed much of that burden. This means that applications have to be much smarter: they must be designed from the ground up to be resilient and self-healing. The cloud service providers (CSPs) support application resiliency by exposing infrastructure services such as load balancing and auto-scaling, but it is up to architects and developers to use them. Along those lines, I like to think of application resiliency in the cloud across three dimensions: architecture, deployment and observability.
Architecture
In the cloud we run our applications on commodity, multi-tenant, ephemeral infrastructure. Failures of services, hardware and networks are statistically inevitable, so we need to embrace this by designing our applications to anticipate those failures and recover from them quickly. The Twelve-Factor App principles are a good place to start with this kind of design. All of the principles are important, but for resiliency a few deserve particular focus: isolation of dependencies, concurrency, disposability, backing services, process scaling and port binding. Essentially, what this adds up to is an application that is decoupled from the underlying infrastructure and operating system, accesses everything it needs via web services, is stateless within a web session context, can scale its components independently, and can recover from failures with fast startups and controlled shutdowns. This also requires the implementation of standard design patterns such as circuit breakers to handle graceful degradation of functionality, service discovery to ensure location transparency (e.g. you don't bind to a service at a static IP address), and dependency injection to configure your application dynamically for each environment. Together these allow an application to handle failures of dependent services or infrastructure and operate properly in an elastic cloud environment.
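To make the circuit breaker pattern concrete, here is a minimal sketch in Python (not from any particular library; the class name and parameters are illustrative). After a configurable number of consecutive failures it "opens" and short-circuits calls to a fallback until a cooldown elapses, which is the graceful-degradation behavior described above:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    then returns a fallback until a cooldown period elapses."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, fn, *args, fallback=None, **kwargs):
        # While open, short-circuit to the fallback until the cooldown passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback
            self.opened_at = None  # half-open: allow one trial call through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # any success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
```

In practice you would layer on timeouts, per-dependency breakers and metrics (production libraries such as resilience4j or Polly do this), but the core idea is just this state machine wrapped around a remote call.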
Deployment
Design is important, but implementation matters just as much. You can design your application to be resilient and recover from failure, but if your deployment topology is not executed properly it will make little difference. To ensure resiliency, an application must be deployed into a well-designed landing zone that takes advantage of the appropriate cloud infrastructure services. This means that applications are deployed with instances in multiple availability zones (i.e. physically separate cloud data centers), leverage load balancers to distribute traffic across those instances, and are configured with auto-scaling groups so they can elastically scale up or down to meet demand. Depending on the required availability and recovery time objective (RTO) for an application, this may also require deployment to multiple geographic regions using a global traffic manager (e.g. AWS Route 53, Azure Traffic Manager) with policies for priority (active/passive) or geographic affinity (active/active). In an everything-as-code model, the configuration for a deployment should be captured as a parameterized ops-code template (e.g. CloudFormation, ARM, Terraform) that is stored in version control as part of the application code base. This applies equally to deployments into container runtime environments, with YAML manifests and Helm charts for Kubernetes or manifest files for Cloud Foundry.
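As a sketch of what deployment configuration as code can look like, here is a minimal, hypothetical Kubernetes Deployment manifest (the service name and image are placeholders) that runs multiple replicas and asks the scheduler to spread them across availability zones, mirroring the multi-AZ guidance above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api            # hypothetical service name
spec:
  replicas: 3                 # multiple instances for redundancy
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      topologySpreadConstraints:   # spread pods across availability zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: orders-api
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.0.0  # placeholder image
```

A real landing zone would add resource requests, probes and an autoscaler (e.g. a HorizontalPodAutoscaler) on top of this, all versioned alongside the application code.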
Observability
One of the keys to successful operations for an application in the cloud is observability. If you can't see what is going on, you can't act on it. This means that applications and the hosts they run on need to stream telemetry out to a logging platform endpoint where it can be aggregated and indexed. This can be accomplished through native cloud capabilities such as CloudWatch or vendor products such as Datadog, New Relic, Splunk or Sysdig. The advantage of the vendor products is a generally rich set of dashboarding and analytical tools, but this comes, of course, at a cost. The point here is that the application needs to be decoupled from the log aggregation mechanism (i.e. the app just writes to stdout/stderr). Application operators should also not expect to have interactive access (i.e. ssh into a host), as that should be restricted in production, and by the time you log onto a troubled ephemeral host it is probably gone anyway. It is important that applications log enough information to raise alerts and diagnose issues quickly. There are graduated maturity levels in observability, but initially it is good enough to match patterns in the logs and raise alerts from them consistently. Over time, application teams can progress to capabilities that apply artificial intelligence to spot anomalies and eliminate noise, or potentially do predictive analytics to identify impending issues before they actually occur.
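To illustrate decoupling an app from the log aggregation mechanism, here is a small Python sketch (the logger name and message are made up) that writes one structured JSON record per line to stdout and leaves shipping, aggregation and indexing entirely to the platform:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line,
    which log forwarders can parse and index without custom patterns."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# The app only knows about stdout; the platform owns everything downstream.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("payments").warning("upstream latency high")
```

Because the format is structured, pattern matching and alerting in the aggregation layer become simple field queries rather than brittle regexes.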
Closing thoughts
There is a lot to think about when running an application on a cloud platform, but the most important thing to remember is that it is a shared responsibility model. The CSPs publish service level availability metrics, but these really serve to tell you that the services will fail at some point and that your application needs to be ready to handle it. As an example, AWS says in its service level agreement for EC2 that it will use "commercially reasonable efforts" to make services available for each AWS region with a monthly uptime percentage of at least 99.99% (i.e. four nines). That sounds good, and it is, but it means you should be prepared for roughly 4.38 minutes of downtime in a region per month, or 52.6 minutes per year. That doesn't sound like much, but if services are timing out and you have escalating, cascading failures spreading to multiple tiers in your system, even a few seconds can feel like an eternity. The point is that you need to be ready for this with the right architecture, deployment topology and observability for your applications in the cloud. Once you have these things in place, a good way to validate resiliency is with chaos testing (in pre-production environments, anyway). It takes time, engineering discipline and experience to get it right, but if you do, then you won't be asking "whose fault is it?" when something goes wrong.
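The downtime figures above are a simple back-of-the-envelope calculation worth being able to reproduce; here is a small Python sketch (function name is illustrative) using an average-length month and year:

```python
def allowed_downtime_minutes(availability_pct, period_minutes):
    """Downtime budget implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_minutes


MINUTES_PER_MONTH = 30.44 * 24 * 60   # average Gregorian month
MINUTES_PER_YEAR = 365.25 * 24 * 60   # average year, including leap days

monthly = allowed_downtime_minutes(99.99, MINUTES_PER_MONTH)  # ~4.38 minutes
yearly = allowed_downtime_minutes(99.99, MINUTES_PER_YEAR)    # ~52.6 minutes
print(f"99.99% allows {monthly:.2f} min/month, {yearly:.1f} min/year of downtime")
```

Running the same arithmetic at 99.9% (three nines) yields about 43.8 minutes per month, which shows how much each additional nine tightens the budget your architecture has to absorb.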