Building High-Quality Cloud Solutions (Part 4) - Reliability

While the end-users of most technology products often carry the stigma of being fickle, they can also be quite forgiving. There is a general understanding that technology, for all its glitz and glam, is not perfect, so users can overlook the occasional bump in the road. As a service provider, the last thing you want is for your users to be unable to access the service, or worse, to stop trusting it. However, it's not enough to work at preventing failures; an equal amount of effort needs to go into properly recovering from failures as well.

In Part 3 of this series we discussed how crucial it is to understand where and how latency can be introduced into your system, given how modern applications are composed to be distributed. For end-users, the lines have blurred between services being completely offline and not being responsive enough, leading to the catchphrase, "slow is the new down." Now that you have leveraged the proper resources to make your system perform as efficiently as you currently can, we need to make sure that when unavoidable hiccups do happen, they are remediated quickly so business can continue as usual.

Properly remediating hiccups plays a significant role in instilling stakeholders' confidence in the system. That role is just one of several represented by a specific tenet within a standard architecture framework supported by the major cloud platform providers. Each of those tenets helps to form the complete picture in determining the proper direction to take on your cloud journey. As discussed previously, the five tenets are:

  • Cost Optimization
  • Operational Excellence
  • Performance Efficiency
  • Reliability
  • Security

In this fourth article of the series, we will focus on Reliability. If the experience of your service is perpetually degraded, whether because it is constantly offline or because severe data fragmentation has made it untrustworthy, you will lose the confidence of your stakeholders. My goal is to help you examine the factors that can lead to that degraded state so that they can be properly accounted for.

While the concepts presented are universal, the specific examples I lay out will be centered around Microsoft's Azure cloud platform, mainly because that aligns with the breadth of my experience and recent assessments.

Technology Should “Just Work”

I have “grown up” on the web (professionally speaking), so reliability has always been one of those make-or-break attributes of any initiative I've been involved in. My first industry job was at an eCommerce company, where any amount of downtime could lead to a direct loss in revenue due to missed sales. An inability to return the system to the correct state would lead to even more problems. Even in the earliest days of the World Wide Web, users expected to go to a website and get a response, even if they needed to be a little patient (28/56k modems, anyone?). In those days, reliability concerns for desktop applications were significantly different from today's. Instead of having data backed up and served over the web as we do now, if your machine crashed with all your data stored locally, you lost everything.

The 2 Levers of Reliability

In today’s world of modern applications, reliability concerns are virtually the same as they have always been for web-based systems:

  • Availability – The ability of your users to access the system when needed. The system doesn't even need to be offline to be considered unavailable; it could be throwing unrecoverable errors or responding so slowly that users cannot accomplish their tasks.
  • Resiliency – The ability of the system to recover from failures and continue to function. While this may sound very similar to availability, the nuance lies in whether the state of the system is properly recovered.

Continuing with my eCommerce tales, let's say you have a blow-up with the backend order processing system, the heart of the entire operation, because if you can't process orders, you can't make money. The system goes completely offline and is brought back online within a few minutes; however, any work performed during that downtime is lost. This means new orders are missed completely, and orders that were mid-processing run a significant risk of being restored to an invalid state, potentially leading to issues like customers being over- or undercharged, or orders being fulfilled incorrectly because they are missing data.

A system with the proper level of resilience in that situation would have facilities in place to make sure that when availability becomes a problem, issues don't cascade to the point where the state of the whole system is compromised. Not a single entity on this planet is immune to availability issues; the services that seem immune have simply sunk a lot of resources into reliability and built their systems so that there is no single point of failure. So it is quite possible that a global service you consume has had availability issues; they just happened in a limited geography and impacted users other than you.

Reliability, like many other things, is a balancing act. It is possible to over-extend or waste your resources trying to create the most stable system in existence. The most important point here is to understand the availability requirements of your business and take the appropriate tactical mitigations that align with the strategy to meet those requirements.

Food for Thought

Here are a few questions to get you thinking about how to handle things in your current environment or how you can set things up in a desirable way from the start (if you are considering the move):

  1. What reliability targets have you defined for your application? Availability targets, such as Service Level Agreements (SLA) and Service Level Objectives (SLO), and Recovery targets, such as Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), should be defined and tested to ensure application reliability aligns with business requirements.
  2. How are you handling disaster recovery for this workload? Disaster recovery is the process of restoring system functionality in the wake of a catastrophic failure. It might be acceptable for some systems to be unavailable or partially available with reduced functionality for a period, while other systems may not be able to tolerate reduced functionality.
  3. How does your application logic handle exceptions and errors? Resilient applications should be able to automatically recover from errors by leveraging modern cloud application code patterns, including, but not limited to, request timeouts for managing inter-component calls and retry logic with appropriate back-off strategies for handling transient failures.
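To make question 1 concrete, it helps to translate an availability target into the downtime it actually permits. The sketch below is purely illustrative arithmetic (it assumes a simple 30-day month and ignores maintenance windows or SLA fine print), but it shows why "just one more nine" changes the conversation:

```python
# Translate an availability target (SLA/SLO) into a monthly downtime budget.
# A flat 30-day month is assumed for illustration.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_budget_minutes(availability_pct):
    """Minutes of downtime allowed per month at a given availability %."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/month allowed")
```

A 99% target tolerates over seven hours of downtime a month, while 99.99% leaves you less than five minutes; the recovery tooling those two budgets demand is very different.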
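The retry-with-back-off pattern from question 3 can be sketched in a few lines. This is a minimal, library-free illustration: the `flaky_call` service and its `ConnectionError` failure mode are stand-ins you would replace with your real client and its transient exception types:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a zero-argument callable with exponential back-off and jitter.

    Transient failures are assumed to raise ConnectionError; adjust the
    exception type(s) to match your client library.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            # Exponential back-off (base, 2x, 4x, ... capped) plus jitter,
            # so a herd of clients doesn't retry in lockstep.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))

# Simulated service that fails twice with a transient error, then succeeds.
calls = {"count": 0}
def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry_with_backoff(flaky_call, base_delay=0.01)
print(result, "after", calls["count"], "attempts")
```

In a real workload you would pair this with request timeouts so a hung call fails fast enough to be retried at all.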

These three things are far from exhaustive, but the hope is that they inspire you to approach your reliability standards more thoroughly and with more confidence.

Quick Wins

Interested in looking like a hero? Here are a few areas you can explore to increase the reliability of your systems through consistent availability and resiliency:

Local and Geo Redundancy have never been easier thanks to the rise of cloud providers like Azure, AWS, and GCP. Even if you never leased data center space for your own hardware and instead leveraged a managed hosting provider, having them provision more hardware for you took time. Cloud providers allow you to spin up local redundancy (multiple instances in the same data center) within seconds and geographic redundancy (multiple instances across regions) within minutes. With the right architecture, your system could be running on servers on the East Coast and West Coast without anyone leaving their desk.

Health Monitoring goes hand-in-hand with operational excellence but takes it a step further where reliability is concerned. Having proper health monitoring in place will not only increase your system's observability from an operational perspective, but it will also open the door for additional facilities like auto-healing and process recovery when an unhealthy state is detected. Leveraging tools like Azure Service Health events and Azure Resource Health events, as well as your own system's tooling, can go a long way in minimizing the need for human intervention when certain issues arise.
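The hook that makes auto-healing possible is usually a health endpoint the platform can probe. Below is a minimal, stdlib-only sketch of that idea; the `check_database` and `check_message_queue` functions are hypothetical placeholders for your real dependency probes, and the self-request at the end exists only to demonstrate the endpoint:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency probes; replace with real checks
# (database ping, broker connectivity, queue depth, etc.).
def check_database():
    return True

def check_message_queue():
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        checks = {"database": check_database(), "queue": check_message_queue()}
        healthy = all(checks.values())
        # 200 tells the platform to keep routing traffic; 503 signals an
        # unhealthy instance so it can be restarted or pulled from rotation.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"healthy": healthy, "checks": checks}).encode())

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Bind to an ephemeral port and serve on a background thread for the demo.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{port}/healthz") as resp:
    status = resp.status
    body = json.loads(resp.read())
print(status, body)

server.shutdown()
```

Platform features such as App Service health checks can then probe a path like `/healthz` and act on the status code for you.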

Leverage Platform as a Service (PaaS) instead of Infrastructure as a Service (IaaS), as high availability and similar capabilities come out of the box at this level of service. Having the correct PaaS configuration in place, which is much simpler than configuring IaaS, will significantly reduce your concerns over availability. On the other side of that, configuring basic backups will go a long way in establishing resiliency; with services like Azure App Service and Azure SQL Database, establishing and managing those backups becomes that much simpler as well. Then there is the potential cost savings of moving from IaaS to PaaS.

Queue-based Load Leveling is one of the best things you can do for a workload, especially one that deals with a high level of unpredictability in its usage patterns. Use a queue to act as a buffer between a task and a service it invokes, smoothing out intermittent heavy loads that can cause the service to fail or the task to time out. This can help to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service. Introducing the queue will require some refactoring of the workload, so it is up to you to determine whether the level-of-effort trade-off is worth it, but I'm willing to bet that it is.
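The pattern above can be sketched with nothing more than an in-process queue. This toy version uses Python's stdlib `queue.Queue` in place of a real broker such as Azure Service Bus or Storage Queues; the 50-order burst and the worker's simulated processing delay are invented for the demonstration:

```python
import queue
import threading
import time

# The queue decouples bursts of incoming work (producers) from the rate
# the backing service can actually sustain (the worker), absorbing peaks
# instead of letting them overwhelm the service or time out callers.
work_queue = queue.Queue()
processed = []

def worker():
    while True:
        order = work_queue.get()
        if order is None:  # sentinel: shut the worker down
            break
        time.sleep(0.005)  # simulate the service's steady processing rate
        processed.append(order)
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# A burst of 50 orders arrives almost instantly; enqueueing is cheap,
# so the producers return immediately while the backlog drains steadily.
for order_id in range(50):
    work_queue.put(order_id)

work_queue.join()  # wait for the worker to drain the backlog
work_queue.put(None)
print(f"Processed {len(processed)} orders at the service's own pace")
```

Swapping the in-process queue for a durable broker adds the resiliency half of the story: queued work survives a crash of either the producer or the worker.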

Are You Fort Knox or Swiss Cheese?

You now see how availability and resiliency go hand-in-hand and why it's not enough to solely work towards preventing failures; you must properly recover from them as well. Failures will happen no matter how much you try to prevent them, because we are only human. But just because they happen doesn't mean your end-users need to be affected by them. Proper remediation covers the gaps left behind when the occasional failure does happen.

At this point we have our costs under control, we are excelling in our operations, and our services are performing efficiently and reliably. But before we reach the state of technical nirvana, there is one last question we need to ask ourselves: are our services built like Fort Knox, or are they as porous as Swiss cheese? In the fifth, and final, part of this series we will discuss defense in depth and how proper security caps everything we have discussed to date.

Reflecting on the questions posed and the quick wins provided in this article, how much of this have you experienced already? If none of this was new to you, congratulations, because you are well on your way to a highly reliable system! However, if any of this was new, I challenge you to revisit your approach to reliability, start digging into your high-risk areas, and take steps to mitigate them.
