Push the Limit
The perfect storm of problems can lead to a mega-disaster.

6 ways to prevent cascading failures

Working closely with a number of the world’s largest tech companies, including Google, and seeing the struggles each of them faces has given me an appreciation for the value of well-reasoned distributed system design patterns and operational practices. Many such practices seem perfectly sensible when suggested, but few large-scale systems actually employ them consistently in practice… until a massive service failure results from a preventable blind spot.

Massive system failures are rarely caused by a single simple stimulus. After all, we test for the things that are most likely to occur, and we probably have acceptable failure modes for those. The big failures tend to be caused by unexpected combinations of problems that happen together and lead to a special nightmare known as a cascading failure: a chain reaction of problems, set off by just the right kind of trigger, that causes a service to fall down systematically like a line of dominoes. It is an on-call technician’s worst-case scenario. When problems like this happen, and are finally resolved, leaders ask what can be done to make sure it does not happen again.

The answer can seem simple, yet it is challenging to achieve, which is why the preventative measures often never make it into the original design. I’ll give an example of one failure trigger of the type I’ve seen lead to two massive global outages in large mission-critical systems at different companies in the last couple of months. Let’s explore how that trigger can be handled so that the damage is contained before it leads to a systemwide cascading failure.

I call this one the quota trigger. Each resource your system depends on has a finite capacity: there is only so much disk space, CPU power, network bandwidth, and so on. In cloud environments, many resources can be allocated programmatically on demand, so that if you need more, an automated process can add it for you. The provisioned capacity of something is a hard limit, meaning that you need to add more of it before you run out in order to avoid exhausting that resource. Well-configured systems also place more conservative limits on such resources to prevent runaway use of something by mistake. That lower, soft limit is known as a “quota” in cloud environments. Think of it like the redline on a tachometer paired with an engine rev limiter.

The term quota dates back to file system resources in UNIX systems: each user is assigned a usage quota that limits the number of files and the total storage that one user can consume, and the administrator can adjust the limits per user based on need. Cloud providers do the same thing, but with limits on thousands of different types of resources and their use. Typical quotas control the total number of something you can use, or the rate of use per unit of time (like the number of messages sent per minute in an email system). The quota trigger occurs when a system malfunctions because an allocation it expects to succeed instead fails with an error, because the allowed limit of that resource has been exceeded.

Suppose the quota you reach is the total rate at which you are allowed to write new blocks of data to a storage resource. When the system is running normally it never approaches this limit, and all the data you expect to save gets stored. But suppose something causes your activity level to surge beyond the quota. What happens? The upstream system returns an error instead of saving your data, and your system needs to handle that condition.
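
To make the trigger concrete, here is a minimal sketch in Python of the point where the error first appears. The `QuotaExceededError` class and the `write_block` call are hypothetical stand-ins for whatever your storage provider actually raises or returns (often an HTTP 429 or a “rate limit exceeded” status), not a real SDK:

```python
class QuotaExceededError(Exception):
    """Hypothetical stand-in for a provider's 'write quota exceeded' error."""


def write_audit_record(storage, record: bytes) -> bool:
    """Attempt to persist one audit record; report whether it succeeded."""
    try:
        # Fails once the blocks/sec write quota has been exhausted.
        storage.write_block(record)
        return True
    except QuotaExceededError:
        # What the caller does next (block and retry forever, or back off
        # and degrade gracefully) decides whether this stays a blip or
        # becomes the cascading failure described below.
        return False
```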

Subsequent problems happen when critical components are blocked, waiting to retry, rather than processing more work. Think of this like a log jam in a creek: if the creek gets blocked by a big log gone sideways, it backs up everything floating behind it, like a dam. Suppose I have a resource that admits new user logins to an ecommerce checkout system, and it needs to keep a transaction log of every login attempt, recording the date, time, and outcome. The log is critical because it’s legally required by government regulations. If the logging process gets an error because it cannot write a log entry (the blocks/sec write quota has been reached), subsequent logins cannot be recorded until the logging system is done waiting and unblocks, like breaking that log jam free in the creek example. So the example cascading failure sequence is:

  • A huge promotion is announced by email to millions of consumers who flock to purchase this amazing deal all at once. Response is much higher than anticipated.
  • Elevated system usage trips the blocks/sec write limit, resulting in errors.
  • Retries of the write exacerbate the quota overrun.
  • Dependent login actions stack up waiting their turns to write their audit log entries.
  • The maximum thread concurrency of the system is reached with waiting/retrying clients.
  • New orders can no longer be created because all available threads are busy waiting to write to storage.

Note that the blocks/sec quota limit in this example is based on a known limitation of the underlying hardware: it is simply not possible for it to write more blocks than the limit allows and remain available. In this circumstance, the symptom observed by some users is “I can’t sign in to retrieve my saved addresses”, for others it is “I can’t view my shopping cart”, and for others still it is “I can’t check out”. Your support team hears from users and declares “the ecommerce system is down”. Your operations team is alerted and decides to restart the system, and because the write rate is still over the configured quota, it immediately jams up again, and the symptoms persist.

A well-designed system would have fault tolerance features to help prevent this outcome. In this case, I recommend:

  1. Configure error retries with an exponential backoff (progressively longer delays between retry attempts).
  2. Set a sensible limit on the total number of attempts per client, so that no client retries more than perhaps 10 times before giving up.
  3. Plan a degraded mode. Configure an alternate action to take instead of logging the event. Is there a secondary logging resource that could be used temporarily while the primary is unavailable?
  4. Employ a circuit breaker pattern. If one client is busy doing error retries, flip a “breaker” flag so that concurrent attempts by other users immediately take the alternate action, or return errors to the user. This keeps condition handling fast and keeps system concurrency low, so the blocked client has a chance to succeed. Use a periodic task to test that a write is possible again, and when it succeeds, reset the “breaker” flag to allow concurrent use. (A sketch combining items 1-4 follows this list.)
  5. Keep track of your utilization of the quota-controlled resource. Yes, monitor it, and alert when utilization is above the expected level by a significant margin, or within a set percentage of the configured limit. NOTE: If you have an automated process to increase the limit, or to provision more of the resource, trigger it when your alert levels are reached. (A small utilization check is sketched after the summary below.)
  6. Build automated testing into your CI/CD process to verify that your mitigations work, so you don’t suffer a future regression of the countermeasures.
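
To show how items 1-4 fit together, here is a minimal sketch that continues the hypothetical example from earlier (the same `QuotaExceededError` and `write_block` storage interface, plus an illustrative secondary log as the degraded mode). It is a simplification: a production version would add jitter to the backoff, make the breaker state thread-safe, and use the periodic probe task described in item 4 rather than a simple cool-off timer.

```python
import time

MAX_ATTEMPTS = 10              # item 2: bounded number of tries per client
BASE_DELAY_SECONDS = 0.1       # item 1: first retry delay; doubles each attempt
BREAKER_COOL_OFF_SECONDS = 30  # item 4: how long the breaker stays open


class CircuitBreaker:
    """Item 4: trip open after repeated failures, then fail fast while open."""

    def __init__(self):
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= BREAKER_COOL_OFF_SECONDS:
            self.opened_at = None   # cool-off elapsed: allow attempts again
            return False
        return True

    def trip(self) -> None:
        self.opened_at = time.monotonic()


breaker = CircuitBreaker()


def log_login_attempt(storage, fallback_storage, record: bytes,
                      max_attempts: int = MAX_ATTEMPTS,
                      base_delay: float = BASE_DELAY_SECONDS) -> None:
    """Write an audit record, degrading gracefully if the quota is exhausted."""
    # Item 4: while the breaker is open, skip straight to the degraded mode so
    # concurrent callers don't pile up waiting on a resource we know is failing.
    if breaker.is_open():
        fallback_storage.write_block(record)   # item 3: degraded mode
        return

    delay = base_delay
    for _ in range(max_attempts):              # item 2: bounded retries
        try:
            storage.write_block(record)
            return
        except QuotaExceededError:
            time.sleep(delay)                  # item 1: exponential backoff
            delay *= 2

    # Every attempt failed: open the breaker and take the degraded path.
    breaker.trip()
    fallback_storage.write_block(record)       # item 3: degraded mode
```

In a real deployment you would also cap the maximum backoff delay and record a metric each time the degraded path is taken, so operators can see how often the fallback is carrying load.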

In summary, for every critical function in your system, use exponential backoff and error retry limits, and have an alternate “degraded mode” action to take in the event that a system you depend on returns an error (or fails to respond within a sensible timeout). Use a circuit breaker to prevent runaway resource usage within your own system, particularly for concurrency-related resources such as threads or processes. Finally, give your operational teams automatic notice, and activate automated processes to expand system capacity.
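
Item 5 and that “automatic notice” can be as simple as a periodic check of usage against the configured quota. A minimal sketch follows, where `get_usage`, `get_quota`, `page`, and `request_quota_increase` are placeholders for your own monitoring, alerting, and provisioning hooks (no particular vendor API is assumed), and the thresholds are just examples:

```python
EXPECTED_BLOCKS_PER_SEC = 2_000   # example: what "normal" looks like for this system
SURGE_FACTOR = 2.0                # alert if usage is well above the expected level
QUOTA_ALERT_FRACTION = 0.8        # alert when within 20% of the configured limit


def check_write_quota(monitoring, alerting, provisioning) -> None:
    """Run periodically: warn before the quota trigger fires, not after."""
    usage = monitoring.get_usage("storage.blocks_written_per_sec")   # placeholder hook
    quota = monitoring.get_quota("storage.blocks_written_per_sec")   # placeholder hook

    if usage > EXPECTED_BLOCKS_PER_SEC * SURGE_FACTOR:
        alerting.page(f"write rate far above expected level: {usage:.0f} blocks/sec")

    if usage > quota * QUOTA_ALERT_FRACTION:
        alerting.page(f"write rate at {100.0 * usage / quota:.0f}% of quota")
        # If an automated process can raise the limit or add capacity,
        # trigger it here, before the hard limit is reached.
        provisioning.request_quota_increase("storage.blocks_written_per_sec")
```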

There are more sophisticated error handling techniques that you can try once you master the basic ones. For example, using a service mesh you can track an adaptive estimate of the effective limit for a resource within your system, along with your utilization of that resource. When utilization approaches the known limit, employ a pattern known as DROP_OVERLOAD. It works by admitting all of the work that fits under your known limit and rejecting the portion of work that does not fit. This way you continue to serve the maximum capacity your system can handle, returning errors only for the portion of work you can’t satisfy, rather than refusing all work during error conditions.
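
Here is a minimal sketch of that load-shedding idea, just the admission decision, not Envoy’s or any mesh’s actual DROP_OVERLOAD implementation. It assumes you already track the current request rate and an adaptive limit for the resource:

```python
import random


def admit_request(current_rate: float, adaptive_limit: float) -> bool:
    """Shed only the fraction of traffic that exceeds the known limit.

    Work under the limit is always admitted; the overflow portion is rejected
    probabilistically, so the system keeps serving its full known capacity
    instead of refusing all work once it is overloaded.
    """
    if current_rate <= adaptive_limit:
        return True
    drop_fraction = (current_rate - adaptive_limit) / current_rate
    return random.random() >= drop_fraction


# Example: at 1,500 requests/sec against a known limit of 1,000 requests/sec,
# roughly one third of requests are rejected quickly with an error while the
# rest are served normally.
```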

Generally speaking, I don’t think it’s a good idea to put all of this cross-system error handling into your services directly. Doing so makes your services substantially more complex, and therefore more difficult to maintain over time. A better approach is to keep your services as simple as possible and use a dedicated intermediate system to handle error retries, exponential backoff, circuit breakers, and even DROP_OVERLOAD. That way all of your services can benefit from the additional resiliency. Also, if you have a pool of workers all contributing to a portion of your workload, you can more easily track the global utilization level of a given resource across all those workers, rather than trying to estimate it from the local activity level of a single worker multiplied by the number of workers (which may later change, throwing the calculation off). Istio is an example of a service mesh where you can implement resiliency patterns like these in a logically centralized way, with fully distributed and independent execution of the safeguards.

Using a mesh allows you to employ many of these techniques between various web systems and HTTP(S)-oriented services without modifying any of your existing software. You’ll still want to put reliable basic error handling into every service that may be missing it, but a service mesh can help you add resiliency to systems that you cannot modify, either because they are licensed in executable form from a software vendor, because you no longer employ staff with the right expertise to modify them, or because you no longer have the source code used to build them in the first place. These things happen all the time, especially in large organizations with complex aging infrastructure.

Think about each of the hard and soft limits of the resources that affect all instances of your software systems together as a group (zonally, regionally, and globally), and ask what would happen if you exhausted that limit. What degraded mode could be used for each? Consider using a service mesh to artificially inject errors to simulate what would happen if various limits were reached, so you can increase confidence that they won't trigger cascading failures, and that your desired degraded mode is successfully employed. Don't have one yet? Maybe add it to your tech debt backlog as a priority.
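
Item 6, and the fault-injection idea above, can be exercised even without a mesh by faking the overloaded dependency in a test. A minimal sketch, reusing the hypothetical `log_login_attempt` and `QuotaExceededError` from the earlier sketches (with the backoff delay set to zero so the test runs quickly):

```python
class AlwaysOverQuota:
    """Fake primary storage: every write fails as if the quota were exhausted."""
    def write_block(self, record: bytes) -> None:
        raise QuotaExceededError()


class RecordingStorage:
    """Fake secondary storage: remembers everything written to it."""
    def __init__(self):
        self.records = []

    def write_block(self, record: bytes) -> None:
        self.records.append(record)


def test_degraded_mode_is_used_when_quota_is_exhausted():
    fallback = RecordingStorage()
    log_login_attempt(AlwaysOverQuota(), fallback, b"login: alice, ok",
                      base_delay=0.0)
    # The mitigation holds if the record landed in the degraded-mode target
    # instead of being dropped or blocking indefinitely.
    assert fallback.records == [b"login: alice, ok"]
```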

Follow me on: LinkedIn, Twitter

Ruturaj Doshi

Software Quality Practitioner with DevOps experience, Business domain Enterprise Cloud Management Product

3y

Dear Adrian, I liked your to-the-point write-up. I had a different perspective on prevention: how about a microservice that acts as an observer and flushes irrelevant/stuck requests and I/O based on time SLAs, logging only in the microservice that was late?

Jerzy Foryciarz

Group Product Manager -Google Cloud Platform

3y

Another case I often see happen is a set of stacked services independently autoscaled on CPU utilization. When the front one fails, all subsequent services scale down to the minimum. Then, when the front service gets fixed, the increase in traffic is beyond the scale readiness of the downstream ones. More often than not the whole system goes into a spiral of death. We recommend using schedule-based autoscaling as a second signal to HPA, or going with a custom metric derived from the traffic of the frontend service.

Sachin Lakhanpal

Engineering Manager, Ads Infrastructure

3y

Very crisply written article!

This was a super insightful read! I learned a lot!! Thank you so much for posting, Adrian!
