Fail Solo - A Fault Tolerance Story
The genesis of this article was Zafar’s comment on my previous one (so, thanks Zafar). Since that article only touched very briefly on the need to isolate failures, this one will attempt to cover the ground in detail.
Let us go back to an application X running on a Tomcat server and serving high throughput – for a moment, assume it is serving 200 requests per second. Our hypothetical server also connects to a database to retrieve data that it processes to create its response. For simplicity, assume that one of the requests X is serving becomes faulty. In the real world it will usually be a bunch of requests rather than a single isolated one, but a single faulty request is a simpler scenario for us to discuss and understand.
What exactly do we mean by a faulty request? A faulty request, for the purpose of this article, is a request that takes at least two orders of magnitude longer to serve than the average; in simple terms, a faulty request takes 100 seconds when the average request takes a second. There are many reasons a request may become faulty: it may be making an expensive query to the database, or a network issue may have added high latency. We do not care about the reason; it is enough to understand that there will always be some faulty request – bad code, weak infrastructure or some other cause that we could not even identify.
Given that such a faulty request exists, what is the best-case scenario for us? No, the answer is not to avoid creating such requests through better code review or performance testing; something will always slip through in any high-velocity team.
The best-case scenario is Fail Solo; i.e., this request will take 100 seconds but will not impact the other requests in any way. The worst-case scenario is a domino effect: this one faulty request leads to multiple other requests failing, which lead to yet more requests failing, and so on, until your server is just spinning without serving any request correctly. In the real world a single faulty request does not create a domino effect, but a bunch of requests can, and often do.
The goal of infrastructure design is to get as close to the Fail Solo scenario as possible. You can never hope to reach it fully; but, if you design your system carefully, you will minimize the number of other requests that get impacted. Let us call this number the Cascade Factor (CF); our goal will always be to keep CF as close to zero as possible.
Why is keeping CF at zero immensely hard? There are two ways faults cascade:
1. Dependency or Vertical cascade – another service may use service X, so the faulty request leads to a higher response time in the request that called X. Such vertical cascades are fairly common now, as we have multiple layers of services invoking each other.
2. Peer or Horizontal cascade – the faulty request holds on to some resources during its execution; peer requests get blocked waiting for those resources to free up and hence become faulty themselves (a toy sketch of this follows the list).
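To make the horizontal cascade concrete, here is a toy Java sketch (the pool size, request counts and sleep times are invented for illustration, not taken from the article): a handful of faulty 100-second requests occupy every worker in a shared pool, and perfectly healthy 1-second requests queue behind them and look faulty to their callers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy illustration of a horizontal (peer) cascade: all requests share one small
// worker pool, so a few "faulty" 100-second requests starve the fast ones.
public class PeerCascadeDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService sharedPool = Executors.newFixedThreadPool(4);

        // Four faulty requests grab every worker in the shared pool.
        for (int i = 0; i < 4; i++) {
            sharedPool.submit(() -> sleepSeconds(100));   // the "faulty" requests
        }

        // Healthy 1-second requests now queue behind them and also look faulty
        // to the caller, even though nothing is wrong with them.
        for (int i = 0; i < 8; i++) {
            final int id = i;
            long enqueuedAt = System.nanoTime();
            sharedPool.submit(() -> {
                long waitedMs = (System.nanoTime() - enqueuedAt) / 1_000_000;
                System.out.println("healthy request " + id + " waited " + waitedMs + " ms before starting");
                sleepSeconds(1);
            });
        }

        sharedPool.shutdown();
        sharedPool.awaitTermination(5, TimeUnit.MINUTES);
    }

    private static void sleepSeconds(int s) {
        try {
            TimeUnit.SECONDS.sleep(s);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```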
Circuit breakers take care of the vertical cascade, whereas timeouts and separate thread and connection pools are typically used to take care of the peer cascade. Every technique for achieving Fail Solo is based on the principle of isolation: even if a faulty request is taking 100 seconds, upstream callers will open the circuit after 2 or 5 seconds. Separate thread pools are created for every downstream dependency (e.g., one pool of workers fetching data from MySQL and a separate one fetching from Redis; separate pools of workers and connections for the listing page versus the item details page versus the review page). The Fail Solo principle also applies at the database layer: long-running queries should get terminated automatically.
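As a rough illustration of these isolation techniques – not the implementation of any particular library – here is a minimal Java sketch that combines per-dependency thread pools (bulkheads), call timeouts and a tiny hand-rolled circuit breaker. The pool sizes, the 2-second budget and the breaker thresholds are assumed values for the example; in practice you would likely reach for a library such as Resilience4j or Hystrix rather than rolling your own.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// A minimal isolation sketch, not production code: each downstream dependency gets
// its own bounded worker pool (a bulkhead), every call has a timeout, and a very
// small circuit breaker stops hammering a dependency that keeps failing.
public class IsolationSketch {

    // --- a deliberately tiny circuit breaker -------------------------------
    static class CircuitBreaker {
        private final int failureThreshold;
        private final long openMillis;
        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private final AtomicLong openedAt = new AtomicLong(0);

        CircuitBreaker(int failureThreshold, long openMillis) {
            this.failureThreshold = failureThreshold;
            this.openMillis = openMillis;
        }

        boolean allowRequest() {
            long opened = openedAt.get();
            if (opened == 0) return true;                         // closed: let it through
            if (System.currentTimeMillis() - opened > openMillis) {
                openedAt.set(0);                                  // half-open: try again
                consecutiveFailures.set(0);
                return true;
            }
            return false;                                         // open: fail fast
        }

        void recordSuccess() { consecutiveFailures.set(0); }

        void recordFailure() {
            if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                openedAt.set(System.currentTimeMillis());         // trip the breaker
            }
        }
    }

    // --- bulkheads: one pool per dependency --------------------------------
    private final ExecutorService mysqlPool = Executors.newFixedThreadPool(20);
    private final ExecutorService redisPool = Executors.newFixedThreadPool(20);
    private final CircuitBreaker mysqlBreaker = new CircuitBreaker(5, 5_000);

    public String fetchFromMysql(String query) {
        if (!mysqlBreaker.allowRequest()) {
            throw new RuntimeException("MySQL circuit open, failing fast");
        }
        try {
            String result = callWithTimeout(mysqlPool, () -> runMysqlQuery(query), 2);
            mysqlBreaker.recordSuccess();
            return result;
        } catch (RuntimeException e) {
            mysqlBreaker.recordFailure();
            throw e;
        }
    }

    public String fetchFromRedis(String key) {
        // Redis calls go through their own pool; a MySQL meltdown cannot starve them.
        return callWithTimeout(redisPool, () -> runRedisLookup(key), 2);
    }

    // Fail Solo in miniature: if the dependency is slow, give up after
    // timeoutSeconds and fail only this request instead of holding a worker
    // (and its caller) hostage for 100 seconds.
    private <T> T callWithTimeout(ExecutorService pool, Callable<T> task, int timeoutSeconds) {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                                  // free the worker if possible
            throw new RuntimeException("dependency timed out", e);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("dependency call failed", e);
        }
    }

    // Placeholders for the real dependency calls. On the database side, JDBC's
    // Statement.setQueryTimeout(seconds) is the standard way to have long-running
    // queries terminated automatically, mirroring the same principle in the DB layer.
    private String runMysqlQuery(String query) { return "row"; }
    private String runRedisLookup(String key) { return "value"; }
}
```

The key design choice is that each dependency fails in its own compartment: the MySQL pool can be completely exhausted by a bad query while Redis lookups and the rest of the request path keep working.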
All these measures together are what make your application robust; a few rogue requests, a spike in the core network, an ad-hoc query on the database – none of these should bring down your application. You cannot save the offending requests; they will fail, and rightly so.
This brings us back to serverless. A truly serverless service should implement Fail Solo and keep CF at or below 1. Think of this as the R-factor of Covid: if one faulty request creates more than one new faulty request, your overall system is bound to collapse after a while. However, if CF is less than 1, a few requests will fail but the system will stabilize itself. If a service claims to be serverless but does not keep CF below 1, do not consider it serverless; it is just marketing hype created to ride the wave.
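The R-factor analogy can be made concrete with a toy branching model (the CF values and generation count below are made up for the demo): if every faulty request spawns, on average, CF new faulty requests, the total fallout from a single bad seed is the geometric series 1 + CF + CF² + …, which settles near 1/(1 − CF) when CF < 1 and grows without bound when CF ≥ 1.

```java
// Toy branching model of the Cascade Factor: every faulty request causes, on
// average, CF new faulty requests in the next "generation". The total converges
// for CF < 1 and blows up for CF >= 1.
public class CascadeFactorDemo {
    public static void main(String[] args) {
        for (double cf : new double[] {0.5, 0.9, 1.1}) {
            double faultyThisGeneration = 1.0;    // the single original faulty request
            double totalFaulty = 0.0;
            for (int generation = 0; generation < 50; generation++) {
                totalFaulty += faultyThisGeneration;
                faultyThisGeneration *= cf;       // each fault spawns CF new faults
            }
            // For CF < 1 this approaches 1 / (1 - CF); for CF >= 1 it keeps growing.
            System.out.printf("CF = %.1f -> ~%.1f faulty requests from one bad seed%n",
                    cf, totalFaulty);
        }
    }
}
```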
Making a service truly serverless is really hard, more so for databases. It has to be designed from the ground up: the quorum strategy between the peers that keep copies of the data, how you handle partition tolerance, and the fundamental trade-offs you have made. For example, most truly serverless stateful services will not be strongly consistent; they settle for eventual consistency so they can maintain availability and handle partitions in the cluster (remember the CAP theorem). Stateless services are simpler to make serverless; simpler, but not trivial. The bottom line is the Cascade Factor (CF); no one will ever tell you what it is. You will just have to dig deeper and come back with an answer, and that is the fun part of this job, right?
DynamoDB is a pretty good example of a truly serverless service (look at the Dynamo SOSP 2007 paper and try to connect the dots); or maybe that deserves a separate article in itself.