Fail Solo – A Fault Tolerance Story

The genesis of this article was Zafar's comment on my previous one (so, thanks Zafar). Since that article only touched very briefly on the need to isolate failures, this one will attempt to cover the ground in detail.

Let us go back to an application X running on a Tomcat server serving high throughput – for a moment, assume it is serving 200 requests per second. Our hypothetical server also connects to a database to retrieve some data that it needs to process to create its response. For simplicity, assume that one of the requests that X is serving becomes faulty. In the real world, it will usually be a bunch of requests and not a single isolated request; but the single faulty request is a simpler scenario for us to discuss and understand.

What exactly do we mean by a faulty request? A faulty request, for the purpose of this article, is a request that takes at least two orders of magnitude longer to serve than the average; in simple terms, a faulty request will take 100 seconds if the average request takes a second. There are many reasons a request may become faulty; it may be making an expensive query to the database, or a network issue may have introduced high latency. We do not care about the reason; it is sufficient to understand that there will always be some faulty requests – bad code, weak infrastructure or some other reason that we could not even identify.

Given that such a faulty request exists, what is the best case scenario for us? No, the answer is not to avoid creating such a request through better code review or performance testing; something will always slip through in any high-velocity team.

The best case scenario is Fail Solo; i.e., this request will take 100 seconds but it will not impact the other requests in any way. The worst case scenario is a domino effect; this one faulty request leads to multiple other requests failing, which in turn lead to other requests failing, and so on, till your server is just spinning without any requests being served correctly. In the real world, a single faulty request does not create a domino effect; but a bunch of requests can, and often do.

The goal of infrastructure design is to get as close to the Fail Solo scenario as possible. You can never hope to reach it fully; but, if you design your system carefully, you will minimize the number of other requests that get impacted. Let us call this number the Cascade Factor (CF); our goal will always be to keep CF as close to zero as possible.

Why is keeping a CF of zero immensely hard? There are two ways faults cascade:

1. Dependency or vertical cascade – Another service may use service X, and hence the faulty request will lead to a higher response time in the request that called X. Such vertical cascades are fairly common now, as we have multiple layers of services that invoke each other.

2. Peer or horizontal cascade – The faulty request will take away some resources during its execution; peer requests will get blocked waiting for the resource to become free, and hence the peer requests will also become faulty (a short sketch of this is given below).
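To make the horizontal cascade concrete, here is a small, self-contained Java sketch. The class name, pool size and timings are made-up illustrations (and scaled down from the article's 100-second example so the demo finishes quickly): because every request shares one worker pool, a couple of slow requests occupy all the workers, and healthy ~100 ms requests end up reporting multi-second latencies.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy illustration of the peer (horizontal) cascade with made-up numbers:
// all requests share one small pool, so two slow "faulty" requests tie up
// every worker and healthy peers wait behind them, looking faulty themselves.
public class HorizontalCascadeDemo {
    public static void main(String[] args) {
        ExecutorService sharedPool = Executors.newFixedThreadPool(2);

        // Two faulty requests: each holds a worker for 10 seconds
        // (standing in for the 100-second request in the article).
        for (int i = 0; i < 2; i++) {
            sharedPool.submit(() -> sleepMillis(10_000));
        }

        // Healthy peer requests: each needs ~100 ms of work, but every
        // shared worker is already occupied by a faulty request.
        for (int i = 0; i < 5; i++) {
            final int id = i;
            final long enqueuedAt = System.nanoTime();
            sharedPool.submit(() -> {
                sleepMillis(100);
                long waitedMs = (System.nanoTime() - enqueuedAt) / 1_000_000;
                System.out.println("request " + id + " finished after " + waitedMs + " ms");
            });
        }
        sharedPool.shutdown();
    }

    private static void sleepMillis(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```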

Circuit breakers take care of the vertical cascade, whereas timeouts and separate thread and connection pools are typically used to take care of the peer cascade. Every technique for achieving Fail Solo is based on the principle of isolation: even if a faulty request takes 100 seconds, upstream callers will open the circuit after 2 or 5 seconds. Separate thread pools are created for every downstream dependency (e.g., one pool of workers getting data from MySQL and another getting data from Redis; separate pools of workers and connections for the listing page versus the item details page versus the review page). The Fail Solo principle also applies at the database layer; long-running queries should be terminated automatically.
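Below is a minimal Java sketch of the bulkhead-plus-timeout idea from the paragraph above. The pool sizes, timeout budgets, class and method names are illustrative assumptions, not taken from the article: each downstream dependency gets its own pool, and the caller waits only a bounded time, so a faulty MySQL call can neither starve Redis calls nor hold the request hostage for 100 seconds.

```java
import java.util.concurrent.*;

// A minimal bulkhead-plus-timeout sketch (sizes and names are illustrative).
public class BulkheadSketch {
    // Separate pools act as bulkheads: exhausting one leaves the other untouched.
    private final ExecutorService mysqlPool = Executors.newFixedThreadPool(20);
    private final ExecutorService redisPool = Executors.newFixedThreadPool(20);

    public String loadItemDetails(String itemId) {
        Future<String> rowFuture   = mysqlPool.submit(() -> queryMysql(itemId));
        Future<String> cacheFuture = redisPool.submit(() -> queryRedis(itemId));

        String cached = awaitOrFail(cacheFuture, 200);   // Redis should answer fast
        if (cached != null) return cached;
        return awaitOrFail(rowFuture, 2000);              // give MySQL a bit longer
    }

    // Wait a bounded time; a request that blows through its budget fails solo.
    private String awaitOrFail(Future<String> future, long timeoutMillis) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // ask the worker to stop (if the task honors interruption)
            return null;
        } catch (InterruptedException | ExecutionException e) {
            return null;
        }
    }

    // Placeholders for the real data-access code.
    private String queryMysql(String itemId) { return "row-for-" + itemId; }
    private String queryRedis(String itemId) { return null; }
}
```

In practice you would rarely hand-roll the circuit-breaker half of this; libraries such as Hystrix or Resilience4j implement it, and the sketch above only covers the bulkhead and timeout portion.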

All these actions together are what make your application robust; a few rogue requests, a spike in the core network, an ad-hoc query made on the database – none of these should bring down your application. You can't save the offending requests; they will fail, and rightly so.

This brings us back to serverless. A truly serverless service should implement Fail Solo and keep its CF below 1. Think of this as the R-factor of Covid: if one faulty request creates more than one new faulty request, your overall system is bound to collapse after a while. However, if your CF is less than 1, a few requests will fail but the system will stabilize itself. If a service claims to be serverless but does not keep CF below 1, do not consider it serverless; it is just marketing hype created to ride the wave.
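The R-factor analogy can be made precise with a quick back-of-the-envelope calculation (my addition, not spelled out in the article): if every faulty request spawns, on average, CF new faulty requests, the total fallout from a single bad request is a geometric series.

```latex
% Expected total number of faulty requests triggered by one initial fault,
% assuming each faulty request spawns CF new faulty requests on average:
\[
  1 + CF + CF^{2} + CF^{3} + \dots =
  \begin{cases}
    \dfrac{1}{1 - CF} & \text{if } CF < 1 \text{ (a bounded blip; the system stabilizes)}\\[4pt]
    \text{divergent}  & \text{if } CF \ge 1 \text{ (the domino effect)}
  \end{cases}
\]
```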

Making a service truly serverless is really hard; more so for databases. It has to be designed ground up: the quorum strategy between the peers that keep copies of the data, how you handle partition tolerance, and the fundamental trade-offs you have made. For example, most truly serverless stateful services will not be strongly consistent; they settle for eventual consistency to ensure they can maintain availability and handle partitions in the cluster (remember the CAP theorem). Stateless services are simpler to make serverless; simpler, but not trivial. The bottom line is the Cascade Factor (CF); no one will ever tell you what it is. You will just have to dig deeper and come back with an answer; and that is the fun part of this job, right?

DynamoDB is a pretty good example of a truly serverless service (look at their SOSP 2007 paper and try to connect the dots); or, maybe it deserves a separate article in itself.

Zafar Ahmed Ansari

Senior Computer Scientist

Thanks Akshat for delving into the principle of isolation and how that relates to CF. It is very interesting to imagine how it would have been implemented. I would go through the SOSP 2007 paper and try to connect the dots, but indeed your article does set up the high-level background of the concept. A related concept could be backpressure from the Reactive Manifesto (https://www.reactivemanifesto.org/glossary#Back-Pressure), which provides feedback for the downstream layers to react to. Backpressure could help in implementing fail-fast resiliency for the vertical cascade that you mentioned.

Surabhi Agarwal

Senior Test Engineer at Google

Thank you for the article, Akshat. Reminded me of the cascaded address formatter threads that had caused the feasibility requests to become faulty!
