Failproof micro-service: Retry Strategy for intermittent failures
Navjot Bansal
Building Computer Vision Systems @Oracle | Software Architecture | System Design | ICPC Regionalist
This post is a continuation of Creating a Failure Resilient Application. I highly recommend reading that article before continuing here.
Microservices are built to handle high load. But, as with humans, everything has its limit.
Suppose we are managing the servers for IRCTC and the Auth Service is at its limit because of the Tatkaal window. Any new user trying to log in waits in the request queue and eventually fails with a timeout.
You hit the login button again. It fails again.
Before throwing an error, every service retries to the best of its capabilities until it hits its limit, as shown in the diagram below.
Simple Retry
In situations like this, it's essential to work out how we can make life easier for users. Retrying infinitely would not only frustrate a single user but could also bring down the whole server by exhausting its resources.
We can start by controlling the congestion in the request queue.
Intuition
"Slow down, give it some rest, leave it for some time".
We really do this.
In order to achieve this, we make use of a simple algorithm called Exponential backoff.
Exponential-backoff
Overview
After every failed retry, we multiply the wait time before the next attempt by a constant factor, generally kept at 2.
We do not retry indefinitely; a maximum retry count bounds the number of attempts, as shown below.
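To make the overview concrete, here is a minimal sketch of a bounded retry loop with exponential backoff. The function name `retry_with_backoff` and its parameters are hypothetical, chosen only for illustration, and the injectable `sleep` argument is an assumption added so the loop can be exercised without actually waiting.

```python
import time


def retry_with_backoff(operation, max_retries=5, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call operation(); on failure wait, then retry with a growing delay.

    All names and defaults here are illustrative assumptions, not a
    reference implementation.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):  # first call + max_retries retries
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)     # give the service some rest
            delay *= factor  # wait longer before the next attempt
```

A caller would wrap the flaky call, for example `retry_with_backoff(lambda: login(user))`, where `login` stands in for whatever request is being retried. In real code it is usually better to catch only the errors that are worth retrying, such as timeouts, rather than every exception.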
Algorithm
Consider exponential backoff with jitter using a base time of 1 second, an exponent of 2, and a maximum wait time of 30 seconds between calls.
Before jitter is applied, the wait time grows exponentially as 1, 2, 4, 8, 16 seconds and is then capped at 30 seconds for every further retry until the maximum number of retries is reached.
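The capped schedule described above can be reproduced in a few lines. This is only a sketch: `backoff_delays` is a hypothetical helper, and its defaults mirror the numbers in the text (base of 1 second, exponent of 2, cap of 30 seconds).

```python
def backoff_delays(base=1.0, factor=2.0, cap=30.0, retries=7):
    """Return the wait time in seconds before each retry, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(retries)]


print(backoff_delays())
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

The sixth value would be 32 seconds without the cap, which is why it is clamped to 30.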
Problems with Exponential Backoff? Adding Jittering
What if requests from multiple users fail at the same time?
It would mean that all the users' retries happen at the same instant.
We hit the same congestion problem again, and a practical way out is to add some randomness to the wait time.
This idea of adding randomness to the exponential backoff sleep time is called jittering.
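There are several ways to add the randomness. The sketch below uses one common variant, often called full jitter, where each client waits a uniformly random time between 0 and the capped exponential delay. The helper name `jittered_delay` and the `rng` parameter are assumptions made for this example.

```python
import random


def jittered_delay(attempt, base=1.0, factor=2.0, cap=30.0, rng=random):
    """Return a random wait time in [0, min(cap, base * factor ** attempt)].

    `attempt` is 0 for the first retry. Passing a seeded random.Random
    instance as `rng` makes the result reproducible.
    """
    ceiling = min(cap, base * factor ** attempt)
    return rng.uniform(0, ceiling)
```

Because every client draws its own random value, two users who failed at the same moment are unlikely to come back at the same moment.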
After adding jittering, our density graph would look something like this
When we compare the two graphs, it's clear that the retries are now spread out over time instead of being concentrated at the same instants. A side-by-side comparison is shown below
The diagrams above compare exponential backoff with and without jittering for multiple users hitting the microservice simultaneously. The key observation is that exponential backoff with jittering gives the users distinct retry times, which avoids the contention that arises when they all retry together.
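The same comparison can be approximated numerically. The sketch below is a toy simulation under stated assumptions: 100 clients all fail at time 0, and we look at when each of them makes the same retry. `retry_times` and `peak_load` are hypothetical helpers, and jitter is modelled as a uniform draw between 0 and the capped delay.

```python
import random
from collections import Counter


def retry_times(clients, attempt, jitter, base=1.0, factor=2.0, cap=30.0,
                seed=0):
    """Wait time chosen by each client for the given retry attempt."""
    rng = random.Random(seed)
    ceiling = min(cap, base * factor ** attempt)
    if jitter:
        return [rng.uniform(0, ceiling) for _ in range(clients)]
    return [ceiling] * clients  # everyone waits exactly the same time


def peak_load(times, bucket=1.0):
    """Largest number of retries landing in a single time bucket."""
    counts = Counter(int(t // bucket) for t in times)
    return max(counts.values())


print(peak_load(retry_times(100, 3, jitter=False)))  # all 100 in one bucket
print(peak_load(retry_times(100, 3, jitter=True)))   # noticeably fewer
```

Without jitter, the server sees all 100 retries in the same one-second window. With jitter, the same retries are spread across the whole backoff interval, so the peak is much lower.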
Of all the implementations I have seen, the default mechanism provided is almost always exponential backoff with jitter.