Failproof micro-service: Retry Strategy for intermittent failures


This post is a continuation of Creating a Failure Resilient Application. I highly recommend reading that article before continuing here.

Microservices are built to handle high load. But like humans, everything has its limit.


Suppose we are managing the servers for IRCTC and the Auth Service is at its limit because of the Tatkaal window. Any new user trying to log in is waiting in the request queue and failing with a timeout.


You hit the login button again. It fails again.

Before throwing an error, every service retries to the best of its capabilities until it hits the limit shown in the diagram below.
(Figure: retry limits for each service)


Simple Retry

https://gist.github.com/NavjotBansal/5cc17a0183fb5e6e9db879909ff168e1
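As a minimal sketch of what a simple retry looks like (the endpoint URL, payload shape, and retry count here are illustrative assumptions, not taken from the gist):

```python
import requests

MAX_RETRIES = 3  # illustrative limit, not taken from the gist


def login_with_simple_retry(payload):
    """Call the auth service and retry immediately on every failure."""
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            # hypothetical auth endpoint; replace with the real service URL
            response = requests.post("https://auth.example.com/login",
                                     json=payload, timeout=5)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as error:
            last_error = error  # no wait between attempts: retry right away
    raise last_error
```

Notice that every failed attempt retries immediately. Under the Tatkaal-style load described above, this only adds more requests to an already congested queue.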

In such situations, it's essential to identify how we can simplify people's lives. Retrying infinitely would not only disappoint one user but could break the whole server by exhausting its resources.

We can start by controlling the congestion in the request queue.

Intuition

"Slow down, give it some rest, leave it for some time".

We really do this.

To achieve this, we use a simple algorithm called exponential backoff.

Exponential-backoff

Overview

After every retry failure, we increase the wait time before the next request by a multiplicative factor, generally kept at 2.

We do not retry infinitely; a maximum retry count caps the number of attempts, as shown below.

Algorithm

https://gist.github.com/NavjotBansal/e0a8792494668d1e3d5bd96a3371b613
Exponential backoff with jitter using a base time of 1 second and an exponent of 2, with a maximum wait time between calls of 30 seconds
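The gist above remains the reference; as a minimal sketch of such a loop using the parameters from the caption (base of 1 second, factor of 2, 30-second cap), where the function names, the retry budget of 5, and the choice of "full" jitter are my illustrative assumptions:

```python
import random
import time

BASE_DELAY = 1    # seconds (base time from the caption)
FACTOR = 2        # exponential growth factor
MAX_DELAY = 30    # cap on the wait time between calls
MAX_RETRIES = 5   # illustrative retry budget


def call_with_backoff_and_jitter(operation):
    """Retry `operation`, waiting an exponentially growing, jittered time between attempts."""
    for attempt in range(MAX_RETRIES):
        try:
            return operation()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # retry budget exhausted, surface the failure
            # exponential backoff capped at MAX_DELAY: 1, 2, 4, 8, 16, 30, ...
            backoff = min(MAX_DELAY, BASE_DELAY * (FACTOR ** attempt))
            # full jitter: sleep a random time in [0, backoff] so clients
            # that failed together do not retry together
            time.sleep(random.uniform(0, backoff))
```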
(Figure: Retry strategy in exponential backoff)

The system would exponentially increase the wait time through 1, 2, 4, 8 and 16 seconds, and would cap it at 30 seconds until the retry threshold is reached.

Problems with Exponential Backoff? Adding Jittering

What if requests from multiple users fail at the same time?

(Figure: User call-count density graph)

It would mean all the user retries would happen at the same instant.

Again we hit the same congestion problem, with only two possible solutions:

  1. Drop the failing connection permanently
  2. Modify exponential backoff to add randomness between calls

This idea of adding randomness to the exponential backoff sleep time is called jittering.
(Figure: Retry strategy in exponential backoff with jittering)
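To make the difference concrete, here is a small illustrative comparison of the two sleep computations; the parameter values match the earlier sketch and are assumptions, not values from the original gists.

```python
import random

BASE_DELAY, FACTOR, MAX_DELAY = 1, 2, 30  # same illustrative parameters as above
attempt = 3  # example: the fourth attempt

# Plain exponential backoff: every client that failed at the same moment
# sleeps exactly 8 seconds and retries in lockstep.
plain_sleep = min(MAX_DELAY, BASE_DELAY * (FACTOR ** attempt))

# Backoff with jitter: each client sleeps a random time within that window,
# so the retries are spread out instead of colliding again.
jittered_sleep = random.uniform(0, plain_sleep)

print(plain_sleep, round(jittered_sleep, 2))
```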

After adding jittering, our density graph would look something like this:

(Figure: Balancing retry calls with jittering)

When we compare the two graphs, it's obvious that the call density has been spread out. A side-by-side comparison is shown below.

(Figures: side-by-side comparison of exponential backoff with and without jittering)

The diagrams above compare exponential backoff and exponential backoff with jittering for multiple users calling the microservice simultaneously. The key observation is how exponential backoff with jittering assigns the users distinct retry times, resolving plausible race conditions.

Of all the implementations I have seen, the default mechanism provided is almost always exponential backoff with jitter.
