Be Resilient - Humans and Servers
Kaivalya Apte
The GeekNarrator Podcast | Staff Engineer | Follow me for #distributedsystems #databases #interviewing #softwareengineering
What is Resiliency?
Resiliency is the ability to recover quickly to a “normal” working state after degradations, problems, or failures. Like in life, there are always problems (ups and downs), but you don’t want to stay stuck in them; you want to fight back and return to your normal state (mental peace and happiness) as soon as possible.
Why Make Systems Resilient?
As a developer/architect of the system, it’s your responsibility to hide the problems happening in the background from its users. You might have hundreds of microservices hosted on thousands of servers, but one simple fact of life is “NO ONE CARES”, and rightfully so.
A user of your product (system) wants to use certain functionality, that’s it. They do not care how you provide that functionality. To make that functionality as smooth and as reliable as possible, you need to overcome the failures your systems will inevitably face.
But, why do we face problems?
Systems are HARD, distributed systems are HARDER. Why? Because we cannot simply assume certain guarantees in a distributed system, and those guarantees are exactly what make systems easy to reason about :)
Enter the famous fallacies of distributed systems: the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn’t change, there is one administrator, transport cost is zero, and the network is homogeneous.
If you could assume all of the above in a distributed system, imagine how easy it would be to work with. But life isn’t fair, it has challenging plans for you. None of these assumptions is always true.
So basically you have to deal with unreliable networks, unpredictable latency, limited bandwidth, and partial failures.
Ok, I am convinced we have problems. But solutions?
So failures are bound to happen, and as developers and architects we need to have measures in place to overcome them. There are strategies we can use depending on the type of failure.
One of the most common things we do in our applications is make a network call to another service or a database. As we said, the network is unreliable, which means these calls may take longer than expected or fail completely.
Timeouts
“Longer than expected” is the key here. You expect a response within a certain duration. Why? Because you don’t want to wait forever, as there is no guarantee that the response will eventually come. The network may have lost the packets, the server may have died, and so on. So the first strategy is “Timeouts”.
Whenever you make a call, you specify a timeout. This is the duration for which the caller waits for a response; after that, it fails the call, assuming it is not going to succeed.
Timeouts ensure you don’t consume resources for a long time on requests that are bound to fail. Instead, you fail fast and move on to other requests, or act on the failure accordingly.
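As a minimal sketch, here is what a call with a timeout might look like in Python; the inventory-service URL is purely hypothetical, and the two-second budget is just illustrative.

```python
import requests

# Hypothetical endpoint, used only for illustration.
INVENTORY_URL = "https://inventory-service.internal/api/items/42"

def fetch_item():
    try:
        # If the server does not respond within 2 seconds, requests raises
        # a Timeout instead of blocking the caller indefinitely.
        response = requests.get(INVENTORY_URL, timeout=2.0)
        return response.json()
    except requests.exceptions.Timeout:
        # Fail fast: free the connection and let the caller decide whether
        # to retry, fall back, or surface an error.
        return None
```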
But if the call failed and you timed out, what next? Well, you retry the call.
Retries
So yes, our next strategy is “Retries”. You make a call, it fails, you retry; but what if the retry fails as well? You retry one more time, and so on. But do we keep retrying forever? Well, it really depends on the type of application and the failure you are receiving.
When does retrying make the most sense?
How many times to retry?
If a failure is transient, a few retries are typically enough. But depending on the type of failure, you may want to retry for hours or days. It depends on the use case.
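As a minimal sketch (not a prescription), a basic retry wrapper could look like the code below; `operation` is any callable that makes the network call, and in practice you would only retry errors that are actually retryable (timeouts, 5xx responses), not every exception.

```python
import time

def call_with_retries(operation, max_attempts=3, delay_seconds=1.0):
    """Retry a transient-failure-prone operation a fixed number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, propagate the failure
            time.sleep(delay_seconds)  # fixed wait before the next attempt
```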
Problem with the basic retry strategy?
As you can see in the diagram above, all callers are calling the server and retrying on failures. The calls are failing (let’s say) because of load on the server, and if all callers retry right after a failure, they can hammer the server pretty badly and increase the load further. This is retry hell: instead of giving the server time to recover, client retries are making the problem worse.
Exponential backoff
Instead of retrying at a fixed interval, the idea is to increase the duration between retries exponentially. So you start at 2 sec, then wait 2^2 = 4 sec, then 2^3 = 8 sec, and so on. You also add a max cap so that you never wait more than `x` seconds. This `x` depends on your tolerance for the wait time between retries.
Exponential backoff definitely makes the calls less and less frequent and reduces congestion; however, it is still possible that all the callers end up calling at the same time.
Jitter
Jitter is a technique that adds a little bit of randomness to the “sleep” duration. It may seem naive, but it works quite nicely and spreads out the retry schedules. This is a great improvement on top of the backoff strategy.
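Putting backoff and jitter together, a sketch might look like this; the “full jitter” variant below (sleeping a random duration between zero and the exponential delay) is one common choice, and the specific numbers are only illustrative.

```python
import random
import time

def backoff_with_jitter(operation, max_attempts=5, base_seconds=2.0, cap_seconds=30.0):
    """Retry with exponentially growing, randomly jittered waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential: 2, 4, 8, ... seconds, capped at cap_seconds.
            exp_delay = min(cap_seconds, base_seconds * (2 ** (attempt - 1)))
            # Full jitter: sleep a random duration between 0 and exp_delay,
            # so concurrent clients don't all retry at the same moment.
            time.sleep(random.uniform(0, exp_delay))
```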
So remember
When things don’t happen as expected, don’t get stuck (timeout), and try again (retry) in a better way (wait and prepare well, i.e., back off a bit).
But what if the server is facing a real issue and even exponential retries are not going to help?
Circuit breaker
Making a network call and waiting, only to find out that it was going to fail anyway, is a terrible feeling. Sometimes, if the probability of failure is high enough, you don’t even want to make the network call; you want to fail fast instead.
The circuit breaker is a pattern borrowed from electrical circuits. When things go bad, you open the circuit so that no current can flow. Similarly, in the software world, you open the circuit so that no network call is made; clients fail fast and handle the failure in some other way.
How does that work?
A library keeps track of the number of failures in a given time window. If the number of failures exceeds a threshold, the circuit is opened, and while it is open, the call itself isn’t made. The circuit is closed again once the problem is resolved (typically after a cool-down period and a few successful trial calls). This way, all the unnecessary retries that would have just failed, causing congestion on the server side and wasted wait time on the client side, are simply skipped. This is a big win when the server isn’t able to recover quickly and needs some time to breathe :)
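A minimal, illustrative sketch of that bookkeeping is below; production-grade breakers usually track failures over a sliding time window and model an explicit half-open state, whereas this version simply counts consecutive failures and allows a trial call after a cool-down.

```python
import time

class CircuitBreaker:
    """Minimal sketch: open the circuit after too many consecutive failures."""

    def __init__(self, failure_threshold=5, reset_timeout_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.failure_count = 0
        self.opened_at = None  # timestamp at which the circuit was opened

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout_seconds:
                # Circuit is open: fail fast without making the network call.
                raise RuntimeError("circuit open, call skipped")
            # Cool-down elapsed: allow a trial call (half-open behaviour).
            self.opened_at = None
            self.failure_count = 0
        try:
            result = operation()
            self.failure_count = 0  # success resets the failure counter
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
```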
What happens when there are bad actors (clients)?
You need some mechanism to reject calls from such bad actors or excessive users.
This is when advanced techniques such as load shedding and load levelling come into the picture.
Load shedding
Observe the current load using metrics such as CPU usage, memory usage, or request-queue length, and when this load metric goes above a certain threshold (how do you know the threshold? answer: load testing), start rejecting a fraction of the requests to bring the load down. If the load isn’t reduced enough, you increase that fraction; if the load comes back down, you decrease it.
This is a great way to prevent a disaster: your servers keep serving at their nominal capacity regardless of how much traffic is sent to them, while you scale out and increase the capacity you can serve.
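A toy sketch of that feedback loop, assuming a hypothetical load metric normalised to 0–100 and a threshold found through load testing, could look like this:

```python
import random

LOAD_THRESHOLD = 80.0   # found via load testing (illustrative value)
shed_fraction = 0.0     # fraction of incoming requests to reject

def adjust_shed_fraction(current_load):
    """Called periodically: shed more when overloaded, less when recovering."""
    global shed_fraction
    if current_load > LOAD_THRESHOLD:
        shed_fraction = min(1.0, shed_fraction + 0.1)
    else:
        shed_fraction = max(0.0, shed_fraction - 0.1)

def should_reject_request():
    """Called per request: randomly reject the configured fraction."""
    return random.random() < shed_fraction
```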
What is the problem?
Your server rejected the excess load, but the problem is that not all requests are equal from the domain/business perspective. For example, let’s say your server receives two calls, updatePrice(newPrice) and updateProductDescription(newDescription). Which one should be rejected to shed load? Clearly, updating the price seems more important than updating the description, so you would rather drop the latter; but without a more sophisticated mechanism, you cannot make that distinction. Similarly, if you get two requests, one from a free user and one from a paid user, you would want to reject the free user’s request and serve the paid user’s, because you have different SLAs for each.
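One hedged way to approximate this, assuming each request carries a hypothetical `tier` attribute set during authentication, is to bias the shedding so that free-tier traffic absorbs rejections before paid-tier traffic does:

```python
import random

def should_reject(request, shed_fraction):
    """Reject low-priority (free-tier) requests before paid-tier ones."""
    if request.tier == "free":
        # Free-tier traffic absorbs the shedding first.
        return random.random() < min(1.0, shed_fraction * 2)
    # Paid-tier traffic is only rejected once shedding becomes severe.
    return random.random() < max(0.0, shed_fraction * 2 - 1.0)
```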
Rate Limiting
A slightly more advanced load-rejection pattern is rate limiting, where you reject requests based on individual request properties once a limit is exceeded. For example, if a free-tier user is allowed to make 10 req/sec but ends up making 15 req/sec, you can reject the excess. Similarly, you may rate limit based on the caller’s IP address. There are many ways to implement rate limiting, and the choice primarily depends on your domain/business requirements, but the idea is the same.
Instead of rejecting requests blindly, we reject them in a granular way while prioritising the ones that are of higher importance.
When you reject a request this way, you typically return an error the client can recognise, such as HTTP 429 (Too Many Requests), so it knows to back off.
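A common way to implement rate limiting is a token bucket per client; the sketch below assumes a per-user or per-IP key and purely illustrative limits.

```python
import time

class TokenBucket:
    """Minimal per-client token bucket: allow roughly `rate` requests/second."""

    def __init__(self, rate, burst):
        self.rate = rate           # tokens added per second
        self.capacity = burst      # maximum tokens the bucket can hold
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens proportionally to the time elapsed since last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject (e.g. respond with HTTP 429)

# One bucket per caller, keyed by user id or IP address (hypothetical keys).
buckets = {}

def is_allowed(client_key, rate=10, burst=10):
    bucket = buckets.setdefault(client_key, TokenBucket(rate, burst))
    return bucket.allow()
```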
But what happens if the server ends up spending most of its resources just rejecting requests? It can still become a bottleneck. So load shedding and rate limiting do help, but there is a limitation, mainly because clients and servers (nodes) are coupled at runtime.
Can we decouple the server so that, no matter how many requests the clients make, the server is still able to process them at a constant rate?
Well, yes, the next strategy does exactly that.
Load Levelling with Queues
We add a queue to act as a buffer between the callers and the server.
This helps a lot with peaks (intermittent heavy load): the server is no longer impacted, because it processes tasks from the queue at a fixed rate, no matter how many callers are enqueuing tasks at the same time.
As you can imagine, not all workloads are suitable for this pattern: it is asynchronous, so workloads that need an immediate, synchronous response don’t fit well.
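A tiny sketch of the idea with an in-process queue is below; in a real system the buffer would usually be a durable message queue (Kafka, SQS, RabbitMQ, etc.), and the numbers here are made up.

```python
import queue
import threading
import time

# Bounded buffer between callers and the worker; a full queue is back-pressure.
task_queue = queue.Queue(maxsize=1000)

def enqueue_task(task):
    """Called by the many producers; returns False when the buffer is full."""
    try:
        task_queue.put_nowait(task)
        return True
    except queue.Full:
        return False  # fall back to shedding / rate limiting from above

def worker(process, tasks_per_second=50):
    """Drains the queue at a fixed rate, shielding the backend from bursts."""
    interval = 1.0 / tasks_per_second
    while True:
        task = task_queue.get()   # blocks until work is available
        process(task)
        time.sleep(interval)      # pace the backend at a constant rate

def handle_task(task):
    print("processing", task)     # stand-in for the real backend work

threading.Thread(target=worker, args=(handle_task,), daemon=True).start()
```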
What if the queue becomes full? Well you have all the above patterns with you, so I am sure you can devise a solution :)
I talked about these patterns, explaining the intuition behind them with some easy-to-understand examples, in the video below.
If you like the content, please subscribe to the newsletter and The GeekNarrator YouTube channel. I regularly create podcasts and videos on technical topics.
Join my Discord community.
Cheers,
The GeekNarrator