Be Resilient - Humans and Servers

Be resilient in life - for humans and servers.

What is Resiliency?

Resiliency is the ability to recover quickly to a “normal” working state after degradations, problems, or failures. Just like in life, there are always problems (ups and downs), but you don’t want to stay stuck in them; you want to fight back and return to your normal state (mental peace and happiness) as soon as possible.

Why Make Systems Resilient?

As a developer or architect of a system, it is your responsibility to hide the problems happening in the background from its users. You might have hundreds of microservices hosted on thousands of servers, but one simple fact of life is “NO ONE CARES”, and rightfully so.

A user of your product (system) wants to use certain functionality, that’s it. They do not care how you provide it. To make that functionality as smooth and reliable as possible, you want to overcome the failures your systems will inevitably face.

But, why do we face problems?

Systems are HARD, distributed systems are HARDER. Why? Because in a distributed system we cannot simply assume certain guarantees, and it is exactly those guarantees that make systems easy to reason about :)

Enter the famous fallacies of distributed computing.

[Image: the fallacies of distributed computing. Credit: Wikipedia]

If you could assume all of the above in a distributed system, imagine how easy it would be to work with. But life isn’t fair, it has challenging plans for you, so none of these assumptions always holds.

So, in reality, you have:

  • Unreliable network - It can drop packets.
  • Latency is > 0 - There is physical distance to cover, so latency is never zero.
  • Bandwidth is finite - You can only send a limited amount of data at a limited speed. Network congestion is a real thing when you operate at high scale.
  • Network is insecure - The only secure system is one that is disconnected from the network, locked in an iron safe, and buried hundreds of feet under the ocean. But guess what, that system isn’t usable anymore.
  • Topology changes often - The only constant in the universe is change. Topology changes often, and designing systems with that in mind is extremely important. A topology change may impact the entire working of your system and result in downtime, which is the last thing you want.
  • There are several administrators - You may have a single administrator for your computer-lab network, but as systems get bigger and bigger (planet scale), a single administrator doesn’t work. With multiple admins comes the need to communicate well, and that is where most of us fall short: we are human, we make mistakes, we are ambiguous, and sometimes we just forget the details :). There are hundreds of configs, tens of databases, thousands of environment variables, and so on; far too much for a single admin to handle.
  • Transport cost is > 0 - Cost is never zero, you pay one way or the other. It is about making trade-offs and driving the cost down as far as possible.
  • Network is non-homogeneous - Not all components on the network speak the same language, understand the domain in the same way, or interpret commands and actions in the same way. For example, a value that is nullable for one system may be required for another.

Ok, I am convinced we have problems. But solutions?

So failures are bound to happen, and as developers and architects we need to make sure we have measures in place to overcome them. There are strategies we can use depending on the type of failure.

One of the most common things we do in our applications is make a network call to another service or a database. As we said, the network is unreliable, which means these calls may take longer than expected or fail completely.

Timeouts

“Longer than expected” is the key here. You expect a response within a certain duration. Why? Because you don’t want to wait forever; there is no guarantee that the response will ever come. The network may have lost the packets, the server may have died, and so on. So the first strategy is “Timeouts”.

Whenever you make a call, you specify a timeout: the duration for which the caller waits for the response before failing the call, assuming it is not going to succeed.

Timeouts ensure you don’t consume resources for a long time on requests that are bound to fail. You fail fast instead, and either proceed with other requests or act on the failure accordingly.
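To make this concrete, here is a minimal Python sketch of a call with a timeout, using only the standard library. The URL and the 2-second budget are illustrative assumptions, and connection errors and timeouts are lumped together for brevity:

```python
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url: str, timeout_seconds: float = 2.0) -> bytes:
    """Give up after timeout_seconds instead of waiting forever for a response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        # Fail fast: the caller can now retry, fall back, or report the error.
        raise TimeoutError(f"no response from {url} within {timeout_seconds}s") from exc
```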



But if the call failed and you timed out, what next? Well, you retry the call.

Retries

So yes, the next strategy is “Retries”. You make a call, it fails, you retry. But what if the retry fails as well? You retry one more time, and so on. Do we keep retrying forever? That really depends on the type of application and the failure you are receiving.

When does retrying make the most sense?

  • Transient failures - when you expect the call to succeed eventually.
  • AND
  • When your call is idempotent. Idempotent calls are safer to retry because repeating them doesn’t make incorrect state changes.

How many times to retry?

If a failure is transient, a few retries are typically enough. But depending on the type of failure, you may want to retry for hours or days. It depends on the use case.
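As a sketch (assuming the operation is idempotent and its failures surface as TimeoutError), a basic fixed-interval retry loop could look like this:

```python
import time

def call_with_retries(operation, max_attempts: int = 3, delay_seconds: float = 1.0):
    """Retry a transient, idempotent operation a fixed number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise                      # out of attempts, surface the failure
            time.sleep(delay_seconds)      # fixed wait before the next attempt
```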

[Image: multiple callers retrying against the same server after failures]


Problem with the basic retry strategy?

As you see in the above diagram, all callers are calling the server and retrying on failures. The calls are failing (let’s say) because of load on the server, and if all callers retry right after the failure, they might hammer the server pretty badly and increase the load further. This is retry hell: instead of giving the server time to recover, client retries make the problem worse.

Exponential retry

Instead of retrying at a fixed interval, the idea is to increase the duration between retries exponentially. So you start at 2 sec, then wait 2^2 = 4 sec, then 2^3 = 8 sec, and so on. You also add a maximum cap so that you never wait more than `x` seconds between retries. This `x` depends on how long you can tolerate waiting between retries.

Exponential backoff definitely makes the calls less and less frequent and reduces congestion; however, it is still possible that all the callers end up calling at the same time.

Jitter is a technique that adds a little bit of randomness to the “sleep” duration. It may seem naive, but it works quite nicely and spreads out the retry schedules. This is a great improvement on top of the backoff strategy.
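A minimal sketch of retries with exponential backoff, a cap, and full jitter; the 2-second base, 30-second cap, and 5 attempts are illustrative assumptions:

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 5,
                      base_seconds: float = 2.0, cap_seconds: float = 30.0):
    """Retry with exponential backoff and full jitter, capped at cap_seconds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: 2s, 4s, 8s, ... capped at cap_seconds.
            backoff = min(cap_seconds, base_seconds * (2 ** (attempt - 1)))
            # Full jitter: sleep a random duration in [0, backoff] so callers spread out.
            time.sleep(random.uniform(0, backoff))
```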

So remember

When things do not happen, don’t get stuck (timeout), and try again (retry) in a better way (wait and prepare well, i.e. back off a bit).

But what if the server is facing a real issue and even exponential retries are not going to help?

Circuit breaker

[Image: Circuit breaker]

Making a network call and waiting, only to learn that the call was going to fail anyway, is a terrible feeling. If the probability of a request failing is high enough, sometimes you don’t even want to make the network call; you want to fail fast instead.

The circuit breaker is a pattern borrowed from electrical circuits. When things go bad, you open the circuit so that no current can flow. Similarly, in the software world, you open the circuit so that no network call is made; clients fail fast and can act on the failure in a different manner.

How does that work?

A library keeps track of the number of failures in a given time window. If that number exceeds a threshold, you open the circuit, and while it is open, the call itself isn’t made until the circuit closes again. The circuit can be closed again once the problem is resolved. This way, all the unnecessary retries that would have just failed, causing congestion on the server side and unnecessary wait times on the client side, are simply skipped. This is a big win when the server isn’t able to recover quickly and needs some time to breathe :)
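Here is a deliberately simplified sketch of the idea. A real implementation would typically also have a half-open state to probe the server, and the thresholds below are illustrative assumptions:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: opens after too many failures, closes again after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failure_count = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open, call rejected")    # fail fast, no network call
            self.opened_at = None          # cool-down over: close the circuit and try again
            self.failure_count = 0
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()                    # open the circuit
            raise
        self.failure_count = 0             # a success resets the failure counter
        return result
```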

What happens when there are bad actors (clients)?

You need some mechanism to reject calls from such bad actors or excessive users.

This is when advanced techniques such as load shedding and load levelling come into the picture.

Load shedding

Observe the current load using metrics such as CPU usage, memory usage, request queue length, etc., and when this load metric goes above a certain threshold (how do you know the threshold? Answer: load testing), start rejecting a fraction of the requests to reduce the load. If the load isn’t reduced much, you increase that fraction; if the load drops, you decrease it.

This is a great way to prevent a disaster: your servers keep serving their nominal capacity regardless of the traffic sent to them, while in the meantime you scale out and increase the capacity you can serve.
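As a rough sketch of that feedback loop (the Unix load average per CPU as the load signal, the 0.8 threshold, and the 0.1 adjustment step are all illustrative assumptions; in practice you would plug in your own metrics and tune the threshold via load testing):

```python
import os
import random

LOAD_THRESHOLD = 0.8        # found via load testing (illustrative value)
rejection_fraction = 0.0    # fraction of requests currently being shed

def current_load() -> float:
    """Normalised load signal; 1-minute load average per CPU as a stand-in metric."""
    return os.getloadavg()[0] / os.cpu_count()

def should_shed_request() -> bool:
    """Shed a random fraction of requests whenever the load signal is above the threshold."""
    global rejection_fraction
    if current_load() > LOAD_THRESHOLD:
        rejection_fraction = min(1.0, rejection_fraction + 0.1)   # shed more aggressively
    else:
        rejection_fraction = max(0.0, rejection_fraction - 0.1)   # recover gradually
    return random.random() < rejection_fraction
```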

What is the problem?

Your server rejected the excess load, but the problem is that not all requests are equal when you look at them from a domain/business perspective. For example, let’s say your server received two calls, updatePrice(newPrice) and updateProductDescription(newDescription). Which call should be rejected to avoid additional load? Clearly, updating the price seems more important than updating the product description, but without a more sophisticated mechanism you cannot make that distinction. Similarly, if you get two requests, one from a free user and one from a paid user, you would want to reject the free user’s request and serve the paid user’s, because you have different SLAs for each.

Rate Limiting

A slightly more advanced load-rejection pattern is rate limiting, where you reject requests based on individual request properties once a limit is exceeded. For example, if a free-tier user is allowed to make 10 req/sec but ends up making 15 req/sec, you can reject the excess. Similarly, you may rate limit based on the caller’s IP address. There are many ways to implement rate limiting, and the choice mostly depends on your domain/business requirements, but the idea is the same.

Instead of rejecting requests blindly, we reject them in a granular way while prioritising the ones that are of higher importance.
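One common way to implement this is a per-key token bucket; here is a minimal sketch, where the keys, rate, and burst size are illustrative assumptions:

```python
import time
from collections import defaultdict

class TokenBucketLimiter:
    """Per-key token bucket: each key (user id, IP, tier, ...) gets `rate` requests per second."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate                                    # tokens added per second
        self.burst = burst                                  # maximum bucket size
        self.tokens = defaultdict(lambda: burst)
        self.last_refill = defaultdict(time.monotonic)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill[key]
        self.last_refill[key] = now
        # Refill tokens for the time that has passed, up to the burst size.
        self.tokens[key] = min(self.burst, self.tokens[key] + elapsed * self.rate)
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False                                        # over the limit: reject the request

# e.g. a free-tier user limited to roughly 10 req/sec:
free_tier = TokenBucketLimiter(rate=10, burst=10)
```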


But what happens if the server ends up consuming most of its resources just rejecting requests? It might still become a bottleneck. So load shedding and rate limiting do help, but there is a limitation, mainly because clients and the server (nodes) are coupled at runtime.

Can we decouple the server such that, no matter how many requests the clients make, the server is still able to process them at a constant rate?

Well, yes, the next strategy does exactly that.


Load Levelling with Queues

[Image: Load levelling with a queue]

We add a queue to act as a buffer between the callers and the server.

This helps a lot with peaks (intermittent heavy loads): the server is no longer impacted, because it processes tasks from the queue at a fixed rate, no matter how many callers are enqueuing tasks at the same time.

As you can imagine, not all workloads are suitable for this pattern: it is asynchronous by nature, so highly synchronous workloads aren’t really compatible with it.

What if the queue becomes full? Well, you now have all the above patterns at your disposal, so I am sure you can devise a solution :)
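A minimal sketch of the idea with a bounded in-process queue (a real system would typically use a durable message queue; the rate, queue size, and the process() handler are illustrative assumptions). Note how a full queue simply turns into a rejected request, falling back on the patterns above:

```python
import queue
import threading
import time

task_queue = queue.Queue(maxsize=1000)      # bounded buffer between callers and the server

def process(task):
    """Placeholder for the real work done per request (hypothetical)."""
    print("processed", task)

def enqueue_request(task) -> bool:
    """Callers put work on the queue; a full queue is handled like a shed request."""
    try:
        task_queue.put_nowait(task)
        return True
    except queue.Full:
        return False                         # reject, or let the caller back off and retry

def worker(requests_per_second: float = 50.0):
    """The server drains the queue at a fixed rate, no matter how bursty the callers are."""
    interval = 1.0 / requests_per_second
    while True:
        process(task_queue.get())
        task_queue.task_done()
        time.sleep(interval)                 # fixed processing rate smooths out load peaks

threading.Thread(target=worker, daemon=True).start()
```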

I talked about these patterns, explaining the intuition behind them with some easy-to-understand examples, in the video below.



If you like the content, please subscribe to the newsletter and to The GeekNarrator YouTube channel. I regularly create podcasts and videos on technical topics.

Also follow me on LinkedIn and Twitter.

Join my Discord community.

Cheers,

The GeekNarrator
