Challenges faced while working with Distributed Systems

Introduction

The issues and faults we encounter while working with a Distributed System can be very different from those we see on a single computer.

On a single computer, the same operation normally produces the same result every time. Such systems are not meant to be flaky: either the system executes an operation successfully or it fails entirely. There is usually no middle ground and no state of partial failure.

For example, if there is a hardware problem such as memory corruption or an unplugged wire, the consequence is generally a total system failure.

In Distributed Systems, we are generally working with multiple machines connected over a network, so partial failure is a real possibility: some parts of the system may be broken in unpredictable ways while other parts are working fine.

A single computer generally serves a single user or a very small group of users. Hence, we can afford to fail the entire system instead of handling a partial failure. Since downtime is usually not critical, we can bring the system back up once the fault is resolved.

But Distributed Systems generally serve a huge user base that can be scattered all over the globe, so even a little downtime can be critical. We cannot afford to shut down the entire system in case of a partial failure (the failure of a couple of nodes, or a network failure). Our system should be fault tolerant, i.e. it should be able to serve its user base even though some parts of the system are broken in unpredictable ways.


Unreliability in Distributed Systems

Distributed Systems can be unreliable in many ways. Nodes send messages (packets) to other nodes asynchronously over the network, but the network makes no guarantee that a packet will be received by the destination node.

There are many things that can go wrong:

  1. Network Failure: Your packet may get lost in the network on its way from the source to the destination node. Alternatively, the destination node may have processed the request, but the acknowledgement got lost on the way back.

[Figure: Acknowledgement packet lost due to a network failure]


  2. Packet Queued: Your packet may be waiting in a queue to be processed because the network or the recipient is overloaded. Alternatively, the recipient may have processed your packet, but the acknowledgement is queued because the network or the requesting node is overloaded.


[Figure: Packet waiting in a queue at the receiver node]


  3. Recipient Node Failure: The recipient node may have crashed, or it may have temporarily stopped responding, for example due to a long garbage collection pause.


[Figure: Destination node crashed]


Looking at these failure cases, we can see that it is difficult for a node to tell whether its packet was delivered and processed successfully. The usual way of handling this is a Timeout: the sender waits for some time after sending the packet and then gives up, assuming the recipient node has crashed.
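
The timeout idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "network" is a thread and an in-memory queue, and the names send_with_timeout, fast_node and slow_node are hypothetical, not part of any real library.

```python
import queue
import threading
import time


def send_with_timeout(handler, payload, timeout_s):
    """Deliver payload to handler on another thread and wait up to timeout_s for an ack.

    Returns the ack, or None if no ack arrived in time.
    """
    acks = queue.Queue()

    def recipient():
        # The recipient processes the request and sends back an acknowledgement.
        acks.put(handler(payload))

    threading.Thread(target=recipient, daemon=True).start()
    try:
        return acks.get(timeout=timeout_s)
    except queue.Empty:
        # The sender cannot tell which failure happened: a lost request,
        # a slow or crashed recipient, or a lost or delayed ack.
        return None


def fast_node(payload):
    return f"ack:{payload}"


def slow_node(payload):
    time.sleep(0.5)  # simulates an overloaded node or a long GC pause
    return f"ack:{payload}"


print(send_with_timeout(fast_node, "m1", timeout_s=0.2))  # ack arrives in time
print(send_with_timeout(slow_node, "m2", timeout_s=0.2))  # sender gives up
```

Note that in the second call the slow node still finishes its work after the sender has given up, which is exactly the ambiguity described above.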


Premature failure declaration of Nodes

In the previous section we discussed Timeout as one way of handling the uncertainties in Distributed Systems. But how long should we wait before declaring a node dead?

A long timeout means a longer wait until a node is declared dead. Requests sent to that node keep waiting in the meantime, so the response time increases, which can lead to a terrible user experience.

A short timeout could cause a Node to be declared dead when in reality it was just running slowly due to increased load. Prematurely declaring a Node dead can cause problems:

  1. Multiple occurrences of an event: A node may be in the middle of performing an action which can't be rolled back, such as sending an email to a group of users. If the node is prematurely declared dead in the meantime and another node takes over, the email can end up being sent twice.
  2. Cascading Failures: Suppose the system is under heavy load and the majority of nodes are suffering a temporary slow-down. Some nodes are prematurely declared dead and their load is transferred to the remaining nodes. This makes the remaining nodes even slower, so they end up being prematurely declared dead as well. The cascade can finally cause the entire system to fail.
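
The cascading failure case can be made concrete with a toy simulation. It is a rough sketch under stated assumptions, not a model of any real system: load is shared equally among live nodes, and a node whose share exceeds its capacity responds too slowly and is declared dead. The function name simulate_cascade and the capacity numbers are made up for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of live nodes after each round of failure detection.

    Assumption: total_load is split equally among live nodes, and any node
    whose share exceeds its capacity times out and is declared dead.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        share = total_load / len(alive)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # every live node can handle its share: the system is stable
        alive = survivors  # the dead nodes' load moves to the survivors
        history.append(len(alive))
    return history


# Four nodes that together could handle the load (10 + 12 + 14 + 16 = 52 >= 44),
# yet one slow node being declared dead drags down all the others.
print(simulate_cascade([10, 12, 14, 16], total_load=44))  # [4, 3, 1, 0]

# Slightly less load, and nobody is declared dead.
print(simulate_cascade([10, 12, 14, 16], total_load=40))  # [4]
```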




Multiple Leader Problem

Many Distributed Systems rely on the concept of a Quorum: the nodes in the system vote, and the majority gets to make the decision.
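
As a minimal sketch of majority voting, assume each node casts one vote for a candidate and a decision needs strictly more than half of the whole cluster. The function names has_quorum and elect_leader and the node names are hypothetical.

```python
from collections import Counter


def has_quorum(votes_for, cluster_size):
    """True if votes_for is a strict majority of the whole cluster."""
    return votes_for > cluster_size // 2


def elect_leader(votes, cluster_size):
    """votes maps voter -> candidate. Return the majority candidate, or None."""
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if has_quorum(count, cluster_size) else None


# Three of five nodes agree, so n2 wins.
print(elect_leader({"n1": "n2", "n2": "n2", "n3": "n2", "n4": "n1", "n5": "n1"}, 5))

# Only four votes arrived and they are split 2-2: no majority, no leader.
print(elect_leader({"n1": "n2", "n2": "n2", "n4": "n1", "n5": "n1"}, 5))
```

Counting against the full cluster size, rather than against the votes received, is what prevents two disjoint groups of nodes from both reaching a majority.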

Suppose a node was elected leader but was later prematurely declared dead by the other nodes. It was demoted, and the remaining nodes elected a different node as their new leader.

But when the node that was declared dead starts functioning again, it may still believe it is the leader and could do something incorrect.


Problem Statement

Suppose there is a file which can be edited by only a single node at a time. Hence we have a locking service which allows one client at a time to hold the lock on the file.




Here, Node-1 got the lock on the file, but then it paused and was prematurely declared dead. The lock expired and a new node (Node-2) got the lock on the file and started writing data to it. In the meantime Node-1 became active again, with no idea that its lock had expired. Unknowingly, Node-1 also writes to the file, causing data corruption.
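
The sequence of events can be replayed with a small sketch. It assumes the lock is a lease that expires after a fixed number of seconds and uses explicit timestamps instead of a real clock. LeaseLock and UnsafeFile are hypothetical names made up for this illustration.

```python
class LeaseLock:
    """A lock that is held for lease_s seconds and then silently expires."""

    def __init__(self, lease_s):
        self.lease_s = lease_s
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node, now):
        # The lock is free if nobody holds it or the previous lease ran out.
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.lease_s
            return True
        return False


class UnsafeFile:
    """Accepts every write without checking who currently holds the lock."""

    def __init__(self):
        self.writes = []

    def write(self, node, data):
        self.writes.append((node, data))


lock = LeaseLock(lease_s=10)
shared_file = UnsafeFile()

assert lock.acquire("node-1", now=0)   # Node-1 gets the lock, then pauses
assert lock.acquire("node-2", now=12)  # lease expired at t=10, Node-2 gets it
shared_file.write("node-2", "B")
shared_file.write("node-1", "A")       # Node-1 wakes up and writes anyway

print(shared_file.writes)  # both writes landed: the file is corrupted
```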


Token Based Solution

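Suppose that every time a Node gets the lock on the file, the lock service also returns a token, which must be validated on every write to the file. This token is a number which increases every time the lock is granted.

If a node's token fails validation, its write request on the file is rejected. For example, the file can reject any write that carries a token older than one it has already seen, which is what would stop Node-1 in the scenario above.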
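
A minimal sketch of the token idea follows. It assumes that "validating" a token means the file remembers the highest token it has seen and rejects anything lower. The article does not spell out the validation rule, so treat that rule and the names TokenLockService and FencedFile as assumptions. Lock expiry is left out to keep the example short.

```python
class TokenLockService:
    """Hands out a token that increases every time the lock is granted."""

    def __init__(self):
        self._last_token = 0

    def acquire(self):
        self._last_token += 1
        return self._last_token


class FencedFile:
    """Rejects writes that carry a token older than one already seen."""

    def __init__(self):
        self.highest_token = 0
        self.contents = []

    def write(self, token, data):
        if token < self.highest_token:
            return False  # stale token: the writer's lock has been superseded
        self.highest_token = token
        self.contents.append(data)
        return True


locks = TokenLockService()
shared_file = FencedFile()

token_1 = locks.acquire()  # Node-1 gets the lock with token 1, then pauses
token_2 = locks.acquire()  # the lock expires and Node-2 gets token 2

print(shared_file.write(token_2, "B"))  # True: Node-2's write is accepted
print(shared_file.write(token_1, "A"))  # False: Node-1's stale write is rejected
print(shared_file.contents)             # ['B']
```

The important design point is that the check happens at the file (the resource being protected), not at the paused node, because the paused node has no way of knowing that its lock expired.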




Conclusion

We discussed several challenges faced while working with Distributed Systems. We looked at the problem of prematurely declaring nodes dead and its dangerous impact on the system. Finally, we discussed the multiple leader problem in a distributed system and a token based solution to it.

Meanwhile, please Like and Share this edition with your peers, and subscribe to this Newsletter to be notified when I publish more content. Share this Newsletter with anyone who might benefit from it.

Until next time, Dive Deep and Keep Learning!
