
Challenges faced while working with Distributed Systems

Saurav Prateek

Engineer @ Google | Ex-SWE @ GeeksForGeeks | Authoring engineering newsletter with 30K+ Subs | 60K+ Linkedin | Content Creator | Mentor

Published: April 3, 2023

Introduction

The issues and faults we encounter in a distributed system can be very different from those we see on a single computer.

On a single computer with working hardware, an operation is deterministic: performing it again produces the same result. These systems are not meant to be flaky. An operation either executes successfully or fails entirely; there is no middle ground and no state of partial failure.

For example, if there is a hardware problem such as memory corruption or an unplugged wire, the consequence is generally a total system failure.

In a distributed system, we are generally working with multiple machines connected over a network, so partial failure becomes a real possibility: some parts of the system can be broken in unpredictable ways while other parts are working fine.

A single computer generally serves a single user or a very small group of users. Hence, we can afford to let the entire system fail instead of handling a partial failure, then bring it back up once the fault is resolved, since downtime is usually not critical.

But a distributed system generally serves a huge user base that can be scattered all over the globe, so even a little downtime can be critical. We cannot afford to shut down the entire system in case of a partial failure (the failure of a couple of nodes, or a network failure). Our system should be fault tolerant, i.e. it should be able to keep serving its users even though some parts of it are broken in unpredictable ways.

Unreliability in Distributed Systems

Distributed systems can be unreliable in many ways. Nodes send messages (packets) to one another asynchronously over the network, but the network makes no guarantee that a packet will be received by the destination node.

There are many things that can go wrong:

1. Network Failure: Your packet may get lost in the network on its way from the source to the destination node. Alternatively, the destination node may have processed the request, but its acknowledgement got lost in the network on the way back.

[Figure: Acknowledgement packet lost due to a network failure]

2. Packet Queued: Your packet may be waiting in a queue to be processed because the network or the recipient is overloaded. Alternatively, the recipient node may have processed your packet, but the acknowledgement is queued because the network or the requesting node is overloaded.

3. Recipient Node Failure: The recipient node may have crashed, or it may have temporarily stopped responding, for example due to a long garbage collection pause.

Looking at all these failure cases, we can say that it is difficult for a node to tell whether its packet was delivered and processed successfully. The usual way of handling this is a timeout: the sender waits for some time after sending the packet and then gives up, assuming the recipient node has crashed.
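The timeout idea can be sketched in a few lines of Python. This is only an illustrative simulation, not a real network client: the recipients are plain functions run on a thread, and the names `send_with_timeout`, `healthy_node`, `slow_node` and `node_with_lost_ack` are made up for this example. A lost acknowledgement is modelled by a recipient that does its work but returns `None`.

```python
import queue
import threading
import time


def send_with_timeout(recipient, payload, timeout_s):
    """Deliver payload to recipient on another thread and wait for its ack.

    Returns ("ack", reply) if a reply arrives within timeout_s seconds,
    otherwise ("timeout", None).
    """
    acks = queue.Queue(maxsize=1)

    def deliver():
        reply = recipient(payload)
        if reply is not None:  # None models an ack lost in the network
            acks.put(reply)

    threading.Thread(target=deliver, daemon=True).start()
    try:
        return ("ack", acks.get(timeout=timeout_s))
    except queue.Empty:
        return ("timeout", None)


processed = []  # payloads the recipients actually handled


def healthy_node(payload):
    processed.append(payload)
    return f"ok:{payload}"


def slow_node(payload):
    time.sleep(0.3)  # e.g. overloaded, or paused for garbage collection
    processed.append(payload)
    return f"ok:{payload}"


def node_with_lost_ack(payload):
    processed.append(payload)  # the request was processed...
    return None                # ...but the ack never reaches the sender


print(send_with_timeout(healthy_node, "m1", timeout_s=0.1))
print(send_with_timeout(slow_node, "m2", timeout_s=0.1))
print(send_with_timeout(node_with_lost_ack, "m3", timeout_s=0.1))
```

The last two calls both end in a timeout, and from the sender's side they look identical, even though `m3` was in fact processed and `m2` will be processed a little later. The timeout tells the sender that no answer arrived in time, not what happened at the recipient.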

Premature failure declaration of Nodes

In the previous section we discussed timeouts as one way of handling uncertainty in distributed systems. But how long should we wait before declaring a node dead?

A long timeout means a long wait before a node that has really failed is declared dead. During that wait, requests sent to it go unanswered, so response time increases and the user experience suffers.

A short timeout detects failures faster, but it can cause a node to be declared dead when in reality it was only running slowly due to increased load. Prematurely declaring a node dead can cause problems:

1. Multiple occurrences of events: A node may be in the middle of performing an action that can't be rolled back, such as sending an email to a group of users. If the node is prematurely declared dead in the meantime and another node takes over, the emails can end up being sent twice.

2. Cascading Failures: Suppose the system is under heavy load and the majority of nodes are suffering a temporary slow-down. Some nodes are prematurely declared dead and their load is transferred to the remaining nodes. This makes the remaining nodes even slower, so they too end up being prematurely declared dead. The failure cascades and can finally bring down the entire system.
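The cascading effect can be illustrated with a toy simulation. Everything here is an assumption made for the sake of the example: the function name `simulate_cascade`, the even spreading of load, and the latency model in which response time grows linearly once a node is over capacity. Real systems behave in more complicated ways.

```python
def simulate_cascade(total_load, node_count, capacity, timeout_s,
                     base_latency_s=0.1):
    """Return how many nodes are still considered alive once things settle.

    Load is spread evenly over the nodes that are still alive. Whenever the
    (assumed) response time exceeds timeout_s, one node is declared dead and
    its share of the load moves to the remaining nodes.
    """
    alive = node_count
    while alive > 0:
        load_per_node = total_load / alive
        # Hypothetical model: latency scales with load once over capacity.
        latency = base_latency_s * max(1.0, load_per_node / capacity)
        if latency <= timeout_s:
            return alive  # everyone answers in time, the system is stable
        alive -= 1  # a slow node is declared dead, the rest get more load
    return 0


# Normal load: all 10 nodes answer within the timeout.
print(simulate_cascade(total_load=1000, node_count=10, capacity=100, timeout_s=0.15))

# 60% overload with the same timeout: every removal makes the rest slower.
print(simulate_cascade(total_load=1600, node_count=10, capacity=100, timeout_s=0.15))

# Same overload with a more generous timeout: slow, but nothing is declared dead.
print(simulate_cascade(total_load=1600, node_count=10, capacity=100, timeout_s=0.2))
```

In this model the overloaded system with the tight timeout ends with no nodes left, while the same load with a slightly longer timeout keeps all ten nodes serving, only more slowly.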

Multiple Leader Problem

Many distributed systems rely on the concept of a quorum: the nodes in the system vote, and a decision is made only when a majority of them agree. Decisions such as electing a leader or declaring a node dead can be taken this way.
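As a rough sketch, a majority vote can be expressed as follows. The helper names `has_quorum` and `elect_leader` are hypothetical and this is not a real consensus protocol; it only shows the majority rule itself, counted against the full cluster size rather than against the nodes that happened to vote.

```python
from collections import Counter


def has_quorum(votes, cluster_size):
    """True if votes is a strict majority of the cluster."""
    return votes > cluster_size // 2


def elect_leader(ballots, cluster_size):
    """Return the candidate backed by a majority of the whole cluster, or None.

    ballots maps each voting node to the candidate it voted for. Nodes that
    are unreachable simply do not appear in it.
    """
    if not ballots:
        return None
    candidate, votes = Counter(ballots.values()).most_common(1)[0]
    return candidate if has_quorum(votes, cluster_size) else None


# Three of five nodes agree, so the decision stands.
print(elect_leader({"n1": "n2", "n2": "n2", "n3": "n2"}, cluster_size=5))

# Only two of five nodes voted: no majority, no decision.
print(elect_leader({"n1": "n1", "n2": "n2"}, cluster_size=5))
```

Because any two majorities of the same cluster must overlap in at least one node, two conflicting decisions cannot both win a vote.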

Suppose a node was elected leader but was later prematurely declared dead by the other nodes. Having declared it dead, the remaining nodes demoted it and elected a different node as their new leader.

But when the node that was declared dead resumes, it may still believe it is the leader and could do something incorrect, since the system now has two nodes acting as leader.

Problem Statement

Suppose there is a file that may be edited by only one node at a time. To enforce this, we have a locking service that grants the lock on the file to one client at a time.

Suppose Node-1 acquires the lock on the file, but then gets paused and is prematurely declared dead. Its lock expires, and a new node (Node-2) acquires the lock on the file and starts writing data to it. In the meantime Node-1 becomes active again with no idea that its lock has expired. Unknowingly, Node-1 also writes to the file, causing data corruption.

Token Based Solution

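Suppose that every time a node acquires the lock on the file, the lock service also returns a token, which must be sent along with every write to the file. This token is a number that increases every time the lock is granted (it is often called a fencing token).

The storage service remembers the highest token it has seen and validates the token on every write. If a write arrives with an older token, validation fails and the write request is rejected. In the scenario above, once Node-2 has written with its higher token, Node-1's late write is rejected.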
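A minimal sketch of this token check, assuming the file itself remembers the highest token it has accepted so far. The class names `LockService`, `TokenCheckedFile` and `StaleTokenError` are invented for this example, and lock expiry is not modelled: the second `acquire` call simply stands for the lock being granted again after it expired.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hands out a strictly increasing token with every lock grant."""

    def __init__(self):
        self._last_token = 0

    def acquire(self, node_id):
        self._last_token += 1
        return self._last_token


class TokenCheckedFile:
    """A file that rejects writes carrying an outdated token."""

    def __init__(self):
        self.content = ""
        self._highest_token_seen = 0

    def write(self, token, data):
        if token < self._highest_token_seen:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token_seen}"
            )
        self._highest_token_seen = token
        self.content = data


lock_service = LockService()
shared_file = TokenCheckedFile()

token_1 = lock_service.acquire("Node-1")  # Node-1 pauses, its lock expires
token_2 = lock_service.acquire("Node-2")  # Node-2 is granted the lock

shared_file.write(token_2, "written by Node-2")

try:
    # Node-1 wakes up and writes with its old token.
    shared_file.write(token_1, "written by Node-1")
except StaleTokenError as err:
    print("write rejected:", err)

print(shared_file.content)
```

Note that the check lives in the resource being written to, not in Node-1. The paused node cannot be trusted to notice that its lock has expired, so the file has to protect itself.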

Conclusion

We discussed multiple challenges faced while working with distributed systems. We looked at the problem of prematurely declaring nodes dead and its dangerous impact on the system. Finally, we discussed the multiple leader problem and a token-based solution for it.

Meanwhile, please Like and Share this edition with your peers, and subscribe to this newsletter to get notified when I publish more content. Share it with anyone who might benefit from it.

Until next time, Dive Deep and Keep Learning!

Systems That Scale

30,642 followers

Swarnabha Dutta

Intern at Prodigy InfoTech | Aspiring Software Engineer | Mastering Data Structures & Algorithms | Full-Stack MERN Developer | Building Scalable Web Solutions

1 year ago

Saurav Prateek Thank you for this valuable information

2 reactions

Caleb Adewole

Backend Software Engineer @Cudium | Interested in System Engineering and Distributed Systems

1 year ago

Saurav Prateek, thank you for the article, it was a good one. Quick question: the article indicates that a lot of things can go wrong when working with distributed systems. How should one handle cases with a third-party payment provider which on a few occasions seems to process already rejected transactions? Is there a way to prevent this from one's own service before sending a transaction to such a third party for processing, or is the settlement route the only way?

1 reaction
