What is High Availability in Clusters?

In Docker Swarm, there are two types of nodes: manager nodes and worker nodes. Manager nodes are responsible for managing the swarm, including tasks like scheduling tasks to worker nodes and maintaining the swarm's state. Worker nodes, on the other hand, are responsible for executing tasks that are assigned to them by manager nodes.

In every swarm, there is a single leader node that is responsible for making all of the swarm management and orchestration decisions. The leader node is elected by the other manager nodes in the swarm using the Raft consensus algorithm. If the leader node goes down, the remaining manager nodes will elect a new leader.
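
For example, from any manager node you can list the nodes in the swarm; the MANAGER STATUS column shows which manager is currently the Leader and which managers are Reachable. You can also query a single node directly (the <node> placeholder stands for a node name or ID):

# lists nodes; the MANAGER STATUS column shows Leader / Reachable
docker node ls
# prints true on the manager that currently holds the leader role
docker node inspect <node> --format '{{ .ManagerStatus.Leader }}'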

What is the difference between a Leader and a Manager node?

In a simple Docker Swarm mode setup, there is a single manager node and the other nodes are workers. In this case, that single manager is also the leader.

It is also possible to have more than one manager node, for example 2 managers (though an odd number such as 1, 3, or 5 is preferred). In that case only one of them is the leader, responsible for scheduling tasks onto the worker nodes, while the manager nodes talk to each other to keep the swarm state in sync. This is what makes the environment highly available: if the manager that is currently the leader goes down, scheduling must not stop, so another manager is automatically promoted to leader and takes over the responsibility of scheduling tasks (containers) on the worker nodes.
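
As a rough sketch of how such a setup is created (the IP address and token below are placeholders), you initialize the swarm on the first node, print the manager join token, and then run the join command on the other machines:

# initialize the swarm on the first manager (placeholder IP)
docker swarm init --advertise-addr 192.168.1.10
# print the join token for additional managers
docker swarm join-token manager
# run on the other machines to join them as managers
docker swarm join --token <manager-token> 192.168.1.10:2377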

High Availability

If one host in the cluster goes down, the other hosts can continue to run your applications; this is what we refer to as High Availability.

A highly available Docker Swarm setup ensures that if a node fails, services running on the failed node are re-provisioned and assigned to other available nodes in the cluster. A Docker Swarm setup that consists of only one or two manager nodes is not considered highly available, because the failure of a single manager is enough to interrupt management of the cluster. Therefore the minimum number of manager nodes in a highly available Swarm cluster is three.
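
To see the re-provisioning behaviour in practice, you can deploy a replicated service; if a node running one of the replicas fails, the swarm reschedules that replica on another available node (the service name and image here are only illustrative):

# create a service with 3 replicas spread across the cluster
docker service create --name web --replicas 3 nginx
# see which nodes the replicas were scheduled on
docker service ps web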

What happens if we have two manager nodes and one fails?

In this case the swarm loses its quorum: with two managers the majority required is still two, so the surviving manager cannot make management decisions on its own. Existing tasks keep running, but the cluster can no longer perform management operations until quorum is restored. This is why we maintain an odd number of managers in the swarm to support manager node failures.

Hence it is important to understand some key features of manager nodes to properly deploy and maintain the swarm.

Let's Understand Quorum

In a cluster, quorum is the majority of voting nodes in the active cluster membership plus a witness vote. Quorum ensures that there is only one owner of a particular resource at a time. It also acts as a definitive repository for the configuration information of physical clusters.

Quorum tells the cluster which node should be active at any given time. It also intervenes if communications fail between cluster nodes by determining which set of nodes gets to run the application at hand. Quorum checks for the minimum number of votes required to have a majority and own the resources. Each cluster node is allowed to cast its single vote.

Elections

In a Docker Swarm cluster, an election is held to determine the new leader when the current leader fails. The election process is based on the Raft consensus algorithm, a distributed consensus algorithm that ensures all nodes in the cluster agree on the state of the cluster. During an election the manager nodes take on two roles: candidates, which are the nodes running for leader, and voters, which are the nodes voting for a leader. The candidate that collects a majority of the votes becomes the new leader.

Raft Consensus Algorithm

The Raft Consensus Algorithm is used to manage the swarm state. Using this consensus method among the manager nodes ensures that, in the event of a failure of any manager, the remaining managers have enough information stored to continue operating the swarm as expected.

Raft tolerates up to (N-1)/2 failures and requires a majority of (N/2)+1 managers to agree on any new instruction that is proposed to the cluster for execution. For example, with 5 managers the swarm can tolerate the loss of 2, and still needs 3 managers to agree on every change.

What happens if there is more than one Leader?

If such a situation occurs, we run into the Split Brain problem. Split brain is a state in a server cluster where nodes conflict when handling incoming I/O operations. This can occur when a highly available system becomes fragmented due to a network partition, meaning the nodes in the cluster lose connectivity with each other. Each node then believes that it is the primary node responsible for serving requests.

Fault Tolerance

If the swarm loses the quorum of managers, the swarm cannot perform management tasks. If your swarm has multiple managers, always have more than two. To maintain quorum, a majority of managers must be available. An odd number of managers is recommended, because the next even number does not make the quorum easier to keep.

Keeping the quorum is not guaranteed if you encounter more than two network partitions.

Below is an overview of the fault tolerance capacity of the cluster, which varies with the swarm size (the majority is (N/2)+1 and the fault tolerance is (N-1)/2, rounded down):

Swarm Size    Majority    Fault Tolerance
    1             1              0
    2             2              0
    3             2              1
    4             3              1
    5             3              2
    6             4              2
    7             4              3

For instance, in a swarm with 5 nodes, if you lose 3 nodes, you don't have a quorum. Therefore you can't add or remove nodes until you recover one of the unavailable manager nodes or recover the swarm with disaster recovery commands.
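
As a sketch of such a disaster recovery, the usual last resort is to force-rebuild the cluster from a surviving manager; this makes that node the leader of a new single-manager swarm, from which you can then promote fresh managers (the address is a placeholder):

# run on a surviving manager; rebuilds a single-manager swarm from its local state
docker swarm init --force-new-cluster --advertise-addr <manager-ip>:2377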

Run manager-only nodes

By default manager nodes also act as worker nodes. This means the scheduler can assign tasks to a manager node.

To avoid interference with manager node operation, you can drain manager nodes to make them unavailable as worker nodes:

docker node update --availability drain <node>        

Changing node availability lets you:

  • Drain a manager node so that it only performs swarm management tasks and is unavailable for task assignment.
  • Drain a node so you can take it down for maintenance.
  • Pause a node so it can't receive new tasks.
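
For example (node names are placeholders), you can pause a node so it keeps its current tasks but receives no new ones, and later set it back to active to resume normal scheduling:

# stop assigning new tasks; existing tasks keep running
docker node update --availability pause <node>
# return the node to normal scheduling
docker node update --availability active <node>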

Promote or demote a node

You can promote a worker node to the manager role. This is useful when a manager node becomes unavailable or if you want to take a manager offline for maintenance.

To promote a node (or a set of nodes) in Docker, we use the docker node promote command:

docker node promote <node>         

To demote a node (or a set of nodes), run:

docker node demote <node>        

Instead of the promote and demote commands, we can also use the docker node update command with the --role flag to promote or demote a node.
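
For example, where <node> is the node name or ID:

# promote a worker to manager
docker node update --role manager <node>
# demote a manager to worker
docker node update --role worker <node>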

