What is High Availability in Clusters?

In Docker Swarm, there are two types of nodes: manager nodes and worker nodes. Manager nodes are responsible for managing the swarm, including tasks like scheduling tasks to worker nodes and maintaining the swarm's state. Worker nodes, on the other hand, are responsible for executing tasks that are assigned to them by manager nodes.

In every swarm, there is a single leader node that is responsible for making all of the swarm management and orchestration decisions. The leader node is elected by the other manager nodes in the swarm using the Raft consensus algorithm. If the leader node goes down, the remaining manager nodes will elect a new leader.
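
For example, from any manager node you can list the nodes in the swarm; the MANAGER STATUS column shows which manager is currently the Leader and which managers are Reachable. You can also query a single node directly (the <node> placeholder stands for a node name or ID):

# lists nodes; the MANAGER STATUS column shows Leader / Reachable
docker node ls
# prints true on the manager that currently holds the leader role
docker node inspect <node> --format '{{ .ManagerStatus.Leader }}'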

What is the difference between a Leader and a Manager node?

In a simple Docker Swarm mode setup, there is a single manager node and the other nodes are workers. In this case, that single manager is also the leader.

It is also possible to have more than one manager node, for example 2 managers (though an odd number such as 1, 3, or 5 is preferred). In that case only one of them is the leader, responsible for scheduling tasks onto the worker nodes, while the manager nodes talk to each other to keep the swarm state in sync. This is what makes the environment highly available: if the manager that is currently the leader goes down, scheduling must not stop, so another manager is automatically promoted to leader and takes over the responsibility of scheduling tasks (containers) on the worker nodes.
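
As a rough sketch of how such a setup is created (the IP address and token below are placeholders), you initialize the swarm on the first node, print the manager join token, and then run the join command on the other machines:

# initialize the swarm on the first manager (placeholder IP)
docker swarm init --advertise-addr 192.168.1.10
# print the join token for additional managers
docker swarm join-token manager
# run on the other machines to join them as managers
docker swarm join --token <manager-token> 192.168.1.10:2377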

High Availability

If one host in the cluster goes down, the other hosts can continue to run your applications; this is what we refer to as High Availability.

A highly available Docker Swarm setup ensures that if a node fails, services running on the failed node are re-provisioned and assigned to other available nodes in the cluster. A Docker Swarm setup that consists of only one or two manager nodes is not considered highly available, because the failure of a single manager is enough to interrupt management of the cluster. Therefore the minimum number of manager nodes in a highly available Swarm cluster is three.
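
To see the re-provisioning behaviour in practice, you can deploy a replicated service; if a node running one of the replicas fails, the swarm reschedules that replica on another available node (the service name and image here are only illustrative):

# create a service with 3 replicas spread across the cluster
docker service create --name web --replicas 3 nginx
# see which nodes the replicas were scheduled on
docker service ps web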

What happens if we have two manager nodes and one fails?

In this case the swarm loses its quorum: with two managers the majority required is still two, so the surviving manager cannot make management decisions on its own. Existing tasks keep running, but the cluster can no longer perform management operations until quorum is restored. This is why we maintain an odd number of managers in the swarm to support manager node failures.

Hence it is important to understand some key features of manager nodes to properly deploy and maintain the swarm.

Let's Understand Quorum

In a cluster, quorum is the majority of voting nodes in the active cluster membership plus a witness vote. Quorum ensures that there is only one owner of a particular resource at a time. It also acts as a definitive repository for the configuration information of physical clusters.

Quorum tells the cluster which node should be active at any given time. It also intervenes if communications fail between cluster nodes by determining which set of nodes gets to run the application at hand. Quorum checks for the minimum number of votes required to have a majority and own the resources. Each cluster node is allowed to cast its single vote.

Elections

In a Docker Swarm cluster, an election is held to determine the new leader when the current leader fails. The election process is based on the Raft consensus algorithm, a distributed consensus algorithm that ensures all nodes in the cluster agree on the state of the cluster. During an election the manager nodes take on two roles: candidates, which are the nodes running for leader, and voters, which are the nodes voting for a leader. The candidate that collects a majority of the votes becomes the new leader.

Raft Consensus Algorithm

The Raft Consensus Algorithm is used to manage the swarm state. Using this consensus method among the manager nodes ensures that, in the event of a failure of any manager, the remaining managers have enough information stored to continue operating the swarm as expected.

Raft tolerates up to (N-1)/2 failures and requires a majority of (N/2)+1 managers to agree on any new instruction that is proposed to the cluster for execution. For example, with 5 managers the swarm can tolerate the loss of 2, and still needs 3 managers to agree on every change.

What happens if there is more than one Leader?

If such a situation occurs, we run into the Split Brain problem. Split brain is a state in a server cluster where nodes conflict when handling incoming I/O operations. This can occur when a highly available system becomes fragmented due to a network partition, meaning the nodes in the cluster lose connectivity with each other. Each node then believes that it is the primary node responsible for serving requests.

Fault Tolerance

If the swarm loses the quorum of managers, the swarm cannot perform management tasks. If your swarm has multiple managers, always have more than two. To maintain quorum, a majority of managers must be available. An odd number of managers is recommended, because the next even number does not make the quorum easier to keep.

Keeping the quorum is not guaranteed if you encounter more than two network partitions.

Below is an overview of the fault tolerance capacity of the cluster, which varies with the swarm size (the majority is (N/2)+1 and the fault tolerance is (N-1)/2, rounded down):

Swarm Size    Majority    Fault Tolerance
    1             1              0
    2             2              0
    3             2              1
    4             3              1
    5             3              2
    6             4              2
    7             4              3

For instance, in a swarm with 5 nodes, if you lose 3 nodes, you don't have a quorum. Therefore you can't add or remove nodes until you recover one of the unavailable manager nodes or recover the swarm with disaster recovery commands.
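
As a sketch of such a disaster recovery, the usual last resort is to force-rebuild the cluster from a surviving manager; this makes that node the leader of a new single-manager swarm, from which you can then promote fresh managers (the address is a placeholder):

# run on a surviving manager; rebuilds a single-manager swarm from its local state
docker swarm init --force-new-cluster --advertise-addr <manager-ip>:2377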

Run manager-only nodes

By default manager nodes also act as worker nodes. This means the scheduler can assign tasks to a manager node.

To avoid interference with manager node operation, you can drain manager nodes to make them unavailable as worker nodes:

docker node update --availability drain <node>        

Changing node availability lets you:

  • Drain a manager node so that it only performs swarm management tasks and is unavailable for task assignment.
  • Drain a node so you can take it down for maintenance.
  • Pause a node so it can't receive new tasks.
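
For example (node names are placeholders), you can pause a node so it keeps its current tasks but receives no new ones, and later set it back to active to resume normal scheduling:

# stop assigning new tasks; existing tasks keep running
docker node update --availability pause <node>
# return the node to normal scheduling
docker node update --availability active <node>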

Promote or demote a node

You can promote a worker node to the manager role. This is useful when a manager node becomes unavailable or if you want to take a manager offline for maintenance.

To promote a node (or a set of nodes) in Docker, we use the docker node promote command:

docker node promote <node>         

To demote a node (or a set of nodes), run:

docker node demote <node>        

Instead of the promote and demote commands, we can also use the docker node update command with the --role flag to promote or demote a node.
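
For example, where <node> is the node name or ID:

# promote a worker to manager
docker node update --role manager <node>
# demote a manager to worker
docker node update --role worker <node>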

