Consistent Hashing: Architecture Pattern

Imagine you are building a distributed system that needs to store and retrieve data across multiple servers. One of the main challenges is deciding which server should be responsible for storing a given piece of data. Traditional approaches like modulo-based hashing often lead to inefficient data distribution and expensive rebalancing when the number of servers changes.

Problem with Modulo-Based Hashing

Let's illustrate the problems with modulo-based hashing using a simple example of a distributed system with four servers (S1, S2, S3, and S4) and a dataset of five items (A, B, C, D, and E). We'll use a modulo-based hashing technique to assign data items to servers based on their hash values.

Step 1: Calculate Hash Values Assuming our hash function assigns the following hash values to the data items:

A -> hash(A) = 11 
B -> hash(B) = 20 
C -> hash(C) = 32 
D -> hash(D) = 45 
E -> hash(E) = 50

Step 2: Assign Data Items to Servers Now, we distribute the data items to servers using the modulo operator (%): a remainder of r maps to server S(r+1), so remainder 0 goes to S1, 1 to S2, 2 to S3, and 3 to S4:

A -> S4 (hash(A) % 4 = 11 % 4 = 3) 
B -> S1 (hash(B) % 4 = 20 % 4 = 0) 
C -> S1 (hash(C) % 4 = 32 % 4 = 0) 
D -> S2 (hash(D) % 4 = 45 % 4 = 1) 
E -> S3 (hash(E) % 4 = 50 % 4 = 2)        

Data Distribution:

S1: B, C 
S2: D 
S3: E 
S4: A        
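The assignment above can be sketched in a few lines of Python. The hash values are hardcoded from the example, and the remainder-to-server convention (remainder r maps to server S(r+1)) is our assumption, since any fixed convention illustrates the same point:

```python
# Modulo-based assignment for the example above.
# Hash values are hardcoded; remainder r maps to server S(r+1).
HASHES = {"A": 11, "B": 20, "C": 32, "D": 45, "E": 50}
NUM_SERVERS = 4

def server_for(item: str) -> str:
    return f"S{HASHES[item] % NUM_SERVERS + 1}"

for item in sorted(HASHES):
    print(item, "->", server_for(item))
```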

Problem 1: Inefficient Data Redistribution

Now, let's assume we add a new server, S5, to the system. According to the modulo-based hashing, the data distribution will change as follows:

A -> S2 (hash(A) % 5 = 11 % 5 = 1) 
B -> S1 (hash(B) % 5 = 20 % 5 = 0) 
C -> S3 (hash(C) % 5 = 32 % 5 = 2) 
D -> S1 (hash(D) % 5 = 45 % 5 = 0) 
E -> S1 (hash(E) % 5 = 50 % 5 = 0)        

Updated Data Distribution:

S1: B, D, E 
S2: A 
S3: C 
S4: 
S5:        

As we can see, the redistribution is highly inefficient. Ideally, adding S5 would move only a small fraction of the data, and only onto the new server. Instead, most of the items changed servers, yet the new server S5 received no data at all. This process becomes even more problematic as the system scales and more servers are added, leading to inefficiency and resource waste.
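The churn is easy to quantify by computing the assignments before and after the resize. Hash values are hardcoded from the example, and the remainder-to-server convention (r maps to S(r+1)) is our assumption:

```python
# Count how many items change servers when going from 4 to 5 servers.
HASHES = {"A": 11, "B": 20, "C": 32, "D": 45, "E": 50}

def assign(num_servers: int) -> dict:
    # Remainder r maps to server S(r+1).
    return {item: f"S{h % num_servers + 1}" for item, h in HASHES.items()}

before, after = assign(4), assign(5)
moved = sorted(k for k in HASHES if before[k] != after[k])
print(f"{len(moved)} of {len(HASHES)} items moved:", moved)
```

With these hash values, four of the five items end up on a different server even though only one server was added.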

Problem 2: Imbalanced Load and Hotspots

Modulo-based hashing uses the remainder of the hash value divided by the total number of servers to determine the server responsible for storing a data item. Because the assignment depends only on that remainder, hash values that cluster around particular remainders pile up on the same servers: some servers become overloaded hotspots while others remain underutilized, resulting in poor load balancing. The example above shows the same effect.

The modulo-based hashing approach has evident problems when it comes to dynamic distributed systems. To overcome these problems, Consistent Hashing provides a better solution, ensuring better data distribution, load balancing, and fault tolerance in distributed systems.

Consistent Hashing

Consistent Hashing is a distributed hashing technique that minimizes data redistribution when the number of servers changes. The idea is to map both data and servers onto a circle, which is represented as a hash ring.

Each server is associated with one or more points on the ring, and data is placed on the ring based on its hash value. To find the server responsible for a piece of data, walk clockwise from the data item's position and pick the first server point you encounter, wrapping around past the end of the ring if necessary.
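A minimal sketch of this ring in Python uses a sorted list of positions and binary search for the clockwise lookup. Ring positions are supplied explicitly here to keep the example deterministic; a real implementation would derive them by hashing the server name (for example with MD5 or MurmurHash):

```python
import bisect

class HashRing:
    """Minimal consistent-hash ring sketch.

    Positions are passed in explicitly; production code would
    hash server names to generate them.
    """

    def __init__(self):
        self._points = []   # sorted ring positions
        self._owner = {}    # position -> server name

    def add_server(self, name, positions):
        for p in positions:
            bisect.insort(self._points, p)
            self._owner[p] = name

    def remove_server(self, name):
        self._points = [p for p in self._points if self._owner[p] != name]
        self._owner = {p: s for p, s in self._owner.items() if s != name}

    def lookup(self, key_hash):
        # First server point at or after the key's hash, wrapping around.
        i = bisect.bisect_left(self._points, key_hash)
        return self._owner[self._points[i % len(self._points)]]
```

`bisect` keeps the lookup at O(log n) in the number of ring points, which is why sorted structures (or balanced trees) are the usual choice for the ring.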

Handling Server Additions and Removals

Step 1: Setting up the Initial Configuration We start with four servers (S1, S2, S3, and S4) and distribute the data items (A, B, C, D, and E) using Consistent Hashing.
Step 2: Assign Data Items to Servers using Consistent Hashing We use a hash function to map both the data items and the servers onto a hash ring; each server is associated with one or more points on the ring. Mapping each server to multiple points on the ring (virtual nodes) reduces the chance of a skewed data distribution across servers.

Hash Ring:

S1 -> 10, 30, 50 
S2 -> 20, 40, 60 
S3 -> 70, 90, 110 
S4 -> 80, 100, 120        
[Figure: Hash Ring]


Data Items:

A -> hash(A) = 25 
B -> hash(B) = 55 
C -> hash(C) = 85 
D -> hash(D) = 115 
E -> hash(E) = 35        

Data Distribution:

S1 -> A 
S2 -> B, E
S3 -> C 
S4 -> D        
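This distribution can be verified with a short self-contained snippet. Positions and hash values are taken from the tables above, and the clockwise rule is implemented as "first ring point at or after the key's hash, wrapping around":

```python
import bisect

# (position, server) pairs for the four servers' virtual nodes.
RING = sorted([(10, "S1"), (30, "S1"), (50, "S1"),
               (20, "S2"), (40, "S2"), (60, "S2"),
               (70, "S3"), (90, "S3"), (110, "S3"),
               (80, "S4"), (100, "S4"), (120, "S4")])
POINTS = [p for p, _ in RING]

def lookup(key_hash: int) -> str:
    # First server point at or after the key's hash, wrapping around.
    i = bisect.bisect_left(POINTS, key_hash)
    return RING[i % len(RING)][1]

for item, h in {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}.items():
    print(item, "->", lookup(h))
```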
Step 3: Handling Server Addition (Adding S5) Let's add a new server, S5, to the system. To distribute the data efficiently, we place S5 on the hash ring as well.

Hash Ring (with S5):

S1 -> 10, 30, 50 
S2 -> 20, 40, 60 
S3 -> 70, 90, 110 
S4 -> 80, 100, 120 
S5 -> 55, 75, 105        

Data Distribution after adding S5:

S1 -> A
S2 -> E
S3 -> C 
S4 -> D
S5 -> B         
[Figure: Hash Ring after adding S5]

As we can see, the addition of S5 affected only data item B, which is now mapped to S5. The rest of the data remains unaffected. This demonstrates the efficiency of Consistent Hashing in handling server additions with minimal data migration.

Step 4: Handling Server Removal (Removing S2) Now, let's remove server S2 from the system and redistribute the data accordingly.

Hash Ring (after removing S2):

S1 -> 10, 30, 50 
S3 -> 70, 90, 110 
S4 -> 80, 100, 120 
S5 -> 55, 75, 105        

Data Distribution after removing S2:

S1 -> A, E 
S5 -> B
S3 -> C 
S4 -> D        
[Figure: Hash Ring with S2 removed]

The removal of S2 affected only the data items that were previously mapped to it, which in this example is just E. E has now been redistributed to the next server clockwise, S1. The rest of the data remains unchanged.
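Both migrations can be checked in one self-contained sketch: build the assignment for each configuration and diff them. Positions and hash values are as in the example; the tie-breaking rule that a key whose hash equals a server point maps to that server (B at 55 lands on S5's point at 55) is our assumption:

```python
import bisect

ITEMS = {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}
BASE = {"S1": [10, 30, 50], "S2": [20, 40, 60],
        "S3": [70, 90, 110], "S4": [80, 100, 120]}

def assignments(servers: dict) -> dict:
    # Flatten virtual nodes into one sorted ring, then do clockwise lookups.
    ring = sorted((p, s) for s, pts in servers.items() for p in pts)
    points = [p for p, _ in ring]
    def lookup(h):
        i = bisect.bisect_left(points, h)
        return ring[i % len(ring)][1]
    return {item: lookup(h) for item, h in ITEMS.items()}

with_s5 = {**BASE, "S5": [55, 75, 105]}
without_s2 = {s: pts for s, pts in with_s5.items() if s != "S2"}

step0, step1, step2 = assignments(BASE), assignments(with_s5), assignments(without_s2)
print("adding S5 moved:", sorted(k for k in ITEMS if step0[k] != step1[k]))
print("removing S2 moved:", sorted(k for k in ITEMS if step1[k] != step2[k]))
```

Each topology change moves exactly one item, in contrast to the modulo-based example where adding one server moved most of the dataset.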

Advantages of Consistent Hashing

  1. Load Balancing: With Consistent Hashing, data distribution across servers tends to be balanced. When a new server is added or an existing one is removed, only a small fraction of the data needs to be moved, making the system highly scalable and efficient.
  2. Fault Tolerance: In Consistent Hashing, the loss of a server affects only the data mapped to that server's location on the ring. The rest of the data remains unaffected, and the system can gracefully handle the failure without causing a complete disruption.
  3. Easy to Implement: The concept of Consistent Hashing is relatively simple to understand and implement. It has become a fundamental technique used in many distributed systems like distributed caching, content delivery networks (CDNs), peer-to-peer networks, and load balancers.

Disadvantages of Consistent Hashing

While there are no disqualifying drawbacks to Consistent Hashing, the following points are important to highlight: Consistent Hashing reduces the problems of data distribution, it does not eliminate them!

  1. Overhead for Virtual Nodes: Virtual nodes (replicas) are often used in Consistent Hashing to achieve better data distribution. These virtual nodes introduce additional overhead in terms of memory and computational resources, as each physical server needs to manage multiple virtual nodes.
  2. Inefficient for Small Datasets: In scenarios where the dataset is relatively small or the number of servers is limited, Consistent Hashing may not provide significant advantages over simpler hashing techniques like modulo-based hashing. The overhead introduced by Consistent Hashing could outweigh its benefits for small-scale systems.
  3. Scalability with High Churn: In dynamic systems with frequent node additions and removals (high churn rate), Consistent Hashing may experience increased overhead due to the need for frequent data redistribution. High churn can impact system stability and performance, requiring careful management.
  4. Inability to Consider Node Capacity: Consistent Hashing primarily focuses on balancing data distribution among nodes. However, it doesn't take into account the varying capacity of individual nodes. In practical scenarios, nodes may have different hardware capabilities or storage capacities, and plain Consistent Hashing may not fully optimize resource utilization in such cases. A common mitigation is to assign more virtual nodes to higher-capacity servers; otherwise, you might prefer homogeneous nodes when working with Consistent Hashing!


Thank you for reading! I’ll be posting weekly content on distributed systems & patterns, so please like, share and subscribe to this newsletter for notifications of new posts.

Please comment on the post with your feedback, it will help me improve! :)

Until next time, Keep asking questions & Keep learning!

Pratik Pandey

Senior Software Engineer at Booking.com | AWS Serverless Community Builder | pratikpandey.substack.com
