Consistent Hashing: Architecture Pattern
Pratik Pandey
Senior Software Engineer at Booking.com | AWS Serverless Community Builder | pratikpandey.substack.com
Imagine you are building a distributed system that needs to store and retrieve data across multiple servers. One of the main challenges is deciding which server should be responsible for storing a given piece of data. Traditional approaches like modulo-based hashing often lead to inefficient data distribution and expensive rebalancing whenever the number of servers changes.
Problem with Modulo-Based Hashing
Let's illustrate the problems with modulo-based hashing using a simple example of a distributed system with four servers (S1, S2, S3, and S4) and a dataset of five items (A, B, C, D, and E). We'll use a modulo-based hashing technique to assign data items to servers based on their hash values.
Step 1: Calculate Hash Values
Assume our hash function assigns the following hash values to the data items:
A -> hash(A) = 11
B -> hash(B) = 20
C -> hash(C) = 32
D -> hash(D) = 45
E -> hash(E) = 50
Step 2: Assign Data Items to Servers
Now we distribute the data items to servers using the modulo operator (%): each item goes to the server determined by its hash value modulo the number of servers.
Mapping remainder r to server S(r+1), we get:
A -> S4 (hash(A) % 4 = 11 % 4 = 3)
B -> S1 (hash(B) % 4 = 20 % 4 = 0)
C -> S1 (hash(C) % 4 = 32 % 4 = 0)
D -> S2 (hash(D) % 4 = 45 % 4 = 1)
E -> S3 (hash(E) % 4 = 50 % 4 = 2)
Data Distribution:
S1: B, C
S2: D
S3: E
S4: A
Problem 1: Inefficient Data Redistribution
Now, let's assume we add a new server, S5, to the system. Under modulo-based hashing the divisor becomes 5, so the data distribution changes as follows:
A -> S2 (hash(A) % 5 = 11 % 5 = 1)
B -> S1 (hash(B) % 5 = 20 % 5 = 0)
C -> S3 (hash(C) % 5 = 32 % 5 = 2)
D -> S1 (hash(D) % 5 = 45 % 5 = 0)
E -> S1 (hash(E) % 5 = 50 % 5 = 0)
Updated Data Distribution:
S1: B, D, E
S2: A
S3: C
S4: (empty)
S5: (empty)
As we can see, the redistribution is highly inefficient. Ideally, adding a fifth server would move only the roughly one fifth of the keys destined for S5 and leave everything else in place. Instead, most of the items changed servers, one server ended up holding three of the five items, and the new server S5 received no data at all. This becomes even more problematic as the system scales and servers are added or removed, leading to massive data movement and wasted resources.
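The arithmetic above can be checked with a short sketch. The hash values are taken from the example; the convention that remainder r maps to server S(r+1) is an assumption made to keep the assignment concrete, but the conclusion (most keys move and the new server gets nothing) holds regardless of which convention is used.

```python
# Sketch: how many of the example keys move when a 5th server is added
# under modulo placement. Hash values are copied from the example above.
hashes = {"A": 11, "B": 20, "C": 32, "D": 45, "E": 50}

def modulo_placement(hashes, num_servers):
    # Convention (assumed): server S(r+1) owns every key with hash % num_servers == r.
    return {key: f"S{h % num_servers + 1}" for key, h in hashes.items()}

before = modulo_placement(hashes, 4)
after = modulo_placement(hashes, 5)
moved = [k for k in hashes if before[k] != after[k]]

print(before)  # {'A': 'S4', 'B': 'S1', 'C': 'S1', 'D': 'S2', 'E': 'S3'}
print(after)   # {'A': 'S2', 'B': 'S1', 'C': 'S3', 'D': 'S1', 'E': 'S1'}
print(moved)   # ['A', 'C', 'D', 'E']  (four of five keys relocated)
```

Only one key stays put, and S5 receives nothing, even though adding a server should ideally move only about 1/5 of the keys.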
Problem 2: Imbalanced Load and Hotspots
Modulo-based hashing uses the remainder of the hash value divided by the number of servers to determine the server responsible for each data item. Because the hash value alone decides placement, there is no way to correct for skew: with a small number of servers, a small or structured key set, or a weak hash function, some servers become overloaded while others sit underutilized, resulting in poor load balancing. You can see this in the example above, where one server ends up holding three of the five items.
The modulo-based hashing approach has evident problems when it comes to dynamic distributed systems. To overcome these problems, Consistent Hashing provides a better solution, ensuring better data distribution, load balancing, and fault tolerance in distributed systems.
Consistent Hashing
Consistent Hashing is a distributed hashing technique that minimizes data redistribution when the number of servers changes. The idea is to map both data and servers onto a circle, which is represented as a hash ring.
Each server is associated with one or more points on the ring, and data is placed on the ring based on its hash value. To find the server responsible for a specific piece of data, one simply needs to find the closest server in the clockwise direction on the ring.
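The clockwise lookup can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the server names, the choice of MD5 as the ring hash, and the virtual-node count are all assumptions.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    # Map a string to a deterministic point on the ring (MD5 is an
    # arbitrary but stable choice here; any good hash function works).
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=3):
        # Place each server at `vnodes` points (virtual nodes) on the ring.
        self._points = sorted(
            (ring_position(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def lookup(self, key: str) -> str:
        # The first server point clockwise at or after the key's position
        # owns the key; wrap around to the start of the ring if needed.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[idx % len(self._points)][1]

ring = HashRing(["S1", "S2", "S3", "S4"])
print(ring.lookup("user:42"))  # deterministic; one of S1..S4
```

Because lookups only walk the sorted ring, adding or removing a server changes ownership only for keys falling between the affected points.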
Handling Server Additions and Removals
Step 1: Setting up the Initial Configuration
We start with four servers (S1, S2, S3, and S4) and distribute the data items (A, B, C, D, and E) using Consistent Hashing.
Step 2: Assign Data Items to Servers using Consistent Hashing
We use a hash function to map both the data items and the servers onto a hash ring, with each server associated with one or more points on the ring. Mapping a server to multiple points on the ring (known as virtual nodes) reduces the chance of a skewed data distribution across your servers.
Hash Ring:
S1 -> 10, 30, 50
S2 -> 20, 40, 60
S3 -> 70, 90, 110
S4 -> 80, 100, 120
Data Items:
A -> hash(A) = 25
B -> hash(B) = 55
C -> hash(C) = 85
D -> hash(D) = 115
E -> hash(E) = 35
Data Distribution:
S1 -> A
S2 -> B, E
S3 -> C
S4 -> D
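This placement can be verified with a small sketch. The ring points are copied verbatim from the example above rather than produced by a hash function, and the `owner` helper is an illustrative name for the clockwise lookup.

```python
import bisect

# Ring points and item positions taken from the example walkthrough.
RING = {"S1": [10, 30, 50], "S2": [20, 40, 60],
        "S3": [70, 90, 110], "S4": [80, 100, 120]}
ITEMS = {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}

def owner(ring, position):
    # The first server point clockwise at or after `position` owns the key.
    points = sorted((p, s) for s, ps in ring.items() for p in ps)
    idx = bisect.bisect_left([p for p, _ in points], position)
    return points[idx % len(points)][1]

placement = {item: owner(RING, pos) for item, pos in ITEMS.items()}
print(placement)
# {'A': 'S1', 'B': 'S2', 'C': 'S3', 'D': 'S4', 'E': 'S2'}
```

For example, A at position 25 walks clockwise to the first point at or after it, 30, which belongs to S1.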
Step 3: Handling Server Addition (Adding S5)
Let's add a new server, S5, to the system. To distribute the data efficiently, we place S5 on the hash ring as well.
Hash Ring (with S5):
S1 -> 10, 30, 50
S2 -> 20, 40, 60
S3 -> 70, 90, 110
S4 -> 80, 100, 120
S5 -> 55, 75, 105
Data Distribution after adding S5:
S1 -> A
S2 -> E
S3 -> C
S4 -> D
S5 -> B
As we can see, the addition of S5 affected only data item B, which is now mapped to S5. The rest of the data remains unaffected. This demonstrates the efficiency of Consistent Hashing in handling server additions with minimal data migration.
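A quick sketch confirms that only B is remapped when S5 joins. As before, the ring points are copied from the example and `owner` is an illustrative helper, not a library function.

```python
import bisect

RING = {"S1": [10, 30, 50], "S2": [20, 40, 60],
        "S3": [70, 90, 110], "S4": [80, 100, 120]}
ITEMS = {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}

def owner(ring, position):
    # The first server point clockwise at or after `position` owns the key.
    points = sorted((p, s) for s, ps in ring.items() for p in ps)
    idx = bisect.bisect_left([p for p, _ in points], position)
    return points[idx % len(points)][1]

before = {item: owner(RING, pos) for item, pos in ITEMS.items()}
RING["S5"] = [55, 75, 105]  # add the new server's points to the ring
after = {item: owner(RING, pos) for item, pos in ITEMS.items()}
moved = [k for k in ITEMS if before[k] != after[k]]
print(moved)  # ['B']  (only B moves, from S2 to S5)
```

B at position 55 now hits S5's point at 55 before reaching S2's point at 60; every other item's clockwise walk is unchanged.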
Step 4: Handling Server Removal (Removing S2)
Now, let's remove server S2 from the system and redistribute the data accordingly.
Hash Ring (after removing S2):
S1 -> 10, 30, 50
S3 -> 70, 90, 110
S4 -> 80, 100, 120
S5 -> 55, 75, 105
Data Distribution after removing S2:
S1 -> A, E
S5 -> B
S3 -> C
S4 -> D
The removal of S2 affected only the data item that was previously mapped to S2, namely E. Its position (35) now falls to the next point clockwise, 50, which belongs to the neighbouring server S1. The rest of the data remains unchanged.
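The final state can be checked the same way, with S2's points dropped from the ring (points copied from the example; `owner` is an illustrative helper):

```python
import bisect

# Ring after removing S2's points (20, 40, 60).
RING = {"S1": [10, 30, 50], "S3": [70, 90, 110],
        "S4": [80, 100, 120], "S5": [55, 75, 105]}
ITEMS = {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}

def owner(ring, position):
    # The first server point clockwise at or after `position` owns the key.
    points = sorted((p, s) for s, ps in ring.items() for p in ps)
    idx = bisect.bisect_left([p for p, _ in points], position)
    return points[idx % len(points)][1]

placement = {item: owner(RING, pos) for item, pos in ITEMS.items()}
print(placement)
# {'A': 'S1', 'B': 'S5', 'C': 'S3', 'D': 'S4', 'E': 'S1'}
```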
Advantages of Consistent Hashing
- Minimal data movement: adding or removing a server remaps only the keys that fall between the affected points, as the walkthrough above shows.
- Better load balancing: virtual nodes spread each server across the ring, reducing skew.
- Fault tolerance: when a server fails, its keys are taken over by the next servers clockwise instead of forcing a cluster-wide rehash.
Disadvantages of Consistent Hashing
While there are no major disadvantages to using Consistent Hashing, a few points are important to highlight, because consistent hashing reduces data distribution problems rather than eliminating them:
- Without enough virtual nodes, keys can still be distributed unevenly across servers.
- Virtual nodes add memory and bookkeeping overhead, since each server occupies many points on the ring.
- The ring balances keys, not load: a few very popular ("hot") keys can still overload a single server.
Thank you for reading! I’ll be posting weekly content on distributed systems & patterns, so please like, share and subscribe to this newsletter for notifications of new posts.
Please comment on the post with your feedback, it will help me improve! :)
Until next time, Keep asking questions & Keep learning!