Consistent Hashing: Architecture Pattern
Pratik Pandey
Senior Software Engineer at Booking.com | AWS Serverless Community Builder | pratikpandey.substack.com
Imagine you are building a distributed system that needs to store and retrieve data across multiple servers. One of the main challenges is deciding which server should be responsible for storing a given piece of data. Traditional approaches like modulo-based hashing often lead to inefficient data distribution and expensive rebalancing whenever the number of servers changes.
Problem with Modulo-Based Hashing
Let's illustrate the problems with modulo-based hashing using a simple example of a distributed system with four servers (S1, S2, S3, and S4) and a dataset of five items (A, B, C, D, and E). We'll use a modulo-based hashing technique to assign data items to servers based on their hash values.
Step 1: Calculate Hash Values
Assume our hash function assigns the following hash values to the data items:
A -> hash(A) = 11
B -> hash(B) = 20
C -> hash(C) = 32
D -> hash(D) = 45
E -> hash(E) = 50
Step 2: Assign Data Items to Servers
Now we distribute the data items to servers using the modulo operator (%): each item goes to the server determined by its hash value modulo the number of servers.
Mapping remainder r to server S(r+1), we get:
A -> S4 (hash(A) % 4 = 11 % 4 = 3)
B -> S1 (hash(B) % 4 = 20 % 4 = 0)
C -> S1 (hash(C) % 4 = 32 % 4 = 0)
D -> S2 (hash(D) % 4 = 45 % 4 = 1)
E -> S3 (hash(E) % 4 = 50 % 4 = 2)
Data Distribution:
S1: B, C
S2: D
S3: E
S4: A
Problem 1: Inefficient Data Redistribution
Now, let's assume we add a new server, S5, to the system. Under modulo-based hashing the divisor becomes 5, so the data distribution changes as follows:
A -> S2 (hash(A) % 5 = 11 % 5 = 1)
B -> S1 (hash(B) % 5 = 20 % 5 = 0)
C -> S3 (hash(C) % 5 = 32 % 5 = 2)
D -> S1 (hash(D) % 5 = 45 % 5 = 0)
E -> S1 (hash(E) % 5 = 50 % 5 = 0)
Updated Data Distribution:
S1: B, D, E
S2: A
S3: C
S4: (empty)
S5: (empty)
As we can see, the redistribution is highly inefficient. Ideally, adding a fifth server would move only the roughly one fifth of the keys destined for S5 and leave everything else in place. Instead, most of the items changed servers, one server ended up holding three of the five items, and the new server S5 received no data at all. This becomes even more problematic as the system scales and servers are added or removed, leading to massive data movement and wasted resources.
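The arithmetic above can be checked with a short sketch. The hash values are taken from the example; the convention that remainder r maps to server S(r+1) is an assumption made to keep the assignment concrete, but the conclusion (most keys move and the new server gets nothing) holds regardless of which convention is used.

```python
# Sketch: how many of the example keys move when a 5th server is added
# under modulo placement. Hash values are copied from the example above.
hashes = {"A": 11, "B": 20, "C": 32, "D": 45, "E": 50}

def modulo_placement(hashes, num_servers):
    # Convention (assumed): server S(r+1) owns every key with hash % num_servers == r.
    return {key: f"S{h % num_servers + 1}" for key, h in hashes.items()}

before = modulo_placement(hashes, 4)
after = modulo_placement(hashes, 5)
moved = [k for k in hashes if before[k] != after[k]]

print(before)  # {'A': 'S4', 'B': 'S1', 'C': 'S1', 'D': 'S2', 'E': 'S3'}
print(after)   # {'A': 'S2', 'B': 'S1', 'C': 'S3', 'D': 'S1', 'E': 'S1'}
print(moved)   # ['A', 'C', 'D', 'E']  (four of five keys relocated)
```

Only one key stays put, and S5 receives nothing, even though adding a server should ideally move only about 1/5 of the keys.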
Problem 2: Imbalanced Load and Hotspots
Modulo-based hashing uses the remainder of the hash value divided by the number of servers to determine the server responsible for each data item. Because the hash value alone decides placement, there is no way to correct for skew: with a small number of servers, a small or structured key set, or a weak hash function, some servers become overloaded while others sit underutilized, resulting in poor load balancing. You can see this in the example above, where one server ends up holding three of the five items.
The modulo-based hashing approach has evident problems when it comes to dynamic distributed systems. To overcome these problems, Consistent Hashing provides a better solution, ensuring better data distribution, load balancing, and fault tolerance in distributed systems.
Consistent Hashing
Consistent Hashing is a distributed hashing technique that minimizes data redistribution when the number of servers changes. The idea is to map both data and servers onto a circle, which is represented as a hash ring.
Each server is associated with one or more points on the ring, and data is placed on the ring based on its hash value. To find the server responsible for a specific piece of data, one simply needs to find the closest server in the clockwise direction on the ring.
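The clockwise lookup can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the server names, the choice of MD5 as the ring hash, and the virtual-node count are all assumptions.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    # Map a string to a deterministic point on the ring (MD5 is an
    # arbitrary but stable choice here; any good hash function works).
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=3):
        # Place each server at `vnodes` points (virtual nodes) on the ring.
        self._points = sorted(
            (ring_position(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def lookup(self, key: str) -> str:
        # The first server point clockwise at or after the key's position
        # owns the key; wrap around to the start of the ring if needed.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[idx % len(self._points)][1]

ring = HashRing(["S1", "S2", "S3", "S4"])
print(ring.lookup("user:42"))  # deterministic; one of S1..S4
```

Because lookups only walk the sorted ring, adding or removing a server changes ownership only for keys falling between the affected points.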
Handling Server Additions and Removals
Step 1: Setting up the Initial Configuration
We start with four servers (S1, S2, S3, and S4) and distribute the data items (A, B, C, D, and E) using Consistent Hashing.
Step 2: Assign Data Items to Servers using Consistent Hashing
We use a hash function to map both the data items and the servers onto a hash ring, with each server associated with one or more points on the ring. Mapping a server to multiple points on the ring (known as virtual nodes) reduces the chance of a skewed data distribution across your servers.
Hash Ring:
S1 -> 10, 30, 50
S2 -> 20, 40, 60
S3 -> 70, 90, 110
S4 -> 80, 100, 120
Data Items:
A -> hash(A) = 25
B -> hash(B) = 55
C -> hash(C) = 85
D -> hash(D) = 115
E -> hash(E) = 35
Data Distribution:
S1 -> A
S2 -> B, E
S3 -> C
S4 -> D
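This placement can be verified with a small sketch. The ring points are copied verbatim from the example above rather than produced by a hash function, and the `owner` helper is an illustrative name for the clockwise lookup.

```python
import bisect

# Ring points and item positions taken from the example walkthrough.
RING = {"S1": [10, 30, 50], "S2": [20, 40, 60],
        "S3": [70, 90, 110], "S4": [80, 100, 120]}
ITEMS = {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}

def owner(ring, position):
    # The first server point clockwise at or after `position` owns the key.
    points = sorted((p, s) for s, ps in ring.items() for p in ps)
    idx = bisect.bisect_left([p for p, _ in points], position)
    return points[idx % len(points)][1]

placement = {item: owner(RING, pos) for item, pos in ITEMS.items()}
print(placement)
# {'A': 'S1', 'B': 'S2', 'C': 'S3', 'D': 'S4', 'E': 'S2'}
```

For example, A at position 25 walks clockwise to the first point at or after it, 30, which belongs to S1.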
Step 3: Handling Server Addition (Adding S5)
Let's add a new server, S5, to the system. To distribute the data efficiently, we place S5 on the hash ring as well.
Hash Ring (with S5):
S1 -> 10, 30, 50
S2 -> 20, 40, 60
S3 -> 70, 90, 110
S4 -> 80, 100, 120
S5 -> 55, 75, 105
Data Distribution after adding S5:
S1 -> A
S2 -> E
S3 -> C
S4 -> D
S5 -> B
As we can see, the addition of S5 affected only data item B, which is now mapped to S5. The rest of the data remains unaffected. This demonstrates the efficiency of Consistent Hashing in handling server additions with minimal data migration.
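A quick sketch confirms that only B is remapped when S5 joins. As before, the ring points are copied from the example and `owner` is an illustrative helper, not a library function.

```python
import bisect

RING = {"S1": [10, 30, 50], "S2": [20, 40, 60],
        "S3": [70, 90, 110], "S4": [80, 100, 120]}
ITEMS = {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}

def owner(ring, position):
    # The first server point clockwise at or after `position` owns the key.
    points = sorted((p, s) for s, ps in ring.items() for p in ps)
    idx = bisect.bisect_left([p for p, _ in points], position)
    return points[idx % len(points)][1]

before = {item: owner(RING, pos) for item, pos in ITEMS.items()}
RING["S5"] = [55, 75, 105]  # add the new server's points to the ring
after = {item: owner(RING, pos) for item, pos in ITEMS.items()}
moved = [k for k in ITEMS if before[k] != after[k]]
print(moved)  # ['B']  (only B moves, from S2 to S5)
```

B at position 55 now hits S5's point at 55 before reaching S2's point at 60; every other item's clockwise walk is unchanged.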
Step 4: Handling Server Removal (Removing S2)
Now, let's remove server S2 from the system and redistribute the data accordingly.
Hash Ring (after removing S2):
S1 -> 10, 30, 50
S3 -> 70, 90, 110
S4 -> 80, 100, 120
S5 -> 55, 75, 105
Data Distribution after removing S2:
S1 -> A, E
S5 -> B
S3 -> C
S4 -> D
The removal of S2 affected only the data item that was previously mapped to S2, namely E. Its position (35) now falls to the next point clockwise, 50, which belongs to the neighbouring server S1. The rest of the data remains unchanged.
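The final state can be checked the same way, with S2's points dropped from the ring (points copied from the example; `owner` is an illustrative helper):

```python
import bisect

# Ring after removing S2's points (20, 40, 60).
RING = {"S1": [10, 30, 50], "S3": [70, 90, 110],
        "S4": [80, 100, 120], "S5": [55, 75, 105]}
ITEMS = {"A": 25, "B": 55, "C": 85, "D": 115, "E": 35}

def owner(ring, position):
    # The first server point clockwise at or after `position` owns the key.
    points = sorted((p, s) for s, ps in ring.items() for p in ps)
    idx = bisect.bisect_left([p for p, _ in points], position)
    return points[idx % len(points)][1]

placement = {item: owner(RING, pos) for item, pos in ITEMS.items()}
print(placement)
# {'A': 'S1', 'B': 'S5', 'C': 'S3', 'D': 'S4', 'E': 'S1'}
```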
Advantages of Consistent Hashing
- Minimal data movement: adding or removing a server remaps only the keys that fall between the affected points, as the walkthrough above shows.
- Better load balancing: virtual nodes spread each server across the ring, reducing skew.
- Fault tolerance: when a server fails, its keys are taken over by the next servers clockwise instead of forcing a cluster-wide rehash.
Disadvantages of Consistent Hashing
While there are no major disadvantages to using Consistent Hashing, a few points are important to highlight, because consistent hashing reduces data distribution problems rather than eliminating them:
- Without enough virtual nodes, keys can still be distributed unevenly across servers.
- Virtual nodes add memory and bookkeeping overhead, since each server occupies many points on the ring.
- The ring balances keys, not load: a few very popular ("hot") keys can still overload a single server.
Thank you for reading! I’ll be posting weekly content on distributed systems & patterns, so please like, share and subscribe to this newsletter for notifications of new posts.
Please comment on the post with your feedback, it will help me improve! :)
Until next time, Keep asking questions & Keep learning!