Sharding: Architecture Pattern

Pratik Pandey

Senior Software Engineer at Booking.com | AWS Serverless Community Builder | pratikpandey.substack.com

Published: July 19, 2023

Scalability is a core tenet of how we design and build systems, applications, and infrastructure. Scaling is a given in today’s world of distributed systems, and while we can scale our services easily (assuming they’re stateless!), the same cannot be said for stateful systems like data-stores. In this article, we’ll delve into one of the common ways to horizontally scale stateful systems!

Sharding

Sharding is a technique used to horizontally partition a data-store into smaller, more manageable fragments called shards, which are distributed across multiple servers or nodes. This allows us to scale our data-stores not only in terms of storage, but also in terms of compute, as the queries and operations on each node touch only a subset of the data, i.e. a shard.

Sharding Techniques

The choice of sharding approach depends on factors such as the nature of the data, access patterns, scalability requirements, and the specific characteristics of the system. Here are some common sharding techniques:

Range-Based Sharding: Range-based sharding involves partitioning data based on a specific range of values within a chosen attribute. For example, data can be partitioned based on the range of customer IDs or timestamps. This approach allows for efficient querying of contiguous ranges of data but may lead to data skew if the distribution of values is uneven.
Hash-Based Sharding: Hash-based sharding involves applying a hash function to a selected attribute to determine the shard assignment for each data item. A good hash function spreads keys roughly uniformly across shards, which evens out the workload. This makes load balancing simple, but related keys end up scattered across shards, so range queries can turn into cross-shard queries, and changing the number of shards remaps most keys unless a scheme such as consistent hashing is used.
Composite Sharding: Composite sharding involves combining multiple sharding techniques to partition data. This approach is useful when a single sharding strategy may not be sufficient to handle the complexity or size of the data. For example, a composite sharding approach might involve range-based sharding based on a primary attribute and then using hash-based sharding within each range.
Geo Sharding: In geo sharding, the data is divided into shards based on geographic boundaries, such as countries, states, cities, or specific spatial regions. Each shard is responsible for storing and managing data associated with a particular geographic area.
Directory-Based Sharding: Directory-based sharding involves maintaining a directory or mapping table that associates data items or keys with their respective shards. The directory maps data to specific shards based on predefined rules or lookup tables. This approach provides flexibility in managing data placement and allows for dynamic reassignment of data but introduces additional lookup overhead.
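
To make the routing rules above concrete, here is a minimal Python sketch of how a key could be mapped to a shard under the range-based, hash-based, composite, and directory-based approaches. The shard count, range boundaries, tenant names, and function names are all assumptions made up for illustration; real data-stores implement these rules internally and differently.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

# Range-based: each entry is the exclusive upper bound of a shard's range.
# Shard 0 holds ids < 1000, shard 1 holds 1000-1999, shard 2 holds 2000-2999,
# and shard 3 holds everything from 3000 upwards.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]


def range_shard(customer_id: int) -> int:
    """Return the index of the shard whose range contains customer_id."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, customer_id)


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard using a stable hash.

    hashlib is used instead of the built-in hash(), because hash() on
    strings is randomised per process and would route inconsistently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def composite_shard(customer_id: int, order_id: str, buckets: int = 2):
    """Composite: pick a range first, then a hash bucket inside that range."""
    return (range_shard(customer_id), hash_shard(order_id, buckets))


# Directory-based: an explicit lookup table (hypothetical tenants).
DIRECTORY = {"tenant-a": 0, "tenant-b": 2}


def directory_shard(tenant: str) -> int:
    """Look the shard up in the directory; moving a tenant is a table update."""
    return DIRECTORY[tenant]
```

Note the trade-off visible even in this toy version: the range and hash functions are pure computations, while the directory needs a lookup against state that has to be stored and kept available somewhere.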

Figure: Sharded System

Advantages of Sharding

Scalability: The primary reason why sharding is needed is to achieve horizontal scalability. As the volume of data and the number of users accessing a database increase, a single server may struggle to handle the load. Sharding allows us to distribute the data and workload across multiple servers, enabling parallel processing and improving overall performance.
Performance: Sharding can significantly enhance the performance of a database. By partitioning data and distributing it across multiple servers, read and write operations can be executed in parallel. This leads to reduced latency and improved response times, ensuring a seamless user experience even during peak load times.
Avoiding Complete Outage: In a sharded database setup, if one shard or server fails, the remaining shards can continue to serve data and keep the system operational. This limits the blast radius in case of any issues.
Availability: Most sharded setups use replication in conjunction with sharding, which means your data will be available on another server, increasing availability even in case of shard/server failures.

Complexities in Sharding

Instead of going into the disadvantages of sharding, I feel it’s better if we cover the complexities associated with it. The reason is that I think sharding is needed for large-scale data stores, and hence it’s important that we realise the challenges that come with sharding!

Data Distribution: Determining how to distribute data across shards can be complex. It requires careful consideration of factors such as data size, access patterns, and growth projections. Uneven data distribution can result in hotspots, where certain shards are overloaded with requests while others remain underutilised. This is where choosing the Right Shard Key is critical.
Shard Management: Sharding introduces the complexity of managing and monitoring multiple shards. Adding or removing shards dynamically to accommodate changing workload patterns requires careful planning and execution. It involves tasks like data rebalancing, shard provisioning, and load balancing.
Query Routing: You either need a routing layer to route your queries to the right shard, or you need to make your applications aware of the shards (not really recommended). The routing layer introduces extra complexity to the system.
Data Integrity and Joins: Maintaining data integrity and supporting join operations pose challenges in sharded databases. With data spread across multiple shards, enforcing referential integrity constraints or performing cross-shard joins becomes non-trivial. Ideally, you should avoid cross-shard operations as much as possible and define your data models accordingly!
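
The query routing and cross-shard points above can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions: `Shard` and `ShardRouter` are hypothetical names, the shards are plain dictionaries rather than real databases, and routing uses a simple hash-modulo rule.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """In-memory stand-in for one database shard (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class ShardRouter:
    """Routing layer: the application talks to this, never to a shard."""

    def __init__(self, shards):
        self.shards = shards

    def _shard_for(self, key):
        # Stable hash of the shard key, reduced modulo the shard count.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key).rows[key] = value

    def get(self, key):
        # Single-key lookup: exactly one shard is contacted.
        return self._shard_for(key).rows.get(key)

    def scan(self, predicate):
        # Cross-shard query: fan out to every shard, then merge the results.
        def scan_one(shard):
            return [(k, v) for k, v in shard.rows.items() if predicate(v)]

        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            parts = list(pool.map(scan_one, self.shards))
        return sorted(row for part in parts for row in part)
```

A lookup by shard key touches one shard, while `scan` has to contact all of them and wait for the slowest one, which is why data models that keep queries on a single shard tend to behave better.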

Overcoming Sharding Complexities

While sharding can be complicated, you can leverage the following tips to overcome its complexities!

Careful Data Modelling: Thoroughly understanding the data and its access patterns is essential for effective sharding. Properly analyzing and modelling the data can help determine the most suitable partitioning strategy, ensuring an even distribution of data across shards.
Advanced Data Distribution Algorithms: Employing algorithms like consistent hashing or range partitioning can help achieve balanced data distribution. Consistent hashing in particular lets you scale out your cluster with minimal data movement!
Monitoring and Automation: Implementing robust monitoring tools and automation systems can simplify shard management tasks. These tools can provide insights into shard performance, identify bottlenecks, and automate routine operations like shard provisioning and rebalancing.
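
To show what "minimal data movement" means for consistent hashing, here is a minimal Python sketch of a hash ring with virtual nodes. The class name, the number of virtual nodes, and the helper `moved_keys` are assumptions for illustration; production implementations add replication, weighting, and collision handling that this version leaves out.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes      # virtual nodes per physical node
        self._positions = []      # sorted ring positions
        self._owners = {}         # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed at many positions to smooth out the distribution.
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            self._owners[pos] = node
            bisect.insort(self._positions, pos)

    def get_node(self, key):
        # Walk clockwise to the first position after the key, wrapping around.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._positions)
        return self._owners[self._positions[idx]]


def moved_keys(ring, keys, new_node):
    """Add new_node to the ring and return the keys whose owner changed."""
    before = {k: ring.get_node(k) for k in keys}
    ring.add_node(new_node)
    return [k for k in keys if ring.get_node(k) != before[k]]
```

Going from three nodes to four with this ring, only the keys that now belong to the new node move, which is about a quarter of them in expectation. With plain `hash(key) % N`, the same change would remap roughly three quarters of the keys.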

Choosing the Right Shard Key

The shard key determines how data is partitioned and distributed across shards. The choice of shard key can significantly impact the performance, scalability, and efficiency of the sharded system. Here are some considerations to help choose the right shard key:

Cardinality: The ideal shard key should have high cardinality, meaning it should have a large number of unique values. A shard key with low cardinality may result in data imbalance, where some shards receive significantly more data than others. High cardinality allows for even data distribution and balanced workloads across shards.
Data Distribution: Analyze the data access patterns and distribution characteristics of the dataset. The shard key should align with the natural data distribution to achieve an even distribution of data across shards. Consider the properties of the data that are frequently accessed together and ensure they are colocated within the same shard.
Query Patterns: Understand the common types of queries performed on the data. The shard key should align with the query patterns to minimise cross-shard queries. If a specific attribute or range of values is frequently used in queries, it may be a good candidate for the shard key to enable localised querying.
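
One way to sanity-check the cardinality point is to measure how evenly a candidate shard key spreads a sample of your data. The Python sketch below does that for a synthetic dataset; the field names, the 90/10 country split, and the `skew` metric are assumptions chosen for illustration, not a standard measure.

```python
import hashlib
from collections import Counter


def shard_of(value, num_shards):
    """Hash a shard key value onto one of num_shards shards."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def skew(records, key_field, num_shards):
    """Rows on the fullest shard divided by the ideal even share.

    1.0 means perfectly even; num_shards means everything sits on one shard.
    """
    counts = Counter(shard_of(r[key_field], num_shards) for r in records)
    ideal = len(records) / num_shards
    return max(counts.values()) / ideal


# Synthetic sample: user_id is unique per row, country has only two values
# and 90% of the rows share one of them.
records = [
    {"user_id": i, "country": "NL" if i % 10 == 0 else "US"}
    for i in range(10_000)
]

if __name__ == "__main__":
    print("user_id skew:", round(skew(records, "user_id", 8), 2))
    print("country skew:", round(skew(records, "country", 8), 2))
```

The high-cardinality `user_id` key lands close to 1.0, while the low-cardinality `country` key piles at least 90% of the rows onto a single shard, no matter how good the hash function is.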

Sharding presents a powerful solution to tackle the scalability limitations of traditional databases. By distributing data across multiple shards, it enables improved performance, availability, and fault tolerance. However, sharding also introduces complexities in data distribution, consistency, and management. Through careful planning, advanced algorithms, and the use of appropriate tools and technologies, these challenges can be overcome, allowing you to build a scalable, performant data-store.

Thank you for reading! I’ll be posting weekly content on distributed systems & patterns, so please like, share and subscribe to this newsletter for notifications of new posts.

Please comment on the post with your feedback; it will help me improve! :)

Until next time, Keep asking questions & Keep learning!

Distributed Systems Made Easy

7,879 followers

Roopali Neeraj

Software Development Engineer II at Amazon

1 year ago

Very interesting read! What can be used for a Shard Query Router? Is DynamoDB good for maintaining such a partition-map?

1 reply

Pratik Pandey

Senior Software Engineer at Booking.com | AWS Serverless Community Builder | pratikpandey.substack.com

1 year ago

Subscribe to my LinkedIn newsletter to get updates on any new System design posts - https://www.dhirubhai.net/newsletters/system-design-patterns-6937319059256397824/
You can follow me on:
Medium - https://distributedsystemsmadeeasy.medium.com/subscribe
Substack - https://pratikpandey.substack.com/

Balaji Kalyansundaram

Building Better Home: A one-stop platform to buy materials to build or renovate your home. Transforming the shopping experience of original products of reputed brands at wholesale prices. Raising funds.

1 year ago

Well written, Pratik Pandey. "Blast radius" is a new term I learnt tonight.

1 reply
