Architecture Talk-1: MongoDB - Sharding Architecture
Karthikeyan Thanikachalam
Aspiring Head of Data & AI Platform | Author | Generative AI Evangelist| Senior Data Architect | Cloud Migration Specialist | Cloud Certified Professional - 5x | Teradata Vantage | GCP | Azure | AWS | GenAI | AI & ML
In MongoDB, a sharded cluster consists of:
A shard is a replica set that contains a subset of the cluster’s data.
The mongos acts as a query router for client applications, handling both read and write operations. It dispatches each client request to the relevant shards and merges the results from those shards into a single response for the client. Clients connect to a mongos, not to individual shards.
Config servers are the authoritative source of sharding metadata. The sharding metadata reflects the state and organization of the sharded data. The metadata contains the list of sharded collections, routing information, etc.
In its simplest configuration (a single shard), a sharded cluster will look like this:
Sharding Benefits
Sharding lets you scale your database horizontally to handle growing load: adding shards increases both read/write throughput and storage capacity. Let’s look at each of those in a little more detail:
Data Distribution
Shard Key
MongoDB shards at the collection level. You choose which collection(s) you want to shard. MongoDB uses the shard key to distribute a collection’s documents across shards. MongoDB splits the data into “chunks”, by dividing the span of shard key values into non-overlapping ranges. MongoDB then attempts to distribute those chunks evenly among the shards in the cluster.
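To make the chunk idea concrete, here is a minimal sketch in plain Python, not MongoDB code. It assumes an integer shard key, and the boundary values, the shard names and the round-robin assignment are all hypothetical choices made for illustration. Each chunk covers a half-open range of shard key values, so every key belongs to exactly one chunk.

```python
from bisect import bisect_right

# Hypothetical chunk boundaries over an integer shard key.
# Chunk i covers BOUNDS[i] <= key < BOUNDS[i + 1]; the infinities
# play the role of the lowest and highest possible key values.
BOUNDS = [float("-inf"), 100, 200, 300, float("inf")]
SHARDS = ["shardA", "shardB"]  # hypothetical shard names

# Assumption: chunks are handed out round-robin so that each shard
# ends up owning a similar number of chunks.
CHUNK_OWNER = [SHARDS[i % len(SHARDS)] for i in range(len(BOUNDS) - 1)]


def chunk_for(key):
    """Return the index of the chunk whose range contains key."""
    return bisect_right(BOUNDS, key) - 1


def shard_for(key):
    """Return the shard that owns the chunk containing key."""
    return CHUNK_OWNER[chunk_for(key)]
```

Because the ranges do not overlap, a binary search over the sorted boundaries is enough to find the single chunk, and therefore the single shard, for any document.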
Shard keys are based on fields inside each document. The values of those fields determine which shard a document resides on, according to the chunk ranges and how the chunks are assigned to shards. This chunk metadata is stored in the config server replica set.
The shard key has a direct impact on the cluster’s performance and should be chosen carefully. A suboptimal shard key can lead to performance or scaling issues due to uneven chunk distribution. The choice is not necessarily permanent: MongoDB 4.4 and later can refine a shard key by adding suffix fields, and MongoDB 5.0 and later can reshard a collection onto a new shard key. Refer to the MongoDB documentation on choosing a shard key for guidance.
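A background process known as the “balancer” automatically migrates chunks between shards to keep each sharded collection’s data evenly distributed across the cluster. The distribution converges toward even over time; the shards are not guaranteed to hold identical amounts at every moment.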
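The balancing idea can be sketched as a simple loop. This is a toy model in plain Python, not the real balancer: it counts chunks per shard for simplicity, and the function name `balance`, the `threshold` parameter and the shard names are assumptions made for illustration. The real balancer's migration criteria and thresholds depend on the MongoDB version and are not modelled here.

```python
def balance(chunks_per_shard, threshold=1):
    """Move one chunk at a time from the fullest shard to the emptiest
    until their chunk counts differ by at most threshold.

    Returns (final_counts, migrations), where migrations is a list of
    (source_shard, destination_shard) pairs in the order performed.
    """
    if threshold < 1:
        # With a threshold of 0 an odd total could bounce back and forth forever.
        raise ValueError("threshold must be at least 1")
    counts = dict(chunks_per_shard)  # work on a copy
    migrations = []
    while True:
        src = max(counts, key=counts.get)  # shard with the most chunks
        dst = min(counts, key=counts.get)  # shard with the fewest chunks
        if counts[src] - counts[dst] <= threshold:
            return counts, migrations
        counts[src] -= 1
        counts[dst] += 1
        migrations.append((src, dst))


final_counts, moves = balance({"shardA": 8, "shardB": 2, "shardC": 2})
```

In this example the overloaded shard gives up chunks one migration at a time until all three shards hold the same number, which mirrors the gradual, background nature of balancing.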
Sharding Strategy
MongoDB supports two sharding strategies for distributing data across sharded clusters:
Ranged sharding divides data into ranges based on the shard key values. Each chunk is then assigned a range based on the shard key values.
Shard key values that are “close” to each other are more likely to reside in the same chunk. This allows for targeted operations, as a mongos can route an operation to only the shards that contain the required data.
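The targeting behaviour can be illustrated with a small routing sketch in plain Python. The chunk boundaries, the chunk-to-shard table and the function name `shards_for_range` are hypothetical; a real mongos relies on the metadata held by the config servers. The point is only that a query over a narrow span of shard key values touches few chunks, and so few shards.

```python
from bisect import bisect_left, bisect_right

# Hypothetical routing table: chunk i covers
# CHUNK_BOUNDS[i] <= key < CHUNK_BOUNDS[i + 1] and lives on OWNERS[i].
CHUNK_BOUNDS = [float("-inf"), 100, 200, 300, float("inf")]
OWNERS = ["shardA", "shardB", "shardC", "shardC"]


def shards_for_range(lo, hi):
    """Return the shards owning any chunk that overlaps lo <= key < hi.

    Assumes lo < hi.
    """
    first = bisect_right(CHUNK_BOUNDS, lo) - 1  # chunk containing lo
    last = bisect_left(CHUNK_BOUNDS, hi) - 1    # last chunk starting below hi
    return sorted(set(OWNERS[first:last + 1]))


narrow = shards_for_range(120, 180)  # falls inside a single chunk
wide = shards_for_range(0, 1000)     # spans every chunk
```

Here the narrow query is sent to one shard, while the wide query has to be sent to all of them.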
Hashed sharding involves computing a hash of the shard key field’s value. Each chunk is then assigned a range based on the hashed shard key values.
While a range of shard keys may be “close”, their hashed values are unlikely to be on the same chunk. Data distribution based on hashed values facilitates more even data distribution, especially in data sets where the shard key changes monotonically. However, hashed sharding does not provide efficient range-based operations.
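The trade-off between the two strategies can be seen in a small simulation, written in plain Python under stated assumptions: `toy_hash` uses SHA-256 as a stand-in and is not MongoDB's actual hash function, and the chunk count and boundaries are hypothetical. The sketch inserts monotonically increasing keys and counts how many land in each chunk under each strategy.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_CHUNKS = 4
HASH_SPACE = 2 ** 64


def toy_hash(key):
    """Map a key to a 64-bit integer (stand-in hash, not MongoDB's)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


# Ranged sharding: hypothetical boundaries chosen while keys were below 1000.
RANGE_BOUNDS = [float("-inf"), 250, 500, 750, float("inf")]


def ranged_chunk(key):
    """Chunk index under ranged sharding."""
    return bisect_right(RANGE_BOUNDS, key) - 1


def hashed_chunk(key):
    """Chunk index under hashed sharding: equal slices of the hash space."""
    return toy_hash(key) * NUM_CHUNKS // HASH_SPACE


# New inserts carry ever-increasing key values, e.g. a counter or timestamp.
new_keys = range(1000, 2000)
ranged_counts = Counter(ranged_chunk(k) for k in new_keys)
hashed_counts = Counter(hashed_chunk(k) for k in new_keys)

# Chunks touched by a short range of adjacent keys under each strategy.
ranged_touched = {ranged_chunk(k) for k in range(1000, 1020)}
hashed_touched = {hashed_chunk(k) for k in range(1000, 1020)}
```

Under ranged sharding every new key falls into the last chunk, so one shard absorbs all the writes. Under hashed sharding the same keys spread across all chunks, but twenty adjacent keys also scatter across several chunks, which is why a range query can no longer be targeted at a single shard.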