Data Partitioning and Sharding - From Scratch
Shrey Batra
Founder @ Cosmocloud ◆ Ex-LinkedIn ◆ Angel Investor ◆ MongoDB Champion ◆ Book Author ◆ Patent Holder (Distributed Algorithm)
Most of the applications we build involve some type of database, be it Relational (SQL based) or Non Relational (NoSQL based). Whenever we build something, it works perfectly fine for a small dataset with limited Read and Write operations.
But what happens when your database has to handle "too much data"? What happens when it is serving very high Reads, Writes, or both? In this article we will see what exactly Horizontal Data Partitioning (or Sharding) is and why it is important.
Partitioning, sharding and scaling your systems are also very important topics for cracking interviews.
Horizontal Data Partitioning / Sharding
When the data in your database scales, there is a maximum upper limit to the Storage and CPU you can have on a single machine. Even if you get the most expensive VMs and servers, there will always be a limit to how much you can scale Vertically (scaling CPU / RAM / Storage). For this reason, and to bring down cost, we have Data Partitioning: we break down our single Table / Collection into smaller "chunks" or partitions and distribute them over multiple "smaller" machines.
So essentially, we break our single Table / Collection into "mutually exclusive" partitions and keep these chunks on separate machines, where they act like independent databases of their own. (Machine here can refer to VMs as well as container instances.)
Mutual Exclusion means one record (or row) can exist only on one Partition / Shard.
We break this single database table / collection on some particular attribute such as User ID, Time, Region / Location, etc. This means records having the same or similar values for that attribute reside together on the same partition / shard / chunk.
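Here is a minimal sketch of what picking a shard based on such an attribute can look like, assuming a simple hash-based scheme over User ID; the shard count and key names are made up for illustration:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard-key value (here, User ID) to a shard number.

    md5 gives a stable hash across processes, unlike Python's built-in
    hash(), which is randomised per interpreter run.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Records with the same User ID always land on the same shard.
for uid in ["user-1", "user-2", "user-1", "user-42"]:
    print(uid, "->", "shard", shard_for(uid))
```

Real databases hide this routing behind the driver or a query router, but the idea is the same: the shard key decides which partition owns each record.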
Why do we do this?
We divide our database / tables into smaller chunks for the following reasons -
Load Distribution
A database is partitioned when it needs to handle more Reads or Writes than a single, vertically scaled database can manage. We often think that if we increase the size of the VM or bring in a more powerful machine, we can handle more Reads and Writes. In practice, either the cost goes way up, or you hit the largest VM any hardware vendor can build. This is why we move to horizontal scaling / sharding and distribute the load across multiple nodes.
Scaling Query Performance
Let's say we have a table with 10 Million rows and we partition it across 10 different nodes, each holding 1 Million rows. Now (assuming a full collection scan) instead of a single system reading 10 Million records for each query, we have 10 different machines which each need to read only 1 Million rows. This allows us to quickly distribute the computation and get results back faster using "parallel computation".
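A rough sketch of this scatter-gather pattern is below; the "shards" are just in-memory lists scaled down for illustration, and threads stand in for separate machines (in a real deployment each shard runs on its own node, which is where the true parallelism comes from):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards": each holds a tenth of the rows
# (scaled down here so the sketch runs quickly).
shards = [list(range(i, 1_000_000, 10)) for i in range(10)]


def count_matching(shard: list[int]) -> int:
    """Each shard scans only its own slice instead of the full table."""
    return sum(1 for row in shard if row % 7 == 0)


# Scatter the query to all shards, then gather the partial results.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partial_counts = list(pool.map(count_matching, shards))

print("total matching rows:", sum(partial_counts))
```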
Localised and Targeted Queries
Let's say we have 10 Million rows containing some time-related data spanning the last 10 months (equally distributed). Now, if our partitioning condition is based on timestamp, we can partition our mega table into (say) 10 shards, each containing 1 month's data.
Now, if we have queries requesting data only for a certain time range or timestamp, our database knows exactly which shard / partition to hit, and only that node performs the computation, not every node in the system. This helps us minimise the computation (CPU, RAM) needed and can quickly bring down system overhead if done correctly.
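A minimal sketch of such targeted routing, assuming range-based partitioning where each shard owns one month (the shard names and dates are invented for the example):

```python
from datetime import date

# Hypothetical range-based partitioning: each shard owns one month of data.
shard_ranges = {
    "shard-01": (date(2023, 1, 1), date(2023, 2, 1)),
    "shard-02": (date(2023, 2, 1), date(2023, 3, 1)),
    "shard-03": (date(2023, 3, 1), date(2023, 4, 1)),
}


def shards_for_query(start: date, end: date) -> list[str]:
    """Return only the shards whose range overlaps the query window."""
    return [
        name
        for name, (lo, hi) in shard_ranges.items()
        if start < hi and end > lo
    ]


# A query for mid-February only touches shard-02; the other nodes stay idle.
print(shards_for_query(date(2023, 2, 10), date(2023, 2, 20)))
```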
Scaling Writes and Higher Throughput
Now, many of you might know that in a traditional "Master-Replica" setup, we have 1 Master node and multiple Replicas. All Writes into the system pass through the Master, and all Reads are distributed to the replicas.
A Replica is a "copy" of the database on another machine, used for high availability and failover purposes.
Taking this into account, let's shard / partition our Master node into (for example) 10 shards and distribute them over different nodes. Now, each "master shard" owns a "mutually exclusive" part of the data, and we can distribute our Writes across 10 different shards.
The system knows which partition / shard each record needs to be written to, and can target these Write operations to the specific owning shard. This helps the system distribute our load over multiple shards and decrease the overall overhead.
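To make that concrete, here is a small sketch of routing a Write to the primary that owns it, reusing the hash-based shard key from earlier; the node addresses and record shape are purely hypothetical:

```python
import hashlib

# Hypothetical topology: each shard has its own primary that accepts writes.
primaries = {
    0: "primary-node-a:27017",
    1: "primary-node-b:27017",
    2: "primary-node-c:27017",
}


def route_write(record: dict) -> str:
    """Pick the primary that owns this record's shard key."""
    key = str(record["user_id"]).encode("utf-8")
    shard_id = int(hashlib.md5(key).hexdigest(), 16) % len(primaries)
    return primaries[shard_id]


# Writes for different users fan out to different primaries,
# instead of all funnelling through a single master.
print(route_write({"user_id": 101, "action": "login"}))
print(route_write({"user_id": 202, "action": "purchase"}))
```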
High Availability
As we have seen, the total data now resides over multiple nodes / machines in our system and not on a single machine. So even if one node / shard / partition fails, we lose access to only a part of our data and not all of it. Our application can easily continue to work for the rest of our data records without any issue.
To improve this further, we can have "replicas" along with sharding / partitioning, which means every single shard now has its own replicas. Given a 1 master, 2 replica setup, if we shard our master into 10 shards, we have a total of 30 nodes in our system (10 shards × 3 copies each).
Now even if any of our primary (or master) shards fails, its replica shards keep serving Read requests and can easily take over as the Primary (or master) shard, thus increasing overall system availability.
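A toy sketch of that failover idea is below; real databases run an election / consensus protocol to pick the new primary, whereas this illustration simply promotes the first listed replica, and all node names are made up:

```python
# Hypothetical shard membership: 1 primary + 2 replicas per shard.
shard_members = {
    "shard-01": {"primary": "node-1a", "replicas": ["node-1b", "node-1c"]},
    "shard-02": {"primary": "node-2a", "replicas": ["node-2b", "node-2c"]},
}


def handle_primary_failure(shard: str) -> str:
    """Promote a replica to primary when the current primary goes down.

    Only this shard is affected; the other shards keep serving as before.
    """
    members = shard_members[shard]
    new_primary = members["replicas"].pop(0)  # pick the first healthy replica
    members["primary"] = new_primary
    return new_primary


print(handle_primary_failure("shard-01"))   # e.g. node-1b takes over
print(shard_members["shard-01"])
```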
Conclusion
Horizontal Data Partitioning / Sharding is a very important concept and is used in almost every production setup. Even if you have not worked directly with it yet, it is a very important topic for System Design Interviews. Every database has its own sharding techniques, principles and mechanisms, which I will cover in a separate post.
Apart from this, if you are interested in learning how to crack System Design Interviews, check out my upcoming sessions here - https://shreybatra.in/courses (if seats are full, reach out via DM).