Horizontal Partitioning, Scaling and Sharding - with Real Examples

Horizontal Partitioning, Scaling and Sharding - with Real Examples

Everyone has heard about what horizontal partitioning is and what is sharding, but its still a problem people don't really understand properly. Most of the people think sharding and replication help scalability. Probably true, but how? Let's take a deep look into it.

Your Likes for this article would go a long way for me to reach out new folks..! ??

What is Horizontal Partitioning?

Breaking something down into smaller chunks of the same entity, is called horizontal partitioning.

No alt text provided for this image

Each partition are responsible for a subset of the data being processed or stored by the main entity. Let's look at some examples -

Horizontally Partitioning an SQL Table

Be it MySQL or PostgreSQL, in SQL based databases, we have tables..! To partition each table (a single entity) we break it down into multiple smaller tables and disburse them in separate running instances/machines.

Let's say our main table had information of 100 employees, each with ID from 1-100. Let's partition (break) our table based on IDs into 10 equal partitions - from 1-10, 11-20, 21-30 and so on...

Horizontally Partitioning a Kafka Topic

Kafka, along with any other Queue like system, can also partition its queue (a Kafka Topic). Rather than asking every producer to push to the same queue, and every consumer to read from the same queue, we can partition (break) the topic/queue into multiple smaller topics.

Just an FYI, a topic behaves differently from a queue, but just taking it into a similar anology for now.

Now, how does Partitioning in Kafka works? There is a Hash Function -- which computes where does each message in Kafka go, in which partition? Let's say we partition based on Msg Key -- an identifier based on producer ID. Now if we have 100 producers, we can equally distribute all producers to let's say, 5 smaller queues/topics or partitions.

Each partition can have a separate consumer, which can help speed up consumption because of parallelisation.

Horizontally Partitioning a Backend API Service

Let's say you have 1 instance of your API server running and all 1 Million requests goes through that same instance. Think about the load that server would be facing..!!

No alt text provided for this image

Now, how can we distribute this load of requests? We can partition our requests and spin up 100 instances of API servers where each instance handles only 10,000 requests now.

Basically, now you should understand what is horizontal partitioning.

What is Sharding?

Sharding is similar to horizontal partitioning of data, but makes sure that that each partition is actually having a separate CPU and Memory allocated to it, as well as it can live as a separate instance of the larger entity.

No alt text provided for this image

That means, if we break the table into multiple smaller tables, each table is placed on a separate database server altogether, be it on the same machine (containerised) or on separate machines (always preferable).

So what is Horizontal Scaling?

As you might have guessed, increasing or decreasing the number of partitions/shards in your system is known as horizontal scaling..! Remember, the whole data of one partition is to be transferred over to another machine over network, so these operations could be very costly..!

How is Replication different?

Replication should not be talked about when you are talking about Scaling (other than very limited scenarios). As Paritioning/Sharding are related to scalaing, replication is related to availability. It is about making copies of original data and spreading it on different machines, so that if the main instance goes down, a copy can takeover as the primary instance.

No alt text provided for this image

Can you have both Sharding and Replication?

Yes, why not? You can replicate the larger table, as well as smaller tables..! Copying data is independent of partitioning logic and should be kept so (in most general cases).

How does Replication help in Scalability then?

I said other than very limited scenarios above, because replication can help scale the system in some cases. Think about read heavy scenarios in MySQL where you have your replicas standing still and wasting cost. Why not send your reads over to your replica for spreading load? If you have 3 replicas of each shard/partition/table, you are spreading your read load as 1/3 of original..!!

I'll be hosting a dedicated Live Session to make you understand how to use Sharding and Replication in your System Design interviews..!! For that, join the waitlist on Eazy Develop now..!

Conclusion

You should understand the main different between Horizontal Partitioning, Sharding and Replication..!! We'll talk about Vertical Partitioning later on..!

If you like this short, sweet story - do Like and Share this article on LinkedIn ??
Sakalya Deshpande

Senior Software Engineer | Machine Learning Infrastructure | Scalability | Distributed Systems | Site Reliability Engineering | Platform Engineering | Rust | Python | Ex-Nvidia | Ex-Tiktok

2 年

Most of the times "Partition" and "Shard" are used interchangeably. Each data store has its own naming convention for it. For example, MongoDb, ElasticSearch use "Shard", Cassandra uses "Vnode", Hbase uses "Region" and so on.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了