Boosting Database Efficiency: The Power of Partitioning and Sharding

Introduction

Partitioning and sharding are essential techniques for managing large databases. They address the need for improved performance, scalability, and availability by distributing data across multiple servers or splitting large tables into smaller, more manageable units.

Why Partitioning and Sharding?

1. Large Databases: Distributing extensive datasets across multiple servers enhances performance by allowing queries to target smaller datasets.

2. Large Tables: Techniques are needed to enhance query performance on large tables by scanning smaller segments.

3. Wide Tables: Some tables have numerous columns, and not all are accessed in every query. We need methods to optimize these queries.

4. Scalability and Availability: Enhancing the scalability, availability, and performance of the database is crucial.

Techniques for Improvement

To address these needs, options include:

  • Indexing: Improves query performance.
  • CQRS (Command Query Responsibility Segregation): Splits read and write operations.
  • Partitioning and Sharding: Distribute data into smaller units so that each query touches less of it.

Single Server Partitioning

In this technique, a large table is split into smaller ones on the same server. The split can be horizontal, where rows are divided among multiple tables that each hold a smaller volume of data, or vertical, where a table with many columns is divided into tables with fewer columns. Some databases, such as PostgreSQL, support partitioning natively.


Advantages

  • Performance: Enhances query execution time by limiting the scanned dataset.
  • Simplicity: Easier to implement.

Disadvantages

  • Scalability: Limited by the maximum resources of the server (vertical scaling).
  • Single point of failure: The entire dataset lives on one server, so that server is a single point of failure. With multiple servers partitioning, by contrast, the dataset is distributed across servers: if one server goes down, only its part of the data becomes inaccessible, and the rest of the system keeps working.

Multiple Servers Partitioning

This technique distributes a large database across multiple servers, with each server holding a part of the database.


Advantages

  • Scalability: To scale the database, more servers are added (horizontal scaling), so scaling is not limited by the maximum resources of one server, as it is with single server partitioning.
  • Performance: Queries execute faster because each one scans a smaller dataset, and the query load is distributed across all servers.

Disadvantages

  • Complexity: Increased complexity in application design.
  • Consistency: Maintaining data consistency is challenging.
  • Latency: Accessing multiple servers can increase latency.

Horizontal Partitioning

Data is partitioned at the row level, with each group of rows referred to as a partition. This can occur on a single server or across multiple servers.


Types of horizontal partitioning

  • Range partitioning: The data is partitioned based on ranges of values.

Example:

Consider a chat application with a large number of messages, where most users access only the most recent ones. We can partition on the sent_at (date) column with a range of one month, so each message is routed to a specific partition according to its sent_at value.
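
The monthly routing described in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for the real database, and the messages_YYYY_MM table naming and the helper functions are hypothetical, not something the example prescribes.

```python
import sqlite3
from datetime import datetime


def partition_name(sent_at: datetime) -> str:
    """Return the monthly partition table for a timestamp, e.g. messages_2024_03."""
    return f"messages_{sent_at.year:04d}_{sent_at.month:02d}"


def insert_message(conn: sqlite3.Connection, sent_at: datetime, body: str) -> None:
    """Route a message to the partition that covers its sent_at month."""
    table = partition_name(sent_at)
    # The table name is built only from integers, so formatting it into SQL is safe here.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (sent_at TEXT, body TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (sent_at.isoformat(), body))


def messages_in_month(conn: sqlite3.Connection, year: int, month: int) -> list[str]:
    """Read one month of messages by scanning only that month's partition."""
    table = partition_name(datetime(year, month, 1))
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        return []  # no message was ever sent in that month
    rows = conn.execute(f"SELECT body FROM {table} ORDER BY sent_at")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
insert_message(conn, datetime(2024, 2, 27, 9, 0), "old message")
insert_message(conn, datetime(2024, 3, 1, 8, 30), "hello")
insert_message(conn, datetime(2024, 3, 15, 18, 5), "see you soon")

print(messages_in_month(conn, 2024, 3))  # ['hello', 'see you soon']
```

A query for recent messages touches only the newest partition, regardless of how many older months have accumulated. A database with native partitioning would typically do this routing itself.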

  • Hash partitioning: The data is partitioned by applying a hash function to a key column, which typically spreads rows roughly evenly across partitions.
  • List partitioning: The data is partitioned based on lists of values.

Example:

In an application like Uber, with a large number of users and drivers in many countries, list partitioning can be used to partition users by country.
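
As a rough sketch of how hash and list partitioning choose a partition, the Python below maps a key to a partition in each style. The partition count, the country groupings, and the partition names are assumptions made for illustration only.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed partition count for the hash example


def hash_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash partitioning: the same key always maps to the same partition index."""
    # A stable hash is used instead of Python's built-in hash(), whose value
    # for strings changes between interpreter runs.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# List partitioning: each partition owns an explicit list of values.
# These country groupings and partition names are hypothetical.
COUNTRY_PARTITIONS = {
    "EG": "users_mena",
    "SA": "users_mena",
    "US": "users_americas",
    "BR": "users_americas",
}
DEFAULT_PARTITION = "users_other"  # catch-all for countries not listed above


def list_partition(country_code: str) -> str:
    """Return the partition that owns the given country code."""
    return COUNTRY_PARTITIONS.get(country_code.upper(), DEFAULT_PARTITION)


print(hash_partition("user-42"))  # an index between 0 and NUM_PARTITIONS - 1
print(list_partition("EG"))       # users_mena
print(list_partition("JP"))       # users_other
```

The hash version needs no knowledge of the data's values, while the list version keeps each country's users together at the cost of maintaining the mapping by hand.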

Vertical Partitioning

In this technique, a large table with many columns is split into smaller tables, each holding a subset of the columns, grouped according to which columns the queries access together. This technique is commonly used on a single server.
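
A minimal sketch of the idea, assuming a hypothetical users table whose frequently read columns (name, email) are separated from rarely read ones (bio, preferences). SQLite is used only so the example runs on its own; the table and function names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Frequently accessed columns stay in a narrow table.
    CREATE TABLE users_core (
        user_id INTEGER PRIMARY KEY,
        name TEXT,
        email TEXT
    );
    -- Rarely accessed columns move to a second table that shares the key.
    CREATE TABLE users_extra (
        user_id INTEGER PRIMARY KEY REFERENCES users_core(user_id),
        bio TEXT,
        preferences TEXT
    );
    """
)
conn.execute("INSERT INTO users_core VALUES (?, ?, ?)", (1, "Mona", "mona@example.com"))
conn.execute("INSERT INTO users_extra VALUES (?, ?, ?)", (1, "Backend engineer", "theme=dark"))


def get_contact(conn: sqlite3.Connection, user_id: int):
    """Common query: reads only the narrow table."""
    return conn.execute(
        "SELECT name, email FROM users_core WHERE user_id = ?", (user_id,)
    ).fetchone()


def get_full_profile(conn: sqlite3.Connection, user_id: int):
    """Rare query: joins both parts back together on the shared key."""
    return conn.execute(
        """
        SELECT c.name, c.email, e.bio, e.preferences
        FROM users_core AS c
        JOIN users_extra AS e ON e.user_id = c.user_id
        WHERE c.user_id = ?
        """,
        (user_id,),
    ).fetchone()


print(get_contact(conn, 1))       # ('Mona', 'mona@example.com')
print(get_full_profile(conn, 1))  # ('Mona', 'mona@example.com', 'Backend engineer', 'theme=dark')
```

The trade-off is that queries needing every column now pay for a join, so the column grouping should follow the actual access patterns.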


Sharding

Sharding is a type of multiple servers partitioning in which the data is partitioned horizontally and distributed across servers, with each server having the same database schema. Some databases, for example MongoDB, support sharding natively.
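
To make the "same schema on every server" idea concrete, here is a small application-level sketch in Python. Each in-memory SQLite database stands in for one server, and the orders table, the customer_id shard key, and the helper names are all assumptions for illustration, not a description of how any particular database implements sharding.

```python
import hashlib
import sqlite3

# Every shard is created with the same schema.
SCHEMA = "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id TEXT, total_cents INTEGER)"


def make_shards(count: int) -> list[sqlite3.Connection]:
    """Create `count` independent databases, each standing in for one server."""
    shards = []
    for _ in range(count):
        conn = sqlite3.connect(":memory:")
        conn.execute(SCHEMA)
        shards.append(conn)
    return shards


def shard_for(shards: list[sqlite3.Connection], customer_id: str) -> sqlite3.Connection:
    """Pick the shard that owns a customer by hashing the shard key."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def add_order(shards, order_id: int, customer_id: str, total_cents: int) -> None:
    shard_for(shards, customer_id).execute(
        "INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total_cents)
    )


def orders_for_customer(shards, customer_id: str):
    """Single-shard query: all of a customer's rows live on one shard."""
    return shard_for(shards, customer_id).execute(
        "SELECT order_id, total_cents FROM orders WHERE customer_id = ? ORDER BY order_id",
        (customer_id,),
    ).fetchall()


def total_revenue(shards) -> int:
    """Cross-shard query: every shard must be asked and the results combined."""
    return sum(
        conn.execute("SELECT COALESCE(SUM(total_cents), 0) FROM orders").fetchone()[0]
        for conn in shards
    )


shards = make_shards(3)
add_order(shards, 1, "customer-a", 1000)
add_order(shards, 2, "customer-b", 2500)
add_order(shards, 3, "customer-a", 500)

print(orders_for_customer(shards, "customer-a"))  # [(1, 1000), (3, 500)]
print(total_revenue(shards))                      # 4000
```

Queries that include the shard key go to one server, while queries that do not (like total_revenue here) must visit every shard, which is where the extra latency of multi-server setups comes from.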

Challenges in implementing sharding

  1. The sharding key determines how data is distributed across servers, so it must be chosen carefully. With a poorly chosen key, related data ends up on different servers, which makes joins difficult and hurts performance. The key should keep related data, or most of it, on the same server. For example, if customer ID is the sharding key, all data for a given customer should be on the same server. Another example of a sharding key is geographic location.
  2. A poorly chosen sharding key can also leave some servers with a large volume of data and others with little. Performance suffers because many queries are routed to the servers holding the most data, and each of those queries scans a large dataset. If this happens, the data must be rebalanced across servers, which is complex and costly.
  3. The routing logic must be considered. It can be implemented in the application or handled by external tools such as Spock Proxy for MySQL.
  4. Distributed transactions must be handled, and data consistency must be guaranteed across servers.
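
The first two challenges can be illustrated with a short Python experiment. It is a sketch on synthetic data: the 1,000 customers, the 70/20/10 split across three countries, and the simple hash-modulo routing are all assumptions chosen to make the effect visible, not measurements from a real system.

```python
import hashlib
from collections import Counter


def shard_index(key: str, num_shards: int) -> int:
    """Hash-modulo routing: map a shard-key value to a shard number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def rows_per_shard(keys: list[str], num_shards: int) -> list[int]:
    """Count how many rows each shard would receive for the given keys."""
    counts = Counter(shard_index(key, num_shards) for key in keys)
    return [counts.get(i, 0) for i in range(num_shards)]


# Synthetic customers: 70% in "US", 20% in "EG", 10% in "BR" (assumed split).
customers = [
    (f"customer-{i}", "US" if i % 10 < 7 else "EG" if i % 10 < 9 else "BR")
    for i in range(1000)
]

# Many distinct key values: rows spread fairly evenly.
by_customer = rows_per_shard([cid for cid, _ in customers], 4)
# Few distinct key values with uneven sizes: one shard gets most of the rows.
by_country = rows_per_shard([country for _, country in customers], 4)

# Rebalancing cost: with plain modulo routing, adding a fifth shard
# changes the owner of most keys, so most rows would have to move.
moved = sum(1 for cid, _ in customers if shard_index(cid, 4) != shard_index(cid, 5))

print("rows per shard, key = customer id:", by_customer)
print("rows per shard, key = country:    ", by_country)
print("keys that move when going from 4 to 5 shards:", moved, "of", len(customers))
```

In this setup the country key cannot place fewer than 700 rows on a single shard, because every "US" row hashes to the same place. The last line shows why rebalancing is expensive under this naive routing scheme.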

Conclusion

Partitioning and sharding enhance database performance, availability, and scalability. However, they also introduce additional complexity. It is essential to weigh the benefits against the complexities and explore other performance-enhancing techniques before implementation.
