Large datasets, slow queries, now what?

Background

With unprecedented, unrelenting data growth, large-scale applications are more prevalent than ever. As data accumulates, executing queries on large datasets becomes more challenging: queries take longer to run and, in many cases, slow to a crawl, resulting in a poor user experience. To address this issue, data partitioning is a common approach used in database management systems.

Data partitioning is the process of dividing a large dataset into smaller logical subsets known as partitions. Its main advantage is that it provides a way to quickly access and process large datasets. Data sharding is an advanced form of data partitioning in which larger datasets are split into multiple shards and distributed across multiple nodes within a cluster or network. Shards are usually formed from specific ranges of values in the dataset. Data sharding allows for faster performance, better scalability, and better use of resources with minimal storage costs. It also provides better fault tolerance than plain data partitioning: if one node fails, only the shards on that node are affected rather than the entire dataset.

With data partitioning, databases can handle increasing amounts of data without sacrificing performance or getting bogged down by slow processing times. In today's digital world, where data is a critical component of many businesses and organizations, this technique plays a crucial role in ensuring databases can meet the heightened demands of the modern world.

CAP Theorem

The CAP Theorem is a concept in distributed computing stating that it is impossible for a distributed system to simultaneously guarantee all three of the following properties: consistency, availability, and partition tolerance. At any given time, a distributed system can guarantee at most two of them.


In the absence of a network partition, a system can provide both consistency and availability; in the event of a partition, it must choose between the two.

Note: The consistency in the CAP Theorem is not the same as the consistency in ACID database transactions.



Consistency refers to the requirement that all nodes in a distributed system see the same data at the same time. Availability refers to the requirement that every request to the system receives a response, without guarantee that it contains the most recent version of the data. Partition tolerance refers to the requirement that the system continues to function even when communication between nodes is lost.

Understanding the CAP Theorem and the trade-offs it represents is essential for making informed decisions about the design and implementation of a partitioned or sharded database. For example, if the highest priority is consistency then the system may sacrifice some level of availability or partition tolerance.

Partitioning Types

Horizontal Partitioning

Horizontal partitioning distributes the rows of a table across multiple databases and servers. To implement a sharded database architecture, you partition data horizontally and distribute the partitions across database shards, using strategies such as list partitioning, range partitioning, or hash partitioning. Each shard contains a subset of the entire dataset, which makes sharded databases more resilient to outages: the database can replicate backup shards on additional nodes, minimizing the impact of an outage. When multiple tables are bound by foreign key relationships (e.g., an Employee ID), you can achieve horizontal partitioning by using the same partition key (a shard ID) on all tables involved. The data spans multiple tables but belongs to one partition key and is distributed to one database shard. The following diagram shows an example of horizontal partitioning with a classic example of employee expense reports (Employee and Expense tables).
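The co-location idea above can be sketched in a few lines. This is a minimal, hypothetical simulation (the shard count, table names, and in-memory "shards" are illustrative assumptions, not a real database API): both tables hash the same partition key, so related rows always land on the same shard and the expense-report join never crosses shard boundaries.

```python
NUM_SHARDS = 4  # assumed cluster size for illustration

def shard_for(employee_id: int) -> int:
    """Map the shared partition key to a shard; both tables use the same key."""
    return employee_id % NUM_SHARDS

# Simulated shards: each holds a subset of both tables.
shards = [{"employee": [], "expense": []} for _ in range(NUM_SHARDS)]

def insert_employee(employee_id: int, name: str) -> None:
    shards[shard_for(employee_id)]["employee"].append(
        {"id": employee_id, "name": name}
    )

def insert_expense(employee_id: int, amount: float) -> None:
    # Same key, same routing function -> same shard as the employee row.
    shards[shard_for(employee_id)]["expense"].append(
        {"employee_id": employee_id, "amount": amount}
    )

insert_employee(42, "Ada")
insert_expense(42, 118.40)

s = shard_for(42)
print(shards[s]["employee"], shards[s]["expense"])
```

Because both rows share `employee_id`, a query for one employee's expense report touches exactly one shard.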


Vertical Partitioning

Vertical partitioning stores tables and/or columns in separate databases or tables; its most common use is to reduce the I/O and performance costs of fetching frequently accessed items. The difference between horizontal and vertical partitioning lies in how they organize the data: horizontal partitioning manages the data in a unified schema across shards, while vertical partitioning organizes the data by column and stores it in separate databases. Vertical partitioning works at a domain-specific level, with application code routing reads to the appropriate database. This is particularly useful for column-oriented data stores (e.g., Cassandra, Druid, HBase).
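A minimal sketch of the hot/cold column split described above, assuming a hypothetical user-profile table (the store names, columns, and dict-backed "databases" are illustrative, not a real storage API). The common page load reads only the hot partition; application code reassembles the full row on demand.

```python
# Frequently read columns ("hot") live in one store,
# rarely read bulky columns ("cold") in another.
hot_store = {}   # e.g., a fast primary database: id -> name, email
cold_store = {}  # e.g., a cheaper secondary store: id -> bio, photo

def save_user(user_id, name, email, bio, photo):
    hot_store[user_id] = {"name": name, "email": email}
    cold_store[user_id] = {"bio": bio, "photo": photo}

def get_profile_header(user_id):
    # The common case touches only the hot partition, cutting I/O.
    return hot_store[user_id]

def get_full_profile(user_id):
    # Application code routes both reads and reassembles the row.
    return {**hot_store[user_id], **cold_store[user_id]}

save_user(1, "Ada", "ada@example.com", "Pioneer of computing.", "ada.png")
print(get_profile_header(1))
```

The design choice here is that the split follows access patterns, not the logical schema: any column combination that is read together belongs in the same partition.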

Partitioning Techniques

  • Range partitioning is a type of data partitioning technique used to divide large sets of data into smaller, more manageable chunks. It works by splitting the dataset based on specific ranges of values for certain columns within the dataset. This ensures that all records sharing the same range of values will be grouped together and stored in the same partition. Range partitioning can help improve query efficiency as well as reduce overhead time and storage costs, since each logical subset of data will now have its own dedicated resources.
  • List partitioning categorizes records based on the presence of certain values within specific columns. This means that records matching specific values will be grouped together and stored in the same partition. This method of dividing data not only helps improve query performance but also helps reduce storage costs since each distinct set of data now has its own dedicated resources.
  • Functional sharding divides a dataset into multiple databases based on the function each database needs to perform. For example, an e-commerce platform might define one database for product information, another for customer data, and yet another for sales and financials. This allows the platform to separate the data into databases optimized for their specific uses.
  • Directory-based sharding stores metadata about where each piece of data is located so that queries can be routed accordingly. It assigns each record an identifier that links it to a specific database or table in a directory structure. Directory-based sharding can be useful when large datasets cannot easily be split into horizontal or vertical shards.
  • Range-based sharding assigns a contiguous range of key values to each shard, so applications can serve queries with predefined range filters from a single shard. The approach was popularized by Google's Bigtable and has since become an important part of modern distributed database technology.
  • Hash-based sharding

A hash function maps each record's partition key to a shard, so records are placed according to their hash value rather than the key's natural order. There are two types of hashing:

  1. Static Hashing: The number of data buckets is fixed, so a given key always hashes to the same bucket address. Lookups are simple, but resizing is inflexible, making it less adaptable than its dynamic hashing counterpart.
  2. Dynamic Hashing: Data buckets are added and removed dynamically and on demand. This promotes efficiency, but changing shard boundaries increases the workload of update operations and can degrade performance during rebalancing.

Hash-based sharding distributes an application's data across multiple nodes or databases based on the output of a hash function applied to the partition key. Compared to range-based sharding, it spreads load more evenly and avoids the overhead of computing and rebalancing range boundaries as the data grows. The trade-off is that range queries become expensive, since adjacent key values are scattered across shards. When combined with efficient server-distribution algorithms such as consistent hashing, hash-based systems can redistribute data across multiple locations with minimal movement, further improving their efficiency.

  • Eventually consistent sharding is designed to handle large and dynamic datasets by scaling out horizontally with minimal downtime. It relies on eventual consistency: any change made to the dataset will eventually be reflected everywhere in the system, but not necessarily immediately. This lets applications maintain scalability and performance while still converging on consistent data. It also makes it easier to balance load across nodes and to scale out, since additional nodes can be added without significantly disrupting the operation of the system.
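Several of the routing strategies above can be sketched as small pure functions. This is an illustrative sketch only: the range boundaries, region lists, directory entries, and shard count are made-up assumptions, and real systems store this routing state in catalogs rather than constants.

```python
import bisect
import hashlib

# Range partitioning: split on ordered boundaries of the key.
RANGE_BOUNDS = ["g", "n", "t"]  # partitions: A-F, G-M, N-S, T-Z

def range_partition(name: str) -> int:
    # bisect_right finds which boundary interval the first letter falls in.
    return bisect.bisect_right(RANGE_BOUNDS, name[0].lower())

# List partitioning: group rows whose key matches an explicit value list.
REGION_LISTS = {0: {"us", "ca"}, 1: {"de", "fr"}, 2: {"jp", "kr"}}

def list_partition(region: str) -> int:
    for part, values in REGION_LISTS.items():
        if region in values:
            return part
    raise KeyError(f"no partition lists region {region!r}")

# Directory-based sharding: an explicit lookup table maps each key to a shard.
DIRECTORY = {"acme-corp": 3, "globex": 1}

def directory_partition(tenant: str) -> int:
    return DIRECTORY[tenant]

# Hash partitioning: a stable hash spreads keys evenly across shards.
NUM_SHARDS = 8

def hash_partition(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

print(range_partition("Smith"), list_partition("fr"),
      directory_partition("acme-corp"), hash_partition("Smith"))
```

Note the contrast: `range_partition` keeps adjacent names together (good for range scans), while `hash_partition` deliberately scatters them (good for even load).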

Data Partitioning Advantages and Challenges

The main reason to partition data is scalability

Different partitions can be placed on different nodes in a shared-nothing cluster. A large dataset can be distributed across many disks, and the query load can be distributed across many processors. For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes. Large, complex queries can potentially be parallelized across many nodes, although this gets significantly harder. Some of the main advantages of data partitioning are:

  • Scalability: Data partitions can be spread across multiple servers, and the system can continue to scale out (think auto-scaling in a cloud-based solution) by adding more servers in response to increased demand.
  • Performance: Each partition contains a subset of the data, which supports faster data access, allows operations to be parallelized across multiple partitions, and reduces network latency by locating each partition close to the application that uses it.
  • Availability: No single point of failure. If a server fails, only the data in that partition becomes unavailable. Replicating partitions ensures continuity of business operations and improves data resiliency regardless of planned outages like routine maintenance or unplanned outages such as unexpected failures.
  • Security: Data can be partitioned into separate sensitive and non-sensitive data partitions. Sensitive data partitions can have their own servers, data stores, and tailored protocols to ensure their safeguarding.
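The parallelism behind the performance bullet above is often implemented as scatter-gather: fan a query out to every shard, let each compute a local partial result, then merge. A minimal sketch, with made-up shard contents standing in for real database nodes:

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated shards: each maps an employee to their expense total.
shards = [
    {"alice": 120, "bob": 80},
    {"carol": 200},
    {"dave": 50, "erin": 75},
]

def shard_total(shard):
    # Each node computes its local aggregate independently (the "scatter").
    return sum(shard.values())

def total_expenses():
    # Query all shards in parallel, then merge partials (the "gather").
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(shard_total, shards)
    return sum(partials)

print(total_expenses())  # 525
```

Single-shard queries skip the gather step entirely, which is why choosing a partition key that matches common access patterns matters so much.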

While partitioning data can improve scalability, performance, availability, and security, it also presents its own unique set of challenges. Data partitioning can increase complexity in a few ways:

  • It can be difficult to determine the correct partition key and/or criteria when dividing a large dataset into logical subsets. The wrong choice of partition key or criteria can lead to inefficient access to data or suboptimal performance.
  • It becomes more complex to maintain data integrity with data partitioning since each node needs to be kept in sync with the other nodes associated with the same dataset. This means that each node needs to have an up-to-date copy of the dataset for consistency purposes. Additionally, if a node containing certain partitions fails, extra steps have to be taken to replicate these partitions on other nodes as quickly as possible in order to maintain continuity of operations.
  • There is also an added challenge when scaling out due to the need for different nodes to communicate and coordinate between one another in order to properly redistribute shards across multiple nodes while retaining their original state. This process can become quite complex depending on the size and nature of the dataset being handled by the system. Furthermore, additional security measures need to be implemented since multiple nodes can now be holding sensitive information related to the dataset in question.
  • A hotspot is a partition with disproportionately high load. For example, suppose a CRM solution partitions its customer database by the first letter of the customer name, A through Z (26 partitions, one per letter). A naive approach assumes each partition holds 1/26 ≈ 3.8% of the data, but in reality 30% of customer names might start with S. The S partition would then serve roughly 8X the demand the uniform estimate predicts, and over 10X the load of an average sibling, assuming the remaining 25 partitions share the data evenly. Rebalancing hotspots brings its own costs: data access logic must be adjusted when a new partition is introduced, frequently changing data can be left temporarily inconsistent during migration, and large quantities of existing data may have to be moved from one partition to another, which is time-consuming and disruptive for users.
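The skew in the hotspot example can be checked with a quick calculation. Note the multiplier depends on the baseline: comparing against the naive uniform estimate gives a different number than comparing against the actual load on the other 25 partitions.

```python
partitions = 26
hot_share = 0.30                                     # 30% of names start with S
uniform_share = 1 / partitions                       # ~3.8% if data were uniform
sibling_share = (1 - hot_share) / (partitions - 1)   # ~2.8% for each other partition

# Two ways to state "how hot" the S partition is:
print(f"vs the uniform estimate:  {hot_share / uniform_share:.1f}x")
print(f"vs an average sibling:    {hot_share / sibling_share:.1f}x")
```

Either way, a single partition absorbs close to an order of magnitude more traffic than its design assumed.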

Choosing the right data partitioning design depends on your workload, data size, and scalability objectives.

Following are some questions to consider when designing data partitions:

  • Which data is most frequently accessed?
  • Which data access pattern will most likely cause degradation in performance?
  • Which workload will require high throughput?
  • How often will the data be accessed?
  • How will you determine which partitions are most critical?

It is essential to determine what kind of data will be stored in each partition. Partitions are organized into groups according to their key column. Partitions should not overlap, and in general, dividing data into separate partitions can increase scalability.

Best Practices

  1. Choose the Right Partition Strategy: The choice of partition strategy will have a major impact on the performance and scalability of the database. It is important to carefully consider the type of data being stored and the specific requirements of the database when choosing a partition strategy.
  2. Consider the Trade-Offs: The trade-offs between consistency, availability, and partition tolerance must be carefully considered when implementing Data Partitioning and Data Sharding. The best partition strategy will depend on the specific requirements of the database, and the trade-offs between these properties must be carefully weighed.
  3. Plan for Scalability: When implementing data partitioning and data sharding, it is important to plan for scalability. The database should be designed so that it can easily be scaled as the amount of data grows, such as by planning for the addition of new partitions or shards as needed.
  4. Monitor Performance: It is important to monitor the performance of the database regularly to ensure that it is operating efficiently. Performance monitoring can help to identify bottlenecks and other issues that may be affecting the performance of the database.
  5. Apply Automation: Automated processes can help to ensure that the database remains well-organized and optimized. For example, automated processes can be used to keep track of new data partitions or shards as they are added, ensuring that all changes are tracked and accounted for. Automation can also help with other maintenance tasks such as index updates, backups, and other system monitoring tasks.
  6. Ensure Security: It is important to ensure that the data partitions or shards used in the database are secure from unauthorized access by implementing encryption and authorization protocols.
  7. Test, Test, Test: Before deploying a partitioned or sharded database to production, it is important to thoroughly test the implementation to ensure that it is working correctly (i.e., testing the partitioning strategy, the sharding strategy, and the scalability of the database).

Conclusion

Data partitioning and data sharding are crucial techniques for achieving the scalability and performance that modern databases require. The CAP Theorem provides a useful framework for understanding the trade-offs involved in these techniques, and different partition strategies can be used to balance these trade-offs in accordance with the specific requirements of a given database. By following best practices for Data Partitioning and Data Sharding, organizations can maximize the benefits of these techniques and ensure that their databases are able to meet the demands of the modern world.

Whether you are a database administrator, a software engineer, or simply someone looking to improve the performance and scalability of your database, understanding the basics of Data Partitioning and Data Sharding is essential. By embracing these techniques and following best practices, you can take your database to the next level, ensuring that it is able to deliver the performance and scalability that your organization needs to succeed in the digital age.

FAQs

Q: What is Data Partitioning?

A: Data Partitioning is the process of dividing a large database into smaller, more manageable pieces called partitions. This allows for improved performance and scalability, as well as reduced downtime in the event of a failure.

Q: What is Data Sharding?

A: Data Sharding is a specific type of Data Partitioning in which the data is divided into smaller pieces called shards and distributed across multiple servers. This allows for improved performance and scalability, as well as reduced downtime in the event of a failure.

Q: What is the CAP Theorem?

A: The CAP Theorem is a framework for understanding the trade-offs involved in Data Partitioning and Data Sharding. It states that it is impossible for a distributed system to simultaneously provide Consistency, Availability, and Partition Tolerance. Therefore, it is important to carefully balance these trade-offs when implementing Data Partitioning and Data Sharding.

Q: What are the trade-offs between Consistency, Availability, and Partition Tolerance?

A: Consistency refers to the requirement that all nodes in a distributed system see the same data at the same time. Availability refers to the requirement that the system is always available, even in the event of a failure. Partition Tolerance refers to the requirement that the system continues to operate, even in the event of a network partition. The trade-off between these properties is that as the requirement for consistency increases, the requirement for availability and partition tolerance decreases, and vice versa.

Q: What are the best practices for Data Partitioning and Data Sharding?

A: The best practices for Data Partitioning and Data Sharding include choosing the right partition strategy, considering the trade-offs, planning for scalability, monitoring performance, applying automation, ensuring security, and thoroughly testing the implementation before deployment. By following these best practices, organizations can ensure that their Data Partitioning and Data Sharding implementations are effective and efficient.

Q: Why are Data Partitioning and Data Sharding important for modern databases?

A: With the increasing volume of data being generated and stored in modern databases, scalability and performance are becoming increasingly critical issues. Data Partitioning and Data Sharding are essential techniques for addressing these issues, as they allow for improved performance and scalability, as well as reduced downtime in the event of a failure. By embracing these techniques, organizations can ensure their databases are able to meet the demands of the modern world.
