Database Indexing and Partitioning

Syed Talha Alam

Manager Research and Data Analytics @ K-Electric | Data Analytics, Qlik Sense, Power BI, ETL, Data Warehouse, Data Pipelines, Data Modeling, Semantic Modeling, Self Service BI, BI Reporting & Dashboarding

Published: March 26, 2023

Modern applications that need fast data access depend on efficient data storage and retrieval. This article explores how indexing and partitioning achieve this, along with their respective best practices and anti-patterns. A database index is an internal data structure used to look up a record quickly; it can improve database performance by reducing the amount of data a query has to read. Partitioning arranges the data in a database so that it can be accessed more efficiently, and it speeds up queries by reducing the amount of data they have to scan. Partitioning can also be done at the database level to separate archive records from recent data, which keeps queries on recent data fast.

I use a hypothetical connected vehicle as a running example. The vehicle collects data from various sensors to enhance the driving experience, and fast access to the database is crucial for making split-second decisions. The characteristics, commonalities, and differences of indexing and partitioning are discussed in relation to the database this vehicle uses.

Let’s consider a table that a connected vehicle uses to log its location every second. The vehicle inserts a new record into a table called “Location”. There are no deletes or updates to this table, just inserts.

CREATE TABLE [dbo].[Location] (
	[location_id] [bigint] NOT NULL,
	[latitude] [int] NOT NULL,
	[longitude] [int] NOT NULL,
	[log_time] [datetime] NOT NULL
) ON [PRIMARY]

Notice that the table is defined without any indices, so its records are not ordered by any field. Reading such a table is like reading a book without page numbers: one has to scan every page looking for the relevant information. This wastes valuable time and resources, making it the worst way to read a book or a table.

Indices bring order to tables, much as page numbers bring order to a book. A clustered index determines the order in which a table's rows are stored. It can be defined on a single column or on a group of columns that together form a unique value, which is known as a composite key.

ALTER TABLE dbo.Location
ADD CONSTRAINT PK_Location PRIMARY KEY CLUSTERED (location_id) ON [PRIMARY]

In this case, the table's records are ordered by the location_id column. Database indices are implemented using data structures such as B-trees and hash tables. The B-tree is the most commonly used, and it organizes the index as a tree of pages with a root level, intermediate levels, and a leaf level. This structure has several advantages, including efficient random searches and sequential access, as well as the ability to handle insertions and deletions by keeping pages partially full.

[Figure: B-Tree structure of a clustered index]
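The levels of the B-tree can be inspected directly. The query below is a minimal sketch, assuming SQL Server, the dbo.Location table defined above, and that PK_Location is its clustered index (index_id 1). It uses the built-in sys.dm_db_index_physical_stats function in DETAILED mode, which reads every page of the index, so it is better suited to a test copy than to a table with billions of rows.

```sql
-- One row per level of the clustered index B-tree.
-- index_level 0 is the leaf level; the highest index_level is the root.
SELECT index_level,
       page_count,
       record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                       -- current database
         OBJECT_ID(N'dbo.Location'),    -- table from the example above
         1,                             -- index_id 1 = clustered index
         NULL,                          -- all partitions
         'DETAILED')                    -- scan every level, not just the leaf
ORDER BY index_level DESC;
```

The root level should report a single page, with the page count growing toward the leaf level.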

How has the addition of a clustered index improved performance? Take a query that retrieves records with location_id between 5 million and 5.2 million. With the index, the RDBMS can seek directly to the row with location_id 5,000,000, read the subsequent rows up to 5,200,000, and stop without having to scan the rest of the table.
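As a sketch, the range query described here could be written as follows, assuming the dbo.Location table and the clustered primary key on location_id from the earlier examples:

```sql
-- Range lookup on the clustered key: the engine can seek to the first
-- matching row and read forward until location_id exceeds the upper bound.
SELECT location_id, latitude, longitude, log_time
FROM dbo.Location
WHERE location_id BETWEEN 5000000 AND 5200000
ORDER BY location_id;
```

Because the rows are already stored in location_id order, the ORDER BY should not require a separate sort step.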

This can save a significant amount of time in a table with potentially billions of rows. It's worth noting that indexes can also be non-clustered and used to search for data in any column. For example, suppose we need to find the location_id values logged at a specific log_time:

SELECT
	location_id
FROM
	[Location]
WHERE
	log_time = '2022-05-30 00:16:35.993'

When examining the query's execution plan, it is evident that the RDBMS opts for a scan operation to locate the data, which is both time-consuming and costly.

It is possible to create an additional index that enables queries to search by log_time, as follows:

CREATE NONCLUSTERED INDEX [IX_location_log_time] 
ON [dbo].[Location] ([log_time] ASC)

Creating a new index allows queries to use it when searching for data using the log_time column in WHERE conditions, resulting in a faster and more efficient "Seek" operation when fetching the data.
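One way to observe the difference, besides reading the execution plan, is to compare the number of pages the query touches. The following is a sketch assuming SQL Server; run it once before creating IX_location_log_time and once after, then compare the "logical reads" figure reported in the Messages output.

```sql
-- Report page reads for each statement that follows.
SET STATISTICS IO ON;

-- The ISO 8601 form with 'T' is used so the literal is parsed the same way
-- regardless of the session's language and date format settings.
SELECT location_id
FROM dbo.Location
WHERE log_time = '2022-05-30T00:16:35.993';

SET STATISTICS IO OFF;
```

With the index in place, logical reads should drop from roughly the size of the table to a handful of pages, although the exact numbers depend on the data.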

Excellent! So if indexes are so beneficial and accelerate our queries, why not create them for each column? The reason is that they come at a cost.

Each index requires effort to build and sustain. Essentially, when an index is created, we are copying data from the designated column(s) and forming our B-Tree structure, which then expands the database with new pages. This not only takes up storage but also has significant implications when we perform operations like INSERT, UPDATE, or DELETE on the table, since these pages must be modified or removed as well. As a result, indices can slow these operations down and must be used prudently on tables with frequently modified data.

Note: Which columns are most suitable for indices? Ideally, we should pinpoint columns in large tables that are frequently used in JOIN conditions and index those columns. This optimizes any queries that use those joins. We should also identify columns that are frequently used in GROUP BY queries and index them, as this speeds up aggregation queries. Whenever we need to sort data, locate rows by column values, or correlate data across tables, indices can improve query performance.
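To judge whether an index earns its maintenance cost, its read and write activity can be compared. The query below is a minimal sketch, assuming SQL Server and the dbo.Location table; it uses the sys.dm_db_index_usage_stats view, whose counters are reset when the server restarts, so the figures only cover the period since then.

```sql
-- Reads (seeks, scans, lookups) versus writes (updates) per index.
-- An index with many user_updates but few reads costs more than it returns.
SELECT i.name AS index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.Location');
```

Indexes that show NULL in the counter columns have not been used at all since the counters were last reset.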

Partitioning basics & Best Practices

Partitioning involves dividing large datasets into smaller units to improve query efficiency, enhance data security, and reduce disk contention, among other benefits. There are two main ways to partition data in a database: database partitioning and table partitioning. While both methods involve splitting a large database into smaller, more manageable parts, they differ in their implementation. The choice of which method to use depends on the specific needs of the application and the database architecture. Support also varies by engine and version: PostgreSQL, MySQL, and SQL Server all offer native table partitioning, each with its own syntax and limitations, so it is worth checking the documentation for the engine in use. A major advantage of partitioning is that it enables parallelism, which improves performance in high-throughput applications that store vast amounts of data, such as OLTP systems and large data analytics systems.

Database Level Partitioning:

Dividing a database into distinct sections, called partitions, is known as database partitioning. Each partition holds its own portion of the data and works as an independent database. This separation allows you to scale your application across multiple servers without affecting its functionality. Furthermore, queries that target a single partition run against a smaller database, which makes them more efficient and speeds up the application.
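For the archive case mentioned in the introduction, a database-level split might look like the sketch below. It assumes SQL Server, a second database named LocationArchive (a hypothetical name) on the same instance, and a dbo.Location table in that database with the same columns. The one-month cutoff is also an assumption made for illustration.

```sql
-- Move rows older than one month into the (hypothetical) archive database.
SET XACT_ABORT ON;  -- roll back the whole transaction if any statement fails

DECLARE @cutoff datetime = DATEADD(MONTH, -1, GETDATE());

BEGIN TRANSACTION;

INSERT INTO LocationArchive.dbo.Location
       (location_id, latitude, longitude, log_time)
SELECT location_id, latitude, longitude, log_time
FROM dbo.Location
WHERE log_time < @cutoff;

DELETE FROM dbo.Location
WHERE log_time < @cutoff;

COMMIT TRANSACTION;
```

On a table with billions of rows, a single transaction like this would hold locks and grow the log considerably, so in practice the move would more likely be done in smaller batches.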

Table Level Partitioning:

To split a single table into multiple tables, we use a process called table partitioning, which can be done vertically, horizontally, or functionally. This technique is particularly helpful for dealing with enormous tables that contain billions of rows of data, such as the Location table in a connected vehicle application. As time goes on and the table grows, queries accessing the table slow down, even with indexing. To reduce the amount of data without losing historical data, table partitioning can be employed.

Let's assume that the Location table we mentioned earlier has been actively used for over a decade and has billions of rows of data. Additionally, there are two distinct types of queries that are run on this data:

1) Current queries: These queries are executed by the software running in the connected vehicle to make speedy decisions and only need to retrieve Location data for the present month.

2) Historical queries: These queries are utilized for generating reports. They access data that is older than the current month and do not require rapid processing.

With the knowledge of the query types that operate on our table, we can now partition it into two sections: a "Current" partition and a "Historical" partition. Both partitions will have non-clustered indices from the original table.

This partitioning will provide immediate performance benefits to Current queries as they will only need to access the Current partition with a month's worth of data. Similarly, Historical queries will only execute against the Historical partition.

Partitioning the Location table into Current and Historical partitions provides benefits beyond improved query performance. As new data is added to the Current partition, INSERT statements (and any UPDATE or DELETE statements) will run faster because the non-clustered index on the Current partition is smaller. The smaller data size also improves the speed of Current queries.
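The Current/Historical split described above can be sketched with SQL Server's native table partitioning. All object names below (pf_LocationByAge, ps_LocationByAge, LocationPartitioned) are hypothetical, and the boundary date of 1 May 2022 is an assumption chosen only so that the earlier sample log_time falls in the Current partition. Since the "current month" moves over time, a real setup would have to adjust the boundary periodically, which this sketch does not cover.

```sql
-- Partition function with a single boundary:
-- partition 1 = Historical (log_time <  boundary)
-- partition 2 = Current    (log_time >= boundary)
CREATE PARTITION FUNCTION pf_LocationByAge (datetime)
AS RANGE RIGHT FOR VALUES ('20220501');

-- Both partitions are mapped to the PRIMARY filegroup to keep the sketch simple.
CREATE PARTITION SCHEME ps_LocationByAge
AS PARTITION pf_LocationByAge ALL TO ([PRIMARY]);

CREATE TABLE dbo.LocationPartitioned (
    location_id bigint   NOT NULL,
    latitude    int      NOT NULL,
    longitude   int      NOT NULL,
    log_time    datetime NOT NULL,
    -- A unique index on a partitioned table must include the partitioning column.
    CONSTRAINT PK_LocationPartitioned
        PRIMARY KEY CLUSTERED (location_id, log_time)
) ON ps_LocationByAge (log_time);

-- Created without an ON clause, this index follows the table's partition scheme,
-- so each partition gets its own, smaller index structure.
CREATE NONCLUSTERED INDEX IX_LocationPartitioned_log_time
ON dbo.LocationPartitioned (log_time ASC);

-- Row count per partition, to check how the data is distributed.
SELECT $PARTITION.pf_LocationByAge(log_time) AS partition_number,
       COUNT(*) AS row_count
FROM dbo.LocationPartitioned
GROUP BY $PARTITION.pf_LocationByAge(log_time);
```

A Current query that filters on log_time, for example WHERE log_time >= '20220501', gives the optimizer the information it needs to read only partition 2 and skip the Historical data.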

Conclusion

Optimizing database performance involves using various tools such as indexing and partitioning. It is crucial to weigh the advantages and disadvantages of each method when selecting the appropriate one. Both techniques aim to reduce the data volume accessed by queries for faster execution. Indices are most effective on tables that have low data churn, meaning tables that undergo fewer INSERTs, UPDATEs, and DELETEs, while partitioning is more efficient for handling operations on large tables. In addition, partitioning can serve as an excellent approach for storing older records.
