登录查看更多内容

How data is organized on Disk in database systems

Manish Pokhriyal

Building scalable systems

发布日期: 2024年6月23日

One of the most important job of a database system is to store data and also allow quick access to the data stored in database. But there are questions which most of us have.

How is the data organized when we store it in database ?
Why do we need a database management system and not just a bunch of files?
How does file organization improve efficiency ?

Database systems uses files to store the data, but instead of relying on file system hierarchies of directories and files (as we use in our daily life for organising our files on the windows, linux, mac os) for storing and locating the data. Databases uses specific file formats which is more efficient to handle and work with.

Disclaimer : The article addresses the techniques or storage internals used mostly in relational databases. I am primarily assuming the storage hardware is HDD and the same concepts are extended to the SSDs with minor changes and is abstracted by the SSD controller.

Main reasons databases use different file organisation over flat files

[Optimising Storage]

Files are organised in a way that optimises storage overhead per stored data record. We don't want to end up with disk space wastage while storing the records.

[Optimising record Access]

Records can be located or found on disk in smallest possible numbers of steps. In other words optimising the disk seeks.

[Optimising Update]

Aiming to optimise the record updates in a way that it minimises the number of changes on the disk.

Databases store data in tables

Now the question is what is a table ? How does it look like ?

To clarify this, the table is a logical view or representation of similar data or record (which follows some structure like user profile, order details etc). We use tables in the database to visualise the data in our brain and for display purpose. Tables make it easy to understand the data.

What is a record ?

A Record is a single entity telling us more detailed information about something. The detailed information is represented by multiple fields. This is very similar to the Class in OOP.

In the exmaple above we have 3 records in the table. Each row represents a record.

Each table is usually represented by a separate file

Each record in the table can be looked up using a search key which is one of the fields of the record. This is done by selection expression in relational algebra. Often it is represented by the 'where' clause in the queries. (ex. where name="manish").

To locate a record - database uses the indexes only if available, else database need to search through the whole file to locate the record. This is called the heap scan/ full table scan.

Indexes are the data structure that provides the efficient way to locate the records on disc without scanning the whole file/table.
Indexes are stored in index-specific files on-disk and any updates are persisted to the disk periodically. However some part of the index is kept in memory to reduce the disk seeks.
Indexes are built using one or more fields of the records.

For each table database system usually maintains two primary files

Data file - stores data records
Index file - stores the record metadata using some on disk data structure (usually B-trees).

Index files are usually smaller than the data files, as we only store subset of the record instead of the whole record.

However there is an exception to this i.e index organised files, we will learn about them shortly.

The database files (data or index) are partitioned into Pages, which typically have the size of a single or multiple disk blocks (or sectors).

Pages are the fundamental units of I/O operations and memory management in database systems.

In simple terms databases are not programmed to read or write a single record from the disk.

Databases reads the whole page from the disk even if we are looking for single record and brings it to the working main memory.

For writing a record to the database, it can not write in-place on the disk. First the page is identified where the record needs to be written. Databases navigate to the page either using indexes or a sequential heap file scan. Once the page is located, it is brought to the working memory and the new record is added/appended and the whole page is overwritten back to the disk.

The diagram below should help visualize things.

Table representation on database file and disk

For now let's not overcomplicate things and understand that the pages are the basic unit which database reads from and writes to the disk. Every page contains some 'n' numbers of records.

For example : If a database page size is 8kb and assuming that every record is of fixed size 1kb then we might only be able to store ~6-7 records per page, given some space is allocated for the page-headers and metadata.

We will learn about the page structure in detail in upcoming articles.

Data Files

Data files where the actual records are stored can be implemented as :

Heap Organised Tables
Index Organised Tables

Heap Organised Tables / Heap file

Records in a heap file need not follow any particular order and mostly they are added in the order they are written.

This ensures that no heap file reorganisation is required when new pages are appended, since we are not maintaining any order for the records. Note that the records are added to the pages and the collection of pages represents the file (In this case heap file)

So this is the reason the heap files require full file scan to locate any record. To overcome this problem we use index files which points to the locations of the records in the heap files.

Index Organised Tables/Files

Index organised tables stores both index and the data records in the same file. So we follow the index and locate the whole record which is also present in the same file.

领英推荐

SQL Server Fine-Tune & Optimization Guide

Arunagiri Gopal 1 年前

Howto Validate years of SQL Audit files with a…

CloudXcellence 1 年前

What are databases, and why might I need a database?

LocalTeam Australia 1 年前

As the records are stored in the sorted order of some key (on which index is built), range scans in Index org. tables can be implemented by sequentially scanning its contents

While this was not the case with the heap files, where range scans would take database to read/scan the whole file.

We are storing the whole record/actual data in the index, this is different from having a separate index file where we only store subset of the data in the index and we might have the references to the heap file for the complete row or record.

B+ Tree with Index Organised Table structure

Storing whole data in the index file reduces the disk seek by at least one, as we do not need to move to another disk location for the data to be located, everything is in the same file.

Index Files

Before we learn more about the index files. It is necessary to understand what is an index ?

An index is a structure that helps database to efficiently retrieve the data records from the data files.

We can relate the database index to the binary search tree data structure which is used to efficiently search the key we are looking for. But binary search trees are loaded into the RAM, it is available in memory till we are running the program and the computer is powered on.

It is important to note that we navigate to the left/right child pointers to reach the key we are looking for. Where left & right child are basically the random memory location in the RAM which was allocated to the node dynamically. We can save binary tree to the file but when we load it again the memory locations for the nodes will never be same.

However if we talk of the database indexes, they are made on top of subset of columns of the record.

When database navigate through the index to reach the key we are looking for, Disk head moves to multiple physical disk locations (known as disk seek). Upon moving to the search key, database get to know the disk location of the actual page where the complete record exists which is a part of data file.

In summary, Index files are organised as specialised disk based structures that map keys to locations in data files where the records are stored. Data file could be the heap file or the Index organised file.

Primary Index

A primary index is an index structure created on the primary key column(s) of a table. This index follows all the properties of the primary key such as enforcing uniqueness, not allowing null values and acting as a record identifier etc.

A primary index may or may not be clustered.

A primary index may be built on top of a heap file or with Index organised table (which has both index and data organised in the same file).

B+ Tree with Separate Index and Heap files

Secondary Index

Secondary indexes which again are built on subset of columns of a record can point directly either to the record in the heap file or simply store the reference to the primary Index.

A pointer to a data record can hold an offset to a heap file or an index organised file.
Multiple secondary indexes can point to the same record, where same record is identified by different fields and located through different indexes.
Secondary index files can hold multiple entry per search key, while primary index files strictly holds one unique entry for each key.

We just discussed the two approaches for the secondary indexes to point to the data records.

Whether the records should be referenced directly (through file offset)
Whether the secondary index should navigate to the actual record via the primary key index.

Both of these approaches have some pros and cons. Databases implementing any of this approach has to trade-off the either read or write speed.

By referencing data directly, we can reduce the number of disk seeks (required for lookup in primary index), but have to pay a cost of updating the pointers in all the secondary and primary indexes whenever the record is updated or relocated during the maintenance process.

Using Indirection , where the key in secondary index points to the record via the primary index.

It allows the database to reduce the cost of pointer updates (if the record location changes during maintenance or updates, only the reference in the primary index needs change), but has a higher cost for the reads while locating a record because an extra disk seek is required to read the primary index.

In case of read-heavy workload, having direct record offset may be acceptable as we have less write queries and it might work well for our use-case if we need to update couple of indexes occasionally. But this approach does not work well for write-heavy workloads having multiple indexes, as the the cost for the writes will be very expensive when it happens very frequently.

Clustered Index

A clustered index is a specific type of index where the data rows of a table are stored in the same order as the index key values.

Data records in the clustered case are usually stored in the same file or in a clustered file, where the key order is preserved.

If the data is stored in a separate file, and its order on disk does not follow the index key order, the index is called non-clustered index.

In most B-tree based systems, the primary key index is often implemented as a clustered index by default. This means that the primary key index determines the physical order of the data rows on disk.

However, it's essential to clarify that the terms clustered index and primary index are often used interchangeably because the primary key index is usually clustered. Therefore, in practice, the primary index and clustered index are often the same.

Index-Organised tables store information in index order and are clustered by definition.

Primary indexes are often clustered.

Secondary indexes are non-clustered by definition, since they’re used to facilitate access by keys other than the primary one. Clustered indexes can be both Index-organised or have separate index and data files.

In the presence of a clustered index, there is no separate heap file for the table.

With this I end the article. I will be covering more about the pages and internal structure of it in upcoming article. If you liked the articles and curious to know more about the database internals, please consider subscribing my newsletter.

References

https://en.wikipedia.org/wiki/Disk_sector

https://www.postgresql.org/docs/current/storage.html

https://martin.kleppmann.com/2017/03/27/designing-data-intensive-applications.html

https://www.databass.dev/

Database Story

196 位关注者

Swaraj Andhale

CSE Student at Vidyalankar Institute Of Technology, Mumbai

2 周

This article is really insightful

要查看或添加评论，请登录

Manish Pokhriyal的更多文章

Anomalies in Database Transactions

2024年8月24日

Anomalies in Database Transactions

Isolation levels defines how different transactions which are running concurrently interact with each other in the…

2 条评论
How Transaction Id wraparound in PostgreSQL can cause a database outage

2024年7月30日

How Transaction Id wraparound in PostgreSQL can cause a database outage

Problem Statement - Transaction Id Wraparound Transaction ID wraparound can cause a significant issue in PostgreSQL…
Comparison of B-Tree based Relational vs Graph database

2024年7月12日

Comparison of B-Tree based Relational vs Graph database

In the world of data management, databases play a crucial role in storing, retrieving, and managing data efficiently…
Page - A basic unit of data transfer in databases

2024年6月30日

Page - A basic unit of data transfer in databases

In my recent article we discussed about how the data is organised on disk at higer level where we saw how the data is…

1 条评论
Understanding Memory allocation in Java

2024年6月17日

Understanding Memory allocation in Java

In this article, we will explore the fundamentals of memory allocation in Java programs. We'll demystify the different…

2 条评论

See all articles

How data is organized on Disk in database systems

Manish Pokhriyal

Building scalable systems