Optimizing Database Storage: From Hash Indexes to B-Trees
Grifith Mathew
Data Engineering, Cloud Architecture, and AI | Unlocking Business Insights and Growth
Introduction
Databases perform two critical tasks: storing data and retrieving it efficiently. The way data is structured on disk directly impacts performance. This newsletter explores different storage architectures used in modern databases to optimize these processes.
The Simplest Database (Key-Value Store):
At its core, a simple database can be implemented as a key-value store using basic bash functions.
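For illustration, here is a minimal sketch of that idea in Python rather than bash: writes append a key,value line to a file, and reads scan the whole file for the most recent value of the key. The file name and function names are placeholders, not any particular system's API.

```python
# A minimal append-only key-value store sketch.
# db_set appends; db_get scans the whole file and returns the last match.

DB_FILE = "database.txt"  # hypothetical file name

def db_set(key, value):
    # Appending is cheap: writes are sequential, never in place.
    with open(DB_FILE, "a", encoding="utf-8") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Reads are O(n): the whole log is scanned for the most recent value.
    result = None
    try:
        with open(DB_FILE, encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result

# Usage:
# db_set("42", '{"name": "San Francisco"}')
# db_get("42")
```

Writes are fast because they only ever append; the pain point is the read, which motivates the indexes discussed next.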
Problem Statement:
The Need for Efficient Data Retrieval:
The challenge lies in efficiently locating specific values within a vast dataset. The solution involves using data structures called indexes, which act as signposts to quickly locate data.
Problem: Write overhead. Indexes speed up reads, but every index also has to be updated on each write, so this extra cost must be managed carefully.
Different Types of Indexes:
Several indexing techniques have evolved to address these challenges:
- Hash Indexes
- SSTables and LSM Trees
- B-Trees
Hash Index
Hash indexes are well suited to key-value stores. They work like a dictionary: an in-memory hash map maps each key to a byte offset within the data file. On a write, the record is appended to the file and the hash map is updated; on a read, the hash map gives the offset, and the value is read from that position in the file.
Implementation:
- Bitcask (the default storage engine in Riak)
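For illustration, here is a minimal Python sketch of the mechanism (not Bitcask itself): an append-only data file plus an in-memory dict mapping each key to the byte offset of its latest record. The class and file names are hypothetical.

```python
import os

class HashIndexedLog:
    """Sketch of a hash-indexed append-only log."""

    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}                       # key -> byte offset of its latest record
        open(self.path, "ab").close()         # make sure the data file exists

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        offset = os.path.getsize(self.path)   # the record will start at the current end of file
        with open(self.path, "ab") as f:
            f.write(record)                   # sequential append to the data file
        self.index[key] = offset              # update the in-memory hash map

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None                       # key was never written
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one seek, then one read
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]

# Usage:
# log = HashIndexedLog("example.log")
# log.put("42", '{"name": "San Francisco"}')
# log.get("42")
```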
Problems:
- The hash map of all keys must fit in memory.
- Keeping the hash map on disk instead would require lots of random-access I/O, and range queries are inefficient either way: each key must be looked up individually.
- The append-only data file grows without bound and can eventually become too big for a single disk.
Solution:
Break the log into segments: when the current segment file reaches a certain size, close it and direct subsequent writes to a new segment file. Compaction then runs over the closed segments in a background thread, throwing away duplicate keys and keeping only the most recent update for each key. Once compaction finishes, new read requests are switched over to the merged segment in a rolling fashion, and the old segment files can be deleted if historical states aren't required.
Each segment now has its own in-memory hash table mapping keys to file offsets. To find a key, the most recent segment's hash map is checked first; if the key isn't there, the second-most-recent segment is checked, and so on.
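A hedged sketch of the compaction step described above, assuming each segment is a plain text file of key,value lines and the list of segment paths is ordered oldest to newest:

```python
def compact(segment_paths, output_path):
    """Merge closed segments (oldest -> newest), keeping only the most
    recent value for each key, and write one compacted segment file."""
    latest = {}
    for path in segment_paths:                # later segments overwrite earlier values
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                latest[key] = value
    with open(output_path, "w", encoding="utf-8") as out:
        for key, value in latest.items():
            out.write(f"{key},{value}\n")
    # After this, reads can be switched to output_path and the old
    # segment files (and their hash maps) can be dropped.
```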
Lots of detail goes into making this simple idea work in practice. Briefly, some of the issues that matter in a real implementation are: the file format (a binary encoding is faster and simpler than CSV), deleting records (a special deletion record, or tombstone, marks a key as deleted and is discarded during compaction), crash recovery (the in-memory hash maps are lost on restart and must be rebuilt or restored from snapshots), partially written records (checksums detect and ignore corrupted entries), and concurrency control (typically a single writer thread, since writes are strictly sequential appends).
SSTables and LSM Trees:
Sorted String Tables (SSTables) and Log-Structured Merge Trees (LSM Trees) address the limitations of hash indexes: segment files are kept sorted by key, so the index held in memory can be sparse (one entry for every few kilobytes of file) rather than containing every key, and range queries and segment merging become efficient.
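As an illustration of the sparse index idea, the sketch below records the byte offset of every Nth key in a sorted segment file; a lookup jumps to the closest indexed key at or before the target and scans forward from there. The key,value-per-line file format and the function names are assumptions made for this example.

```python
import bisect

def build_sparse_index(sstable_path, every=100):
    """Record the byte offset of every `every`-th key in a sorted segment."""
    keys, offsets = [], []
    offset = 0
    with open(sstable_path, "rb") as f:
        for i, raw in enumerate(f):
            if i % every == 0:
                keys.append(raw.decode("utf-8").split(",", 1)[0])
                offsets.append(offset)
            offset += len(raw)
    return keys, offsets

def sstable_get(sstable_path, keys, offsets, target):
    """Jump to the closest indexed key <= target, then scan forward."""
    i = bisect.bisect_right(keys, target) - 1
    if i < 0:
        return None                           # target sorts before the first key
    with open(sstable_path, "rb") as f:
        f.seek(offsets[i])
        for raw in f:
            key, _, value = raw.decode("utf-8").rstrip("\n").partition(",")
            if key == target:
                return value
            if key > target:                  # keys are sorted, so we can stop early
                return None
    return None
```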
How is a segment kept in sorted order when writes can arrive in any key order?
The write path works as follows (a sketch follows the list):
- When a write comes in, add it to an in-memory sorted structure (for example a red-black tree or skip list), often called a memtable.
- When the memtable grows past some threshold (typically a few megabytes), write it out to disk as an SSTable file. Because the memtable is already sorted, this is efficient; the new file becomes the most recent segment, and writes continue in a fresh memtable.
- To serve a read, check the memtable first, then the most recent on-disk segment, then the next-older one, and so on.
- From time to time, run merging and compaction in the background to combine segments and discard overwritten or deleted values.
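A toy sketch of this write path; real engines use a balanced tree or skip list for the memtable and write actual SSTable files, whereas here a plain dict sorted at flush time and in-memory "segments" keep the example short:

```python
class LSMish:
    """Toy write path: memtable in memory, flushed to sorted segments."""

    def __init__(self, flush_threshold=4):
        self.memtable = {}                    # stand-in for a balanced tree / skip list
        self.segments = []                    # newest last; stand-ins for SSTable files
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Writing keys out in sorted order is what makes the segment an SSTable.
        sstable = dict(sorted(self.memtable.items()))
        self.segments.append(sstable)
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None
```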
Making an LSM-tree out of SSTables
Modern key-value storage engines such as LevelDB and RocksDB (RocksDB is used by Spark Structured Streaming for stateful processing) follow principles from Google's Bigtable paper, which introduced the terms SSTable and memtable. Cassandra and HBase use similar memtable and SSTable structures for efficient storage.
Similarly, Lucene (which powers Elasticsearch and Solr) uses a comparable approach for its term dictionary.
Key Challenge: Looking up a key that does not exist is slow, because the memtable and every segment back to the oldest must be checked before concluding the key is absent.
Solution: Bloom filters optimize these lookups by approximating set membership; they can say with certainty that a key does not appear, so the disk reads can be skipped.
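A minimal Bloom filter sketch. The parameters m (bits) and k (hash functions) are arbitrary here; production engines size them from the expected key count and the target false-positive rate.

```python
import hashlib

class BloomFilter:
    """k hash functions over an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, key):
        # Derive k independent positions by salting the key before hashing.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# Before searching the segments for a key, consult the filter:
# if not bloom.might_contain(key): the key is definitely absent, skip the disk scan.
```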
Compaction Strategies: Different engines compact and merge SSTables in different ways. The two most common families are size-tiered compaction, where newer and smaller SSTables are successively merged into older and larger ones (used by HBase and supported by Cassandra), and leveled compaction, where the key range is split into smaller SSTables organized into levels so compaction can proceed more incrementally (used by LevelDB and RocksDB, and also supported by Cassandra).
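Whichever strategy is used, the core merge step is mergesort-like because segments are sorted. A sketch, assuming each segment is an in-memory list of (key, value) pairs and the segments are ordered oldest to newest:

```python
import heapq

def merge_sstables(segments):
    """Merge sorted (key, value) segments into one sorted run,
    keeping only the newest value for each key."""
    # Tag each entry with the negated segment age so that, for equal keys,
    # the entry from the newest segment sorts first.
    tagged = [
        [(key, -age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue                          # a newer value for this key was already kept
        merged.append((key, value))
    return merged

# Usage:
# old = [("apple", "1"), ("pear", "2")]
# new = [("apple", "9"), ("melon", "5")]
# merge_sstables([old, new])  ->  [("apple", "9"), ("melon", "5"), ("pear", "2")]
```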
B-Trees
The log-structured indexes discussed so far are gaining acceptance, but they are not the most common type of index. The most widely used indexing structure is quite different: the B-tree, the standard index implementation in almost all relational databases (and many non-relational ones).
B-trees break the database down into fixed-size pages (traditionally around 4 KB) and read or write one page at a time. Pages form a tree: a root page holds keys and references to child pages, each child covering a contiguous range of keys, down to leaf pages that contain the values themselves.
B-tree (Updates): To update a value, the engine finds the leaf page containing the key, changes the value in that page, and writes the whole page back to its location on disk. To add a new key, the page whose range includes it is located and the key is inserted; if that page is full, it is split into two half-full pages and the parent is updated. Unlike log-structured indexes, B-trees modify pages in place, so a write-ahead log (WAL) is typically kept to make the tree resilient to crashes.
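A toy sketch of the lookup path, descending one page per level from the root. Real B-trees store fixed-size pages on disk and keep themselves balanced via page splits; this in-memory version only shows the traversal, and the class and function names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A toy B-tree page: `keys` separate child ranges in internal pages;
    leaf pages hold keys alongside their values."""
    keys: list = field(default_factory=list)
    children: list = field(default_factory=list)   # empty for leaf pages
    values: list = field(default_factory=list)     # only used in leaf pages

def btree_get(node, key):
    """Descend from the root, one page per level, until a leaf is reached."""
    if not node.children:                           # leaf page: look for the key here
        for k, v in zip(node.keys, node.values):
            if k == key:
                return v
        return None
    # Internal page: pick the child whose key range contains `key`.
    i = 0
    while i < len(node.keys) and key >= node.keys[i]:
        i += 1
    return btree_get(node.children[i], key)

# A two-level example: keys < "m" in the left leaf, keys >= "m" in the right leaf.
# root = Node(keys=["m"], children=[
#     Node(keys=["apple"], values=["1"]),
#     Node(keys=["melon", "pear"], values=["2", "3"]),
# ])
# btree_get(root, "pear")  ->  "3"
```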
Conclusion: Choosing the Right Storage Engine
Databases leverage different storage architectures based on their needs: log-structured engines (SSTables and LSM Trees) generally favor high write throughput, while B-trees favor fast reads and predictable in-place updates.
Each approach has trade-offs, influencing performance in databases from relational (MySQL, PostgreSQL) to NoSQL (Cassandra, Elasticsearch). Choosing the right engine depends on workload patterns, balancing write speed, storage efficiency, and query performance.
Director of Data & Analytics | Professor | PhD | Book Author | Open-Source Enthusiast
3 周The way you played out the info is super insightful and digestable for such a deep dive topic in data engineering Grifith Mathew Just 2 additional cents on the topic, which for me is what makes it very difficult for us to sometimes be fully aware of what's going on in a system internally. And then, some of the literature also gets outdated quite fast as the system evolves. Making it even more challenging to keep up. Some of the NoSQL databases actually have a very complex architecture using hybrid and customized approaches to tackle different problems inside the system. For example, to the best of my knowledge, Google Big table, the one inspiring HBase, on top of the lsm tree index approach for data, used an analogy to a b+tree, to store tablet location information: https://research.google.com/archive/bigtable.html This complex hybrid and analogous approach, of course sometimes is what makes systems so robust and performant, also makes it hard for practitioners to clearly label a system as using one approach over the other... Sometimes they are just hybrids ?? But that does not take anything from the article, as hbase is lsm tree based, just wanted to provide this sometimes hybrid approach that systems take. Great stuff!
Big Data Engineer @ Adidas | Databricks | AWS | PySpark (4 weeks ago): Nice info!
Data Engineer | Digital Data Analytics (1 month ago): Very informative.