Optimizing Database Storage: From Hash Indexes to B-Trees

Introduction

Databases perform two critical tasks: storing data and retrieving it efficiently. The way data is structured on disk directly impacts performance. This newsletter explores different storage architectures used in modern databases to optimize these processes.

The Simplest Database (Key-Value Store):

At its core, a simple database can be implemented as a key-value store using two basic bash functions: one that appends a `key,value` line to a file, and one that scans the file and returns the last value written for a given key.
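As a sketch, here is a minimal Python analogue of those bash functions (the file name `database.log` is illustrative): `db_set` appends to a log, and `db_get` scans the whole file and returns the last value written.

```python
import os

DB_FILE = "database.log"  # hypothetical file name

def db_set(key, value):
    # Append-only write: O(1); earlier entries are never overwritten.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Full scan: O(n). The LAST match wins, since later writes supersede earlier ones.
    if not os.path.exists(DB_FILE):
        return None
    result = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result
```

This makes the asymmetry concrete: writes just append, while every read pays for a scan of the entire file.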


Problem Statement:

  • Set operations are fast, because writes simply append to a log file.
  • Get operations are slow, especially with a large number of records, because every lookup scans the whole file, resulting in O(n) complexity.

The Need for Efficient Data Retrieval:

The challenge lies in efficiently locating specific values within a vast dataset. The solution involves using data structures called indexes, which act as signposts to quickly locate data.

Problem: Write overhead. Indexes speed up reads, but every index must also be updated on each write, so this overhead needs to be carefully managed.

Different Types of Indexes:

Several indexing techniques have evolved to address these challenges:

  • Hash Indexes
  • SSTables and LSM-Trees
  • B-Trees

Hash Index

Hash indexes are suitable for key-value stores. They function similarly to dictionaries, maintaining a hash map in memory. Each key is mapped to a byte offset within the data file. When a write operation occurs, both the file and the hash map are updated.
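A minimal Python sketch of a hash-indexed log (the class name, file name, and the `key,value` record format are all illustrative simplifications of what a real engine like Bitcask does):

```python
import os

class HashIndexedLog:
    """Append-only log plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # key -> byte offset of the latest record
        open(self.path, "ab").close()   # ensure the log file exists

    def set(self, key, value):
        offset = os.path.getsize(self.path)   # the new record starts at end of file
        with open(self.path, "ab") as f:
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset              # point the index at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one seek + one read, no full scan
            line = f.readline().decode().rstrip("\n")
        return line.partition(",")[2]
```

Reads become a single seek plus one record read, at the cost of keeping one hash-map entry per key in memory.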


Implementation:

  • Bitcask (the storage engine in Riak)

Problems:

  • All keys must fit in memory.
  • Maintaining the hash map on disk instead would lead to random-access I/O, and hash maps make range queries inefficient.
  • What if the data file grows too big to fit on a single disk?

Solution:

Break the file into segments: when the active segment reaches a certain size, close it and start writing to a new one. Compaction then runs over the closed segments in a background thread, eliminating duplicates and retaining only the most recent update for each key. After compaction completes, read requests are switched to the compacted segments in a rolling-update fashion, and the old files can be removed if historical states aren't required.
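The core of compaction can be sketched in a few lines of Python (representing segments as lists of `(key, value)` pairs is a simplification of the on-disk format):

```python
def compact(segments):
    """Merge segments, ordered oldest to newest, keeping only the most
    recent value for each key."""
    latest = {}
    for segment in segments:        # iterate oldest -> newest
        for key, value in segment:
            latest[key] = value     # newer segments overwrite older entries
    return list(latest.items())     # the new, duplicate-free segment
```

Because each key ends up stored once, the compacted output is typically much smaller than the sum of its inputs.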


Each segment has its own in-memory hash table. During a key search, the most recent segment's hash map is checked first, then the next-most-recent, and so on.

Lots of detail goes into making this simple idea work in practice. Briefly, some of the issues that are important in a real implementation are:

  • File Formats: CSV is not the best format for a log. It's faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string.
  • Deleting Records: Special deletion records (tombstones) mark deleted keys so they can be discarded during compaction.
  • Crash Recovery: Rebuilding each segment's hash map by scanning the segment from scratch can be time-consuming. Periodically storing a snapshot of each segment's hash map to disk speeds up loading it back into memory after a crash.
  • Partially Written Records: Include a checksum with each record so corrupted (e.g. half-written) records can be detected and ignored.
  • Concurrency Control: The append-only/immutable design means writes are strictly sequential, which is faster than random writes and allows multiple concurrent read threads.
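The file-format and checksum points can be illustrated together with a hypothetical binary record layout: length prefixes for key and value, guarded by a CRC-32 so a partially written record is detected and skipped.

```python
import struct
import zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    # Layout: [crc32][key_len][val_len][key][value], lengths as 4-byte big-endian ints.
    body = struct.pack(">II", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body

def decode_record(buf: bytes):
    crc, = struct.unpack_from(">I", buf, 0)
    body = buf[4:]
    if zlib.crc32(body) != crc:
        return None                  # corrupted (e.g. partially written) record
    key_len, val_len = struct.unpack_from(">II", body, 0)
    key = body[8:8 + key_len]
    value = body[8 + key_len:8 + key_len + val_len]
    return key, value
```

Length prefixes avoid the escaping problems of CSV, and the checksum makes torn writes cheap to detect on recovery.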

SSTables and LSM Trees:

When the hash index no longer fits in memory, a solution is to use Sorted String Tables (SSTables) and Log-Structured Merge-Trees (LSM-trees): data on disk is kept sorted by key, and only a sparse index is stored in memory.


  • When multiple segments contain the same key, the value from the most recent segment is kept and values in older segments are discarded. Data within each segment is in sorted order.
  • Because the table on disk is sorted, it is not necessary to keep every key in memory: the in-memory index can be sparse, with one entry every few kilobytes of segment file.
  • For example, in the figure above, to find the key "handiwork" when you know the offsets for the keys "handbag" and "handsome", you can jump to the offset for "handbag" and scan from there; since the file is sorted, "handiwork" must lie between the two.
  • Key-value pairs in a requested range can be grouped into a block and compressed before writing to disk. Each entry of the sparse in-memory index then points at the start of a compressed block.
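A sparse-index lookup can be sketched with Python's `bisect` module (the index entries and byte offsets below are hypothetical):

```python
import bisect

# Hypothetical sparse index: one entry per few kilobytes of the sorted segment.
sparse_index = [("handbag", 0), ("handsome", 4096), ("hangout", 8192)]
keys = [k for k, _ in sparse_index]

def scan_start(key):
    """Return the byte offset from which to scan for `key`:
    the offset of the greatest indexed key <= key."""
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return 0          # key sorts before everything indexed; scan from the start
    return sparse_index[i][1]
```

For "handiwork", this returns the offset of "handbag", exactly the jump-then-scan behaviour described above.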

How to maintain the segment in sorted order while writing?

  • Maintaining a sorted structure in memory is much easier than on disk: use a balanced tree data structure such as a red-black tree or an AVL tree.

The write path works as follows:

  • Add it to an in-memory balanced tree data structure (memtable).
  • When the memtable gets bigger than some threshold, write it out to disk as a new SSTable segment file.
  • While writing to disk, new writes can write to a new memtable instance.
  • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
  • Run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values periodically.
  • If the database crashes, the most recent writes in the memtable that haven't been flushed are lost. To avoid data loss, maintain an append-only log on disk (unsorted, in write order) as a backup from which the memtable can be restored.
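The steps above can be sketched as a miniature LSM engine (all names and the tiny flush threshold are illustrative; real memtables use balanced trees and megabyte-scale thresholds, and the WAL lives on disk rather than in a list):

```python
MEMTABLE_LIMIT = 4               # tiny threshold for illustration only

class MiniLSM:
    def __init__(self):
        self.memtable = {}       # stand-in for a balanced tree (red-black/AVL)
        self.segments = []       # flushed SSTables, newest last
        self.wal = []            # stand-in for the append-only recovery log

    def set(self, key, value):
        self.wal.append((key, value))   # 1. log first, for crash recovery
        self.memtable[key] = value      # 2. then update the memtable
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted SSTable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                   # flushed writes no longer need the log

    def get(self, key):
        if key in self.memtable:                    # memtable first...
            return self.memtable[key]
        for segment in reversed(self.segments):     # ...then newest segment backwards
            for k, v in segment:
                if k == key:
                    return v
                if k > key:                         # sorted: we can stop early
                    break
        return None
```

Reads check the memtable, then segments from newest to oldest, mirroring the read path described above.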

Making an LSM-tree out of SSTables

Modern key-value storage engines like LevelDB and RocksDB (used in Spark Streaming for stateful execution) follow principles from Google's Bigtable. The same SSTable and memtable structures also underpin Cassandra and HBase.

Similarly, Lucene (powering Elasticsearch and Solr) employs a comparable approach for term dictionary storage.

Key Challenge: Searching for non-existent keys can degrade performance due to unnecessary scans.

Solution: Bloom filters optimize lookups by approximating set membership: a negative answer means the key is definitely absent, so the segment scans can be skipped entirely.
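A toy Bloom filter sketch (the bit-array size and hash construction are illustrative; production engines use tuned sizes and faster non-cryptographic hashes):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, small false-positive rate."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                    # a Python int doubles as a bit array

    def _positions(self, key):
        # Derive k positions by hashing the key with k different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False -> definitely absent (skip the disk scan); True -> probably present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

Before touching any SSTable, the engine consults the filter; only keys that "might" exist pay the cost of a disk lookup.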

Compaction Strategies:

  • Size-tiered (HBase, Cassandra): Merges smaller SSTables into larger ones over time.
  • Leveled (LevelDB, RocksDB, Cassandra): Distributes data across levels, optimizing reads and writes.

B-Trees

The log-structured indexes we have discussed so far are gaining acceptance, but they are not the most common type of index. The most widely used indexing structure is quite different: the B-tree, the standard index implementation in virtually all relational databases (and many non-relational ones).


  • Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key lookups and range queries. But that's where the similarity ends: B-trees have a very different design philosophy.
  • LSM-trees break the database into variable-size segments (several megabytes or more) and always write them sequentially, whereas B-trees break it into fixed-size blocks or pages (traditionally 4 KB), mirroring the block structure of the underlying disks.
  • Each page can be identified using an address or location, which allows one page to refer to another, like a pointer, but on disk instead of in memory. These page references are used to construct a tree of pages.
  • The number of references to child pages in one page of the B-tree is called the branching factor.

B-tree (Updates):

  • Updating a Key: Locate the leaf page, modify the value, and write it back (references remain valid).
  • Adding a Key: Find the page covering the key’s range and insert it.
  • Handling Space Constraints: If the page is full, split it into two and update the parent page.
  • Efficiency: The tree stays balanced, with a depth of O(log n). Most databases fit within 3 to 4 levels, minimizing lookups; a four-level B-tree with 4 KB pages and a branching factor of 500 can store up to 256 TB.
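The quoted capacity figure can be checked with quick back-of-the-envelope arithmetic, using the assumptions stated in the bullet above:

```python
# Capacity estimate for a B-tree with the parameters quoted in the text.
branching = 500              # child references per page
page_kb = 4                  # 4 KB pages
levels = 4

# With the branching factor applied across four levels, roughly
# branching**levels data pages are addressable.
leaf_pages = branching ** levels
capacity_tb = leaf_pages * page_kb / 1024 ** 3   # KB -> MB -> GB -> TB

print(f"{capacity_tb:.0f} TB")   # ~233 TB, the order of the quoted 256 TB
```

The exact result depends on rounding conventions (decimal vs. binary terabytes), but it lands on the same order of magnitude as the 256 TB figure.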

Conclusion: Choosing the Right Storage Engine

Databases leverage different storage architectures based on their needs:

  • Hash Indexes (e.g., Bitcask in Riak) offer fast lookups but require all keys to fit in memory.
  • LSM-Trees & SSTables (e.g., LevelDB, RocksDB, Cassandra, HBase) optimize writes and compactions for high-throughput workloads.
  • B-Trees (e.g., MySQL, PostgreSQL, SQLite) ensure balanced structures for efficient reads and updates.

Each approach has trade-offs, influencing performance in databases from relational (MySQL, PostgreSQL) to NoSQL (Cassandra, Elasticsearch). Choosing the right engine depends on workload patterns, balancing speed, storage efficiency, and query performance.


Carlos Costa

Director of Data & Analytics | Professor | PhD | Book Author | Open-Source Enthusiast

3 weeks

The way you laid out the info is super insightful and digestible for such a deep-dive topic in data engineering, Grifith Mathew. Just 2 additional cents on the topic, which for me is what sometimes makes it very difficult for us to be fully aware of what's going on in a system internally. Some of the literature also gets outdated quite fast as the systems evolve, making it even more challenging to keep up. Some of the NoSQL databases actually have a very complex architecture, using hybrid and customized approaches to tackle different problems inside the system. For example, to the best of my knowledge, Google Bigtable, the one inspiring HBase, on top of the LSM-tree index approach for data, used an analogy to a B+tree to store tablet location information: https://research.google.com/archive/bigtable.html This complex hybrid and analogous approach is of course sometimes what makes systems so robust and performant, but it also makes it hard for practitioners to clearly label a system as using one approach over the other... sometimes they are just hybrids. But that does not take anything from the article, as HBase is LSM-tree based; just wanted to point out this sometimes-hybrid approach that systems take. Great stuff!

Sumanth Munnangi

Big Data Engineer @ Adidas | Databricks | AWS | PySpark

4 weeks

Nice info!

Jithin Babu

Data Engineer | Digital Data Analytics

1 month

Very informative
