RocksDB, an excellent choice for modern SQL Databases (LSM Tree vs. B-Tree)
Franck Pachot
Developer Advocate at MongoDB, AWS Data Hero, PostgreSQL & YugabyteDB, Oracle Certified Master
RocksDB is a high-performance embedded data store that powers many modern databases. It is highly customizable and an ideal storage structure for more complex databases. For instance, YugabyteDB uses RocksDB in its transactional storage layer to store the Distributed SQL table rows, index entries, transaction intents, global transaction tables, and catalog tables in a key-value format after sharding them to tablets: each tablet is a Raft group and each peer (leader or follower) is stored in a customized RocksDB datastore. This contrasts with traditional databases, which store data in Heap and B-tree pages.
Many popular databases utilize RocksDB to achieve high performance, such as MariaDB/MySQL (MyRocks), or to scale out, such as TiDB and YugabyteDB. Even those who have abandoned RocksDB did so for other reasons, such as rewriting it in a different language (CockroachDB in Go, Pinecone in Rust), but they acknowledged that RocksDB was the right choice. Others, like Instagram, have replaced Cassandra's LSM-Tree with RocksDB to improve reads. More examples are given in RocksDB Is Eating the Database World.
Write Ahead Logging and In-Memory Buffering
It may seem counterintuitive, but databases that store their data on disk are essentially reading and writing from memory. This is because disks are generally slow, especially when it comes to random access. Mechanical hard disk drives (HDD) are slow for random reads and writes due to the need for the head to move and wait for rotation. On the other hand, flash-based solid-state drives (SSD) don't have this issue and can provide fast random reads. However, random writes on SSDs require copying full pages and are not very efficient.
Traditional databases were originally designed to work with hard disk drives (HDD), which have slow random read and write speeds. These databases typically run on a single server where all connections share the same memory. To address the slow read and write latency, they read and write through a shared buffer pool in the server's random-access memory (RAM).
The data that has been altered in the in-memory buffer pool is periodically synchronized with the disk, known as a checkpoint, reordering and grouping the writes to issue fewer random disk I/O calls. Performing less frequent checkpoints can enhance the performance of write operations, as the pages to write are rearranged to minimize the impact of random writes, but only up to a reasonable limit: this approach increases memory usage and the time required to recover from a server failure, because more Write-Ahead Logging (WAL) must be applied to restore the in-memory changes lost during the failure.
Before making changes to data in memory, databases use Write-Ahead Logging (WAL), also known as a transaction log or redo log, to sequentially describe the changes. This enables the changes to be re-applied for recovery, for example when changes made in volatile memory are lost. The Write-Ahead Log (WAL) is buffered in memory but frequently written to disk or remote nodes to ensure that changes are saved when committed. It is advantageous because it is written sequentially, which is faster, especially with older storage that uses a dedicated HDD for the WAL. However, with SSD, things are different. Transactions are short and their log records rarely align with the SSD's 4 KB page size. This results in higher latency when writing partial pages, or wasted space when writing full pages. With modern databases, the need for a single stream of sequential WAL covering the entire database instance can be reconsidered.
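As a rough sketch of the write-ahead rule described above (an illustration only, not the code of any real engine), the change record is appended to a sequential log that is forced to disk on COMMIT, while the modified page stays in memory until a later checkpoint:

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>

// Minimal illustration of the write-ahead rule (not a real storage engine):
// the sequential log reaches durable storage before the transaction is reported
// committed, while dirty in-memory pages are only written later, at checkpoint time.
class TinyEngine {
 public:
  explicit TinyEngine(const char* wal_path) : wal_(std::fopen(wal_path, "a")) {}
  ~TinyEngine() { if (wal_) std::fclose(wal_); }

  void Write(const std::string& key, const std::string& value) {
    // 1. describe the change sequentially in the WAL
    std::fprintf(wal_, "PUT %s=%s\n", key.c_str(), value.c_str());
    // 2. apply the change to the in-memory buffer pool (no random disk write here)
    buffer_pool_[key] = value;
  }

  void Commit() {
    // 3. force the log to disk on COMMIT (a real engine would also fsync the descriptor)
    std::fflush(wal_);
  }

  void Checkpoint() {
    // 4. much later, dirty pages are reordered and written back to the data files
  }

 private:
  std::FILE* wal_;
  std::unordered_map<std::string, std::string> buffer_pool_;
};
```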
Physical and Logical Write Operations
Traditional databases used to write their changes to a single instance of the database. They transformed the transaction intention into physical changes to heap tables and B-tree pages to be applied in the shared buffer pool by the session's process. The Write-Ahead Log (WAL) was generated at this level, recording physical change records, identifying the physical address in the data files. When there was a need to replicate to other databases for disaster recovery purposes, it was common to ship the physical change records (log streaming to physical standby). However, this was limited to creating a bit-to-bit replica of the primary database: no subset of tables, no different indexing, and no different database version or host platform to allow rolling upgrades. To address these needs, the physical change records are reverse-engineered to rebuild logical change records (LCRs) that could be applied to a different database. These logical change records identify rows by their primary key rather than their physical address.
It used to make sense for legacy databases to create replication as an additional layer on top of the existing code. However, modern databases, especially those designed for cloud-native environments, come with built-in logical change records and replication. YugabyteDB features a SQL layer based on PostgreSQL that converts complex SQL transactions into key-value read and write operations. In this format, the key represents the primary key for the tables and the indexed columns for the indexes. You can find a detailed description of this format in the following article:
The SQL processing layer communicates with the transaction and storage layer using a key-value API that is independent of the physical location. Several important tasks are included at this level, such as change data capture for event sourcing, sharding and distribution for horizontal scaling, and replication for high availability.
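As a rough illustration of that key-value format (the encoding below is invented for readability and is not YugabyteDB's actual DocDB encoding), a single SQL row produces one entry keyed by the primary key, plus one entry per secondary index keyed by the indexed columns:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical encoding, for illustration only.
// CREATE TABLE users (id text PRIMARY KEY, email text, country text);
// CREATE INDEX users_by_country ON users (country);
std::vector<std::pair<std::string, std::string>> EncodeRow(
    const std::string& id, const std::string& email, const std::string& country) {
  return {
      // table entry: key = primary key, value = the non-key columns
      {"users/" + id, "email=" + email + ",country=" + country},
      // secondary index entry: key = indexed column + primary key, empty value
      {"users_by_country/" + country + "/" + id, ""},
  };
}
```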
The storage and transaction layer operates on a log-based system using these key-value logical change records. This includes consistent replication as a Raft log and storage as a Log-Structured Merge Tree, which is where RocksDB is relevant.
The Raft Log is the Write-Ahead Log
LSM Tree and its RocksDB implementation is the topic of this article. However, it was important to set out exactly what is stored in it to understand how the RocksDB implementation is relevant. The use of Raft consensus for replication does not necessarily involve the use of an LSM Tree to apply the log in order to persist the database state. However, it is worth mentioning for two main reasons. First, when logical change records are already available, it is convenient to use log-structured storage to persist them quickly and apply them by merging on their key. Second, the Raft log serves as a Write-Ahead Log, and the LSM Tree does not require an additional WAL to protect its in-memory structure. One of the customizations made to RocksDB in YugabyteDB is the elimination of the need for the RocksDB WAL.
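In stock RocksDB, skipping its internal WAL is a per-write option. A minimal sketch, assuming the caller already has a higher-level log (such as a Raft log) protecting the write; the database path and keys are placeholders:

```cpp
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/raft_protected_db", &db);
  assert(s.ok());

  // The write is only appended to the MemTable; durability is delegated to the
  // higher-level log (the Raft log in YugabyteDB's case).
  rocksdb::WriteOptions write_options;
  write_options.disableWAL = true;
  s = db->Put(write_options, "key", "value");
  assert(s.ok());

  delete db;
}
```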
Log-Structured Merge Tree
In YugabyteDB's tablet Raft group, whether a peer is a leader or a follower, the key-value logical change records have to be applied to the persistent state. These records must be stored in a way that allows for easy querying by value, typically by using a point or a range in the index. The Write-Ahead Log (WAL) is ordered by time and must be applied to the organized database storage, primarily ordered by its key. However, this can result in a large number of random writes, which are not very efficient on disks. There are two techniques to reduce the need for random writes: B-Tree and LSM Tree.
B-Tree is commonly used in block-based databases, where accessing a value involves traversing from a root block, through a small number of branches, to the leaves. However, applying changes is expensive, as it requires finding the leaf block, updating it, splitting it when it's full, and updating the branch above. Additionally, updating a single byte in a block necessitates writing the whole block during a checkpoint. Nonetheless, this structure is efficient for reading because only a small and predictable number of blocks have to be read to find the point or range. The root and branches have a good chance of being cache hits in the shared buffer pool. The leaf split ensures that the tree remains balanced, maintaining the same branch level for all values. This structure is the best for reducing random reads, which is crucial for mechanical disks (HDD).
LSM Tree works differently. There's no need to read before writing. The changes from the WAL logical change records are appended to a large buffer in memory, known as the MemTable (or MemStore), and ordered in this buffer with pointers (like a SkipList) so that a key can be easily found. Writing is much more efficient in an LSM Tree compared to a B-Tree because it occurs only in memory and doesn't require reading a block from disk and updating it. In an LSM tree, there is no read involved in writing, even though in practice, SQL databases often have to read beforehand to detect a key violation or a concurrent transaction's lock intent. Since the entire database cannot fit in MemTables (that would require unbounded RAM, and a WAL to protect it), the MemTable content must eventually be written to database files.
In contrast to traditional databases where checkpoints involve writing random blocks to update database files, the LSM Tree flushes the entire MemTable to a single file. This approach offers two main advantages in terms of performance and reliability. On the performance side, it involves a single large sequential write, eliminating the need for slow random writes. From a reliability perspective, the files are never updated, reducing the risk of block corruption. Before being flushed, a new MemTable is allocated to handle ongoing writes, while the old MemTable becomes immutable. This creates a sorted run, ordered by the index key and log timestamp, and is flushed to a Sorted Sequence Table file (SST file), still maintaining the order of the key and timestamp. Once flushed, it is never updated. This makes it easy to create snapshots for point-in-time recovery or clones by hard-linking the file. Once flushed, the Write-Ahead Log (WAL) that was kept to protect the MemTable can be reclaimed.
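With stock RocksDB, the same lifecycle can be observed directly: writes land in the MemTable (sized by write_buffer_size) and an explicit or automatic flush turns the immutable MemTable into an SST file. A minimal sketch with placeholder paths and sizes:

```cpp
#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 64 << 20;  // MemTable size: a flush is triggered when it fills

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/lsm_demo", &db);
  assert(s.ok());

  // Appends go to the MemTable (an in-memory skip list); no data block is read or updated.
  for (int i = 0; i < 1000; i++) {
    s = db->Put(rocksdb::WriteOptions(),
                "key-" + std::to_string(i), "value-" + std::to_string(i));
    assert(s.ok());
  }

  // Force the MemTable to be written as a new, immutable SST file (one sequential write).
  s = db->Flush(rocksdb::FlushOptions());
  assert(s.ok());

  delete db;
}
```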
Compared to B-Trees, LSM trees offer numerous advantages. They address the issues of write amplification, block updates, and block splitting, as well as the problem of slow random writes. However, one area that needs consideration is reading. Each flush operation adds a new file, leading to two significant consequences: reads may have to check a growing number of SST files to find a key (read amplification), and obsolete intermediate versions accumulate across those files until they are removed (space amplification).
Both issues are resolved through background compaction. The SST files are immutable, but several of them can be read with a sort-merge algorithm and written out as a single, larger file that replaces the files that were compacted. During compaction, intermediate versions beyond the MVCC retention can be removed. Compaction is different from PostgreSQL vacuum, which competes with ongoing SQL activity by using locks. RocksDB compaction is similar to storage defragmentation that operates in the background, using I/O and CPU but never blocking or being delayed by the application activity.
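In stock RocksDB, compaction normally runs in background threads, but it can also be requested explicitly. A minimal sketch (the full-range call below compacts everything, which YugabyteDB schedules per tablet rather than for a whole database):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Merge the existing SST files of `db` into fewer, larger files.
// Passing nullptr for both range boundaries means "the whole key space".
void CompactEverything(rocksdb::DB* db) {
  rocksdb::CompactRangeOptions compact_options;
  rocksdb::Status s = db->CompactRange(compact_options, /*begin=*/nullptr, /*end=*/nullptr);
  // Obsolete versions that are no longer needed are dropped while merging;
  // the old SST files are deleted once the new ones are written.
  (void)s;
}
```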
Enhanced RocksDB in YugabyteDB
RocksDB is an implementation of an LSM Tree, offering high levels of customization to adapt its usage. Deciding whether RocksDB is suitable for SQL databases involves taking into account all the modifications made to implement it in the database. One important feature of RocksDB in YugabyteDB is the sharding of tables and indexes into small tablets, typically 10GB to 100GB, based on their key. These tablets are automatically split as the table grows. To achieve scalability, all storage-related activity occurs at the tablet level, where each tablet represents a Raft group and has its own LSM Tree.
This addresses a major issue with RocksDB, as it may not be well-suited for very large LSM Trees. For example, during full compaction, the required space may double because the current SST files must be kept until the new SST files are written. While a database doubling in size would be unacceptable bloat, with compactions occurring at the tablet level, only small portions of the database temporarily experience this increase. RocksDB has leveled compaction to address this issue, but it is not necessary for YugabyteDB, which stays at level 0 with size-tiered compaction, thanks to the controlled size of tablets. Keeping the size of tablets from growing excessively large by sharding and auto-splitting is also crucial for scalability. They act as replication units. When new nodes are added, YugabyteDB moves tablet peers to rebalance transparently using Raft, and this is done in small increments so that new nodes can immediately start relieving some of the activity from the busy nodes.
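In stock RocksDB terms, staying at a single level with size-tiered merging corresponds roughly to the universal compaction style. A hedged configuration sketch (the trigger values are arbitrary examples, not YugabyteDB's settings):

```cpp
#include <rocksdb/options.h>

rocksdb::Options TabletOptions() {
  rocksdb::Options options;
  // Size-tiered ("universal") compaction keeps all SST files at one level and merges
  // them by size, which is acceptable when each RocksDB instance is a small tablet.
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  options.num_levels = 1;
  // Example trigger only: start a compaction once this many SST files have accumulated.
  options.level0_file_num_compaction_trigger = 5;
  return options;
}
```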
In SQL databases, one specific aspect is that they have to accept transactional writes before knowing their visibility timestamp, which is only known after the COMMIT. Traditional databases need to revisit the blocks after the transaction is committed, for example, with actions like VACUUM in PostgreSQL or delayed block cleanout in Oracle. YugabyteDB addresses this issue differently by using two LSM Trees per tablet: IntentsDB and RegularDB. IntentsDB holds the provisional records, with their changes and lock information. These become visible to other transactions once the transaction status is set to committed in the transaction table, which is also a distributed and replicated tablet.
In the background, the committed intents are written to RegularDB with their definitive commit timestamp, so that reading from it doesn't require visiting the transaction table. These writes do not need to be protected by WAL because they are still covered by the higher-level WAL of logical change records Raft log. YugabyteDB ensures that the RegularDB MemTable is flushed before IntentsDB is flushed within each tablet so that WAL can be reclaimed safely. Typically, there's a small IntentsDB in each tablet, which is mostly in memory except for very long transactions, and the definitive RegularDB, which is optimized for long-term use. IntentsDB and RegularDB are compacted independently when the number of SST files increases. IntentsDB gets its provisional records removed when applied to RegularDB. Intermediate versions are removed when RegularDB SST files are compacted after the maximum MVCC retention. This design combines the advantages of other MVCC implementations, such as fast rollback (like PostgreSQL), no VACUUM (like Oracle), and additionally keeping row versions organized by key for fast queries in all isolation levels.
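A conceptual sketch of this two-LSM-Tree flow (the helper and the key layout are invented for illustration and are not YugabyteDB code), where both instances skip the RocksDB WAL because the Raft log already protects them:

```cpp
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Conceptual sketch only: simplified key encoding, no error handling.
void WriteIntentThenApply(rocksdb::DB* intents_db, rocksdb::DB* regular_db,
                          const std::string& key, const std::string& value,
                          const std::string& commit_time) {
  rocksdb::WriteOptions no_wal;
  no_wal.disableWAL = true;  // the Raft log of logical change records is the WAL

  // 1. The transaction writes a provisional record (intent) with its lock information.
  intents_db->Put(no_wal, key + "/intent", value);

  // 2. After COMMIT, a background step rewrites it into RegularDB with the commit
  //    timestamp, so readers no longer need to check the transaction status table.
  regular_db->Put(no_wal, key + "/" + commit_time, value);

  // 3. The applied intent can then be removed from IntentsDB.
  intents_db->Delete(no_wal, key + "/intent");
}
```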
When data is being read from both IntentsDB and RegularDB, a custom RocksDB iterator is used to merge the data from these LSM Trees. There are several optimizations carried out to minimize the number of files that need to be read. In SQL databases, there are generally more read operations than write operations, so it is practical to include some additional metadata when writing to speed up the read operations. For instance, maintaining a bloom filter for each SST file helps skip reading from files that do not contain the queried key. In the case of YugabyteDB, this has been extended to include prefixes of the key, especially useful for composite keys and partial predicates on them. Another example involves storing the minimum and maximum values for each column, similar to a storage index or zone map, to skip reading files that are outside the queried range. As each RocksDB belongs to one table or one index, this metadata is optimized for each key.
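The stock RocksDB counterparts of these read-side optimizations are the per-SST-file bloom filter and the prefix extractor. A configuration sketch (the prefix length and bits per key are arbitrary example values):

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

rocksdb::Options ReadOptimizedOptions() {
  rocksdb::Options options;

  rocksdb::BlockBasedTableOptions table_options;
  // Bloom filter stored in each SST file: lets reads skip files that cannot contain the key.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

  // Prefix extractor: builds filters on a key prefix, useful for composite keys queried
  // by their leading columns only (8 bytes is an arbitrary example).
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

  return options;
}
```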
Similar to traditional databases, the block cache is optimized to ensure that full table scans do not significantly wipe out the cache. Additionally, the RocksDB block cache in YugabyteDB is global rather than per-RocksDB, and shared across all tablets. While MemTables are specific to each tablet, they also adhere to a global limit, allowing optimal use of the node's resources and maintaining individual LSM trees for each tablet.
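In stock RocksDB, these global limits map to a block cache and a write buffer manager shared by several database instances. A sketch with placeholder sizes:

```cpp
#include <memory>
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>
#include <rocksdb/write_buffer_manager.h>

// One cache and one write buffer manager can be shared by many RocksDB instances
// (one per tablet), bounding the node's total memory usage. Sizes are examples only.
rocksdb::Options SharedMemoryOptions(
    const std::shared_ptr<rocksdb::Cache>& shared_block_cache,
    const std::shared_ptr<rocksdb::WriteBufferManager>& shared_write_buffer_manager) {
  rocksdb::Options options;

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = shared_block_cache;  // shared across all tablets
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

  options.write_buffer_manager = shared_write_buffer_manager;  // global MemTable limit
  return options;
}

// Usage: create the shared objects once per node.
//   auto cache = rocksdb::NewLRUCache(1ULL << 30);                           // 1 GB
//   auto wbm = std::make_shared<rocksdb::WriteBufferManager>(512ULL << 20);  // 512 MB
//   rocksdb::Options opts = SharedMemoryOptions(cache, wbm);
// For large scans, setting ReadOptions::fill_cache = false avoids wiping the shared cache.
```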
Here is the documentation about those performance enhancements:
RocksDB and YugabyteDB have different ways of handling keys. In RocksDB, keys are stored with a sequence, while in YugabyteDB, versions are sequenced with the Hybrid Logical Time timestamp, the same timestamp used for Multi-Version Concurrency Control. With MVCC, point queries may read a range to find the correct version based on the transaction read time. Keys often contain multiple columns, such as a composite primary key for the table or indexed columns for indexes, including the encoded primary key when it is a non-unique secondary index. To deal with large keys, Yugabyte has implemented advanced delta encoding to significantly reduce the key size without adding overhead. YugabyteDB is open source and the architecture design documentation is public.
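A conceptual sketch of how such MVCC reads work on an ordered key space (the encoding below is invented for illustration; YugabyteDB's actual DocDB encoding and delta encoding are more elaborate): each version appends a timestamp to the key, newer versions sort first, and a point read at a given read time becomes a short seek.

```cpp
#include <cstdint>
#include <string>

// Invented encoding for illustration only: append the bitwise-inverted hybrid time so
// that, for the same user key, newer versions sort before older ones.
std::string VersionedKey(const std::string& user_key, uint64_t hybrid_time) {
  std::string key = user_key;
  key.push_back('\0');                 // separator between user key and version
  uint64_t inverted = ~hybrid_time;    // larger (newer) times become smaller bytes
  for (int shift = 56; shift >= 0; shift -= 8) {
    key.push_back(static_cast<char>((inverted >> shift) & 0xff));
  }
  return key;
}

// A point read at `read_time` then becomes: Seek(VersionedKey(k, read_time)) on the LSM
// iterator and take the first entry that still has `k` as its user key; that entry is
// the newest version whose timestamp is <= read_time.
```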
Conclusion
The B-Tree index remains relevant in monolithic databases because it compensates for write amplification with a large buffer cache and maintains its efficient read performance on SSDs. It's important to note that even with a very large buffer cache, extra attention needs to be paid by the application when dealing with values that are scattered across the datatype range, such as random UUIDs, as random I/O is unavoidable. The efficiency of the B-Tree index also depends on the implementation of MVCC. In some databases, regular index rebuilding is necessary to reclaim space, while in others, it is not required for the most common access patterns. However, the scalability of a large shared buffer pool is limited because RAM cannot be shared by multiple instances in a cloud network. Distributed cloud-native databases often use the LSM Tree because it allows for efficient writes, even with a small MemTable, as it transforms random writes into sequential flushes, and reading from multiple SST files is acceptable with the quick random read capabilities of SSDs.
RocksDB is a strong, high-performance implementation of LSM Tree used in several databases. However, YugabyteDB does not use RocksDB in its original form. The codebase has been extensively modified to optimize performance for Distributed SQL. These modifications, along with sharding at the level of key-value logical change records, enable YugabyteDB to scale out with elasticity and high availability.