Series: Introduction to Columnar Databases

Series: Introduction to Columnar Databases

1. Columnar Databases:

- A?columnar database?organizes data in a column-wise format rather than the traditional row-wise format used by row-based databases.

- Key points about columnar databases:

- Data is logically organized as a table with rows and columns but physically stored in a column-wise format.

- Since columns typically contain similar data types, values, or properties, compression algorithms can achieve higher compression ratios compared to row-oriented databases.

- Benefits for Analytics:

- Query Performance: Columnar databases excel in analytical queries due to efficient column-wise processing, resulting in up to 10x query performance gains over row-oriented storage. Analytical queries often involve aggregating or analyzing a subset of columns rather than entire rows.

- Data Compression: Achieve up to 10x data compression over uncompressed data size.

2. Log-Structured Merge (LSM) Tree:

- Columnar databases use LSM Tree, which allows for fast writes using a write-ahead log.

- Write-Ahead Log (WAL):

- When a client writes data, the system:

1. Writes a log entry to disk (WAL).

2. Returns to the client.

- Benefits:

- Recovery: If an exception occurs (e.g., power loss), data can be recovered from the log.

- Background Compaction:

- Compaction reduces the number of files and eliminates garbage from updates and deletes.

- Happens independently from reads and writes.

- LSM Tree Efficiency:

- Parallel writes during compaction.

- Automatic replays.

- Efficient for high-throughput scenarios (e.g., event processing) due to background compaction.

3. Memtable:

- A?memtable?is an in-memory data structure used by columnar databases (such as?Cassandra?and?ScyllaDB).

- Purpose:

- Temporarily stores recent write operations (inserts, updates, and deletes) before they are committed to disk.

- Acts as a write buffer, allowing for quick write operations.

- Ensures eventual persistence to disk in an orderly manner.

- Process:

- When data is written, it is first stored in the memtable.

- Once the memtable is full or certain conditions are met (e.g., commit log size threshold), it is flushed to disk as an?SSTable?(Sorted String Table).

- After flushing, the memtable is cleared, and a new cycle begins.

- Key Points:

- Memtables are sorted by key (usually the partition key).

- Each column family typically has a separate memtable.

- The data in the memtable is eventually persisted to disk as SSTables for durability.

4. SSTable (Sorted String Table):

- An?SSTable?is a?persistent file format?used by databases like?ScyllaDB?and?Cassandra.

- Purpose:

- SSTables serve as the building blocks of the total data stored in the database.

- They are?immutable, meaning updates create new SSTable files instead of modifying existing ones.

- Structure:

- For each SSTable, the database creates an?index file?and a?data file.

- The?index file?helps locate data quickly within the sorted data file.

- Key Points:

- SSTables are efficient for read-heavy workloads due to their ordered and immutable nature.

- They play a crucial role in ensuring data durability and efficient data retrieval in distributed databases.

要查看或添加评论,请登录

Anuj Yadav的更多文章

社区洞察