Demystifying Columnar Storage Model

Data compression is a crucial part of data storage and processing. The columnar storage model has gained popularity due to its efficient data compression techniques, especially in databases and SQL operations. In this article, we will discuss the advantages of the columnar storage model in data compression.

What is the Columnar Storage Model?

In traditional storage models, data is stored in rows, with each row containing all the fields or columns related to that row. On the other hand, in the columnar storage model, data is stored in columns, with each column containing all the values of that field across all rows.

For example, consider a table containing customer information, with columns such as customer name, age, address, and phone number. In a row-based storage model, the data for each customer would be stored in a row, with all the columns related to that customer in that row. In a columnar storage model, the customer name, age, address, and phone number would each be stored in separate columns, with all the values for each column stored together.
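
To make the two layouts concrete, here is a minimal Go sketch of the customer example above (the type and field names are illustrative, not taken from any particular database):

package main

import "fmt"

// Row-oriented: one struct per customer, records stored one after another.
type CustomerRow struct {
    Name    string
    Age     int
    Address string
    Phone   string
}

// Column-oriented: one slice per field, each holding that field's values
// for every customer, stored contiguously.
type CustomerColumns struct {
    Names     []string
    Ages      []int
    Addresses []string
    Phones    []string
}

func main() {
    rows := []CustomerRow{{Name: "Ada", Age: 36, Address: "1 Main St", Phone: "555-0100"}}
    cols := CustomerColumns{
        Names:     []string{"Ada"},
        Ages:      []int{36},
        Addresses: []string{"1 Main St"},
        Phones:    []string{"555-0100"},
    }
    fmt.Println(len(rows), len(cols.Names)) // same logical data, two layouts
}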

Benefits and Drawbacks of the Columnar Storage Model in Data Processing

Here are the benefits and drawbacks of using a columnar storage model in data processing:

Benefits:

  • Efficient compression: Each column holds values of a single type, so runs of repeated or similar values are common, and these compress well with algorithms that exploit repeating patterns (such as run-length or dictionary encoding). This typically yields a better compression ratio than row-based storage models.
  • Improved query performance: All values for a particular column are stored together, making it easier to perform aggregate functions and queries involving a specific column (see the sketch after this list). This leads to faster query processing and improved performance.
  • Faster I/O performance: Sequential I/O operations are easier to perform because the query engine only needs to access the data in the columns needed for the query, rather than reading through an entire row and filtering out irrelevant columns.
  • Parallel processing: The processing of data can be parallelized more easily since each column is independent of the others. This leads to faster query processing and reduced time required to complete a query.
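
As a quick illustration of the query-performance and I/O points above, here is a minimal Go sketch (the types are simplified versions of the ones used in the full example later in this post). Computing an aggregate touches only the Prices slice; every other column is left unread:

package main

import "fmt"

// Columnar layout: each field lives in its own slice.
type PriceColumns struct {
    IDs    []int
    Prices []float64
}

// averagePrice scans only the Prices column; the IDs column
// (and any other columns) never need to be read.
func averagePrice(c PriceColumns) float64 {
    var sum float64
    for _, p := range c.Prices {
        sum += p
    }
    return sum / float64(len(c.Prices))
}

func main() {
    c := PriceColumns{
        IDs:    []int{1, 2, 3},
        Prices: []float64{10, 20, 30},
    }
    fmt.Println(averagePrice(c)) // 20
}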

Drawbacks:

  • Not suitable for unstructured data: The columnar storage model may not be effective for processing unstructured data such as text or images since there is no well-defined schema to organize the data into columns.
  • Slower updates and inserts: A single logical row is scattered across every column, so adding or removing one row means touching each column's storage (as the column-building loop in the example below illustrates). This makes updates and inserts slower than in a row-based storage model.

Ultimately, the choice of storage model depends on the specific requirements of your data processing tasks. While the columnar storage model offers significant advantages in terms of data compression, query performance, and parallel processing, it may not be the ideal solution for every use case. Careful consideration of the data's structure, expected query patterns, and processing requirements can help determine whether the columnar storage model is the best fit for your needs.

Example of Columnar Storage Model in Action

To demonstrate the advantages of the columnar storage model in data compression, I wrote a simple Go program that generates a dataset of 100,000 records, each containing an integer ID, a string Name, a float Price, and a boolean InStock.

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"
    "math/rand"
    "time"
)

type ProductRow struct {
    ID      int
    Name    string
    Price   float64
    InStock bool
}

type ProductColumn struct {
    IDs      []int
    Names    []string
    Prices   []float64
    InStocks []bool
}

func main() {
    // Generate a dataset of 100,000 records
    rand.Seed(time.Now().UnixNano())
    var rows []ProductRow
    for i := 0; i < 100000; i++ {
        row := ProductRow{
            ID:      i,
            Name:    fmt.Sprintf("Product %d", i),
            Price:   rand.Float64() * 100,
            InStock: rand.Intn(2) == 0,
        }
        rows = append(rows, row)
    }

    // Store the dataset as rows
    var bufRows bytes.Buffer
    encRows := gob.NewEncoder(&bufRows)
    encRows.Encode(rows)
    rowsSize := bufRows.Len()

    // Store the dataset as columns
    var columns ProductColumn
    for _, row := range rows {
        columns.IDs = append(columns.IDs, row.ID)
        columns.Names = append(columns.Names, row.Name)
        columns.Prices = append(columns.Prices, row.Price)
        columns.InStocks = append(columns.InStocks, row.InStock)
    }
    var bufColumns bytes.Buffer
    encColumns := gob.NewEncoder(&bufColumns)
    encColumns.Encode(columns)
    columnsSize := bufColumns.Len()

    // Print the size difference between rows and columns
    fmt.Printf("Rows size: %d bytes\n", rowsSize)
    fmt.Printf("Columns size: %d bytes\n", columnsSize)
    fmt.Printf("Compression ratio: %.2f%%\n", 100*float64(columnsSize)/float64(rowsSize))

    // Benchmark the encoding and decoding speed
    var decodedRows []ProductRow
    start := time.Now()
    for i := 0; i < 100; i++ {
        // Encode and decode rows
        var buf bytes.Buffer
        enc := gob.NewEncoder(&buf)
        enc.Encode(rows)
        dec := gob.NewDecoder(&buf)
        dec.Decode(&decodedRows)
    }
    elapsedRows := time.Since(start)

    var decodedColumns ProductColumn
    start = time.Now()
    for i := 0; i < 100; i++ {
        // Encode and decode columns
        var buf bytes.Buffer
        enc := gob.NewEncoder(&buf)
        enc.Encode(columns)
        dec := gob.NewDecoder(&buf)
        dec.Decode(&decodedColumns)
    }
    elapsedColumns := time.Since(start)

    fmt.Printf("Encoding and decoding rows took %s\n", elapsedRows)
    fmt.Printf("Encoding and decoding columns took %s\n", elapsedColumns)
}        

When I run this program, it outputs the following:

Rows size: 3156545 bytes
Columns size: 2755676 bytes
Compression ratio: 87.30%
Encoding and decoding rows took 2.382370118s
Encoding and decoding columns took 1.345502811s        

As we can see, the dataset stored as rows takes 3,156,545 bytes, while the same dataset stored as columns takes only 2,755,676 bytes. The columnar encoding is 87.30% of the size of the row encoding, a reduction of roughly 12.7%, even before any general-purpose compression is applied.

In the current implementation, the data is serialized using the Gob encoding format. While Gob is a convenient format to use, it is not the most efficient serialization format in terms of space and time complexity. Consider using a more efficient serialization format, such as Protocol Buffers or MessagePack, which can reduce the size of the serialized data and improve encoding and decoding speed.
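
For instance, here is a minimal sketch of the columnar struct round-tripped through MessagePack, assuming the third-party github.com/vmihailenco/msgpack/v5 package (one of several MessagePack implementations for Go; the sizes you see will vary with the data):

package main

import (
    "fmt"

    "github.com/vmihailenco/msgpack/v5"
)

type ProductColumn struct {
    IDs      []int
    Names    []string
    Prices   []float64
    InStocks []bool
}

func main() {
    columns := ProductColumn{
        IDs:      []int{1, 2, 3},
        Names:    []string{"Product 1", "Product 2", "Product 3"},
        Prices:   []float64{9.99, 19.99, 29.99},
        InStocks: []bool{true, false, true},
    }

    // Marshal the columnar struct to MessagePack bytes.
    data, err := msgpack.Marshal(columns)
    if err != nil {
        panic(err)
    }
    fmt.Printf("MessagePack size: %d bytes\n", len(data))

    // Unmarshal back to verify the round trip.
    var decoded ProductColumn
    if err := msgpack.Unmarshal(data, &decoded); err != nil {
        panic(err)
    }
    fmt.Println("Decoded", len(decoded.IDs), "records")
}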

Overall, this example demonstrates the advantages of the columnar storage model in terms of data compression and highlights how it can be used to achieve a higher compression ratio than in a row-based storage model.

Optimization Techniques for Column Storage Model

Column storage is a popular technique used in data warehousing and analytics to optimize query performance and reduce storage requirements. To further optimize the column storage model, the following techniques can be used:

  • Preallocate fixed-size arrays or slices (for example, with make and a known length) instead of repeatedly growing variable-length slices, to reduce allocation and memory overhead.
  • Use a struct of arrays (as the ProductColumn type above already does) instead of an array of structs to improve cache locality and reduce memory overhead.
  • Use a more efficient serialization format, such as Protocol Buffers or MessagePack, to reduce the size of the serialized data and improve encoding and decoding speed.
  • Use a compression algorithm to further reduce the size of the serialized data, especially when storing it or transmitting it over a network (see the sketch after this list).
  • Benchmark and optimize the code using profiling tools and parallelization techniques to improve performance.
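
As a minimal sketch of the compression point above, the standard library's compress/gzip can be layered on top of the gob output; a column of similar values, like the Names column in the example, compresses particularly well (exact sizes will vary):

package main

import (
    "bytes"
    "compress/gzip"
    "encoding/gob"
    "fmt"
)

func main() {
    // A column of similar string values, like the Names column above.
    names := make([]string, 100000)
    for i := range names {
        names[i] = fmt.Sprintf("Product %d", i)
    }

    // Gob-encode the column.
    var raw bytes.Buffer
    if err := gob.NewEncoder(&raw).Encode(names); err != nil {
        panic(err)
    }

    // Gzip the gob output.
    var compressed bytes.Buffer
    zw := gzip.NewWriter(&compressed)
    if _, err := zw.Write(raw.Bytes()); err != nil {
        panic(err)
    }
    if err := zw.Close(); err != nil {
        panic(err)
    }

    fmt.Printf("Gob size:      %d bytes\n", raw.Len())
    fmt.Printf("Gob+gzip size: %d bytes\n", compressed.Len())
}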

By implementing these optimization techniques, the column storage model can become more memory-efficient, faster, and better suited for large-scale data analysis.

Conclusion

To summarize, data compression is an essential part of data storage and processing, and the columnar storage model provides significant benefits in this area. Storing data in columns rather than rows enables efficient compression through algorithms that work well on repeating patterns within columns.

The columnar storage model is a storage layout rather than a compression algorithm in itself; it pairs naturally with encoding techniques such as dictionary encoding, run-length encoding, and delta encoding, each of which has its own advantages and trade-offs.
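
As a toy illustration of one of these techniques, here is a minimal run-length encoder for a boolean column such as InStock (illustrative only; real systems use far more sophisticated implementations):

package main

import "fmt"

// run records one run of identical values and its length.
type run struct {
    Value  bool
    Length int
}

// runLengthEncode collapses consecutive identical values in a
// boolean column into (value, length) pairs.
func runLengthEncode(col []bool) []run {
    var runs []run
    for _, v := range col {
        if n := len(runs); n > 0 && runs[n-1].Value == v {
            runs[n-1].Length++
            continue
        }
        runs = append(runs, run{Value: v, Length: 1})
    }
    return runs
}

func main() {
    col := []bool{true, true, true, false, false, true}
    fmt.Println(runLengthEncode(col)) // [{true 3} {false 2} {true 1}]
}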

The choice of compression technique depends on the specific requirements and characteristics of the data being processed. However, the columnar storage model's benefits in data compression make it a compelling option for databases and SQL operations that require high levels of compression and efficient query performance.

Thank you for taking the time to read this blog post. I hope you found the information helpful and informative. If you have any questions or comments, please feel free to leave them below. Your feedback is always appreciated.

Portfolio · GitHub · LinkedIn · Twitter

