Apache Parquet – A Deep Dive into Internal Architecture & Advantages

Apache Parquet is a high-performance, columnar storage format that is widely used in big data analytics and distributed computing frameworks like Apache Spark. It is designed to optimize storage efficiency and query performance while overcoming the limitations of row-based formats like CSV and JSON.


A. Why Was Parquet Created?

Before understanding how Parquet works, let’s discuss why it was developed.

Traditional file formats like CSV and JSON store data row by row. While this works well for transactional workloads, it poses major challenges for analytical queries at scale.

1. Unnecessary I/O Reads

Imagine a dataset with 50 columns and 100 million rows. Now, consider this query:

SELECT name, age FROM customers;
        

Problem in row-based formats:

  • Even though we need only 2 columns, all 50 columns are read into memory.
  • This leads to wasted disk I/O, memory, and CPU cycles.


2. Inefficient Compression

  • Row-based storage mixes different data types (integers, strings, dates) together.
  • This prevents efficient compression, as different types require different compression strategies.
  • Columnar formats (like Parquet), on the other hand, store similar values together, leading to better compression and faster queries.


3. Slow Filtering (No Predicate Pushdown)

Consider this query:

SELECT * FROM customers WHERE age > 30;
        

Problem in row-based formats:

  • The query reads every row, even if only a few rows match the condition.
  • This leads to unnecessary scans, increasing query execution time.

Columnar formats like Parquet solve this issue using Predicate Pushdown: column statistics let the reader skip irrelevant row groups and pages up front, making queries much faster.


4. Schema Evolution is Hard

  • Adding a new column to a CSV file? Older software may fail to process it correctly.
  • Schema changes in row-based formats can break existing pipelines.
  • Parquet supports schema evolution, allowing new columns without breaking older queries.
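
If you want to see this in action, here is a minimal sketch using PyArrow (assumed to be installed; the file names v1.parquet and v2.parquet are invented for the example):

import pyarrow as pa
import pyarrow.parquet as pq

# Version 1 of the data: two columns.
pq.write_table(pa.table({"id": [1, 2], "name": ["Alice", "Bob"]}), "v1.parquet")

# Version 2 adds an "email" column later.
pq.write_table(pa.table({"id": [3], "name": ["Carol"], "email": ["carol@example.com"]}),
               "v2.parquet")

# A reader that only knows the original columns still works on the newer file,
# because the footer carries the schema and columns are read by name.
print(pq.read_table("v2.parquet", columns=["id", "name"]))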


B. Why Use Parquet?

Parquet was designed to overcome these limitations. It provides several advantages:

(1) Columnar Storage → Faster Queries & Less I/O

  • In Parquet, data is stored column-wise, not row-wise.
  • Example dataset:

ID | Name    | Age | Salary
1  | Alice   | 25  | 50000
2  | Bob     | 30  | 60000
3  | Charlie | 35  | 70000
4  | David   | 40  | 80000

  • Stored in Parquet format:

ID: [1, 2, 3, 4]
Name: ["Alice", "Bob", "Charlie", "David"]
Age: [25, 30, 35, 40]
Salary: [50000, 60000, 70000, 80000]
        

  • Now, if a query needs only Age & Salary, it skips "ID" and "Name".
  • This reduces disk I/O and memory usage significantly.

Real-World Example

A banking application storing millions of customer records in Parquet.

  • A query like: SELECT customer_name, account_balance FROM customers;
  • In CSV, the query reads ALL columns (even unused ones like address, phone number, email).
  • In Parquet, it reads only the necessary columns → 50-80% less disk I/O.
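
In code, column pruning is simply a matter of asking for the columns you need. A minimal PyArrow sketch (the file and column names are assumed, not from a real dataset):

import pyarrow.parquet as pq

# Only the "customer_name" and "account_balance" column chunks are read from disk;
# every other column in the file is never touched.
table = pq.read_table("customers.parquet",
                      columns=["customer_name", "account_balance"])
print(table.num_rows, table.column_names)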


(2) Efficient Compression → Smaller Storage Size

Parquet applies multiple compression techniques for better space efficiency.

(a) Dictionary Encoding

  • Instead of storing repeated values, it stores unique values in a dictionary.
  • Example for City column:

Original Data: ["New York", "London", "Paris", "London", "New York"]
Dictionary: {0 → "New York", 1 → "London", 2 → "Paris"}
Stored as: [0, 1, 2, 1, 0]  (uses less space)
        

(b) Run-Length Encoding (RLE)

  • Compresses repeated values efficiently.
  • Example: Original: [25, 25, 25, 30, 30, 35]
  • RLE Encoded: (25,3), (30,2), (35,1)
  • Used for integer and boolean columns.
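
You can check which encodings a writer actually chose by inspecting the column-chunk metadata. A sketch using PyArrow (the exact encoding names reported depend on the writer version):

import pyarrow as pa
import pyarrow.parquet as pq

cities = pa.table({"city": ["New York", "London", "Paris", "London", "New York"]})
pq.write_table(cities, "cities.parquet", use_dictionary=True)

meta = pq.ParquetFile("cities.parquet").metadata
# Encodings for the "city" column chunk in the first row group,
# e.g. a dictionary encoding plus RLE for the dictionary indices.
print(meta.row_group(0).column(0).encodings)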


Real-World Example

A telecom company storing call records.

  • CSV file size → 500 GB
  • Parquet file size → 80 GB
  • 80-85% storage savings!
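
The exact ratio depends on the data, but the comparison is easy to reproduce. A sketch assuming pandas with PyArrow installed, using synthetic, highly repetitive data:

import os
import pandas as pd

# A synthetic, highly repetitive dataset -- the kind of data Parquet compresses well.
df = pd.DataFrame({
    "caller_city": ["New York", "London", "Paris"] * 100_000,
    "duration_sec": [60, 120, 180] * 100_000,
})

df.to_csv("calls.csv", index=False)
df.to_parquet("calls.parquet", compression="snappy")  # pandas uses pyarrow under the hood

print("CSV bytes:    ", os.path.getsize("calls.csv"))
print("Parquet bytes:", os.path.getsize("calls.parquet"))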


(3) Predicate Pushdown → Faster Filtering

  • Parquet stores min/max values for each column.
  • Before scanning data, it skips irrelevant data blocks.

Example

Query:

SELECT * FROM customers WHERE Age > 30;
        

  • Parquet checks the footer metadata for the Age column: min=25, max=45
  • If a row group has (min=20, max=28) → Skipped!
  • Result: 50-90% faster filtering.
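
Query engines apply this automatically, but you can also pass the predicate yourself when reading with PyArrow (a sketch; customers.parquet and the Age column are assumed):

import pyarrow.parquet as pq

# Row groups whose Age statistics cannot satisfy Age > 30 are skipped
# before their data pages are ever decompressed.
table = pq.read_table("customers.parquet", filters=[("Age", ">", 30)])
print(table.num_rows)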


C. Internal Architecture of a Parquet File

Parquet organizes data into multiple layers for optimized storage and retrieval:

Parquet File Structure

+-------------------------------------------+
| File Header (Magic Number "PAR1")        |
+-------------------------------------------+
| Row Group 1                              |
|   ├── Column Chunk 1 (Compressed)        |
|   │     ├── Page 1                       |
|   │     ├── Page 2                       |
|   ├── Column Chunk 2 (Compressed)        |
|         ├── Page 1                       |
|         ├── Page 2                       |
| ...                                       |
+-------------------------------------------+
| Row Group 2                              |
|   ├── Column Chunk 1                     |
|   ├── Column Chunk 2                     |
| ...                                       |
+-------------------------------------------+
| File Footer (Metadata & Schema)          |
+-------------------------------------------+
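
All of these layers are visible from the footer alone, without reading any data pages. A PyArrow sketch (any_file.parquet is just a placeholder path):

import pyarrow.parquet as pq

pf = pq.ParquetFile("any_file.parquet")  # placeholder path
md = pf.metadata

print("row groups:", md.num_row_groups)
print("columns per row group:", md.num_columns)
# Each row group is made of one column chunk per column.
print("rows in first row group:", md.row_group(0).num_rows)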
        

How Is Data Stored in Parquet?

1. Row Groups (Logical division of data)

Parquet divides large datasets into row groups, where each row group contains multiple rows but is stored column-wise.

Example Dataset:

ID   | Name    | Age  | City      
1    | Alice   | 25   | New York  
2    | Bob     | 30   | London  
3    | Charlie | 35   | Paris  
4    | David   | 40   | Berlin  
5    | Eva     | 45   | Tokyo  
        

With a row group size of 3, data is split as:

Row Group 1:

1, Alice, 25, New York  
2, Bob, 30, London  
3, Charlie, 35, Paris  
        

Row Group 2:

4, David, 40, Berlin  
5, Eva, 45, Tokyo  
        

Row groups allow parallel processing, making queries significantly faster.
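
Row-group size is a writer-side setting. A PyArrow sketch mirroring the example above (a row group of 3 is unrealistically small and used only for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ID":   [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age":  [25, 30, 35, 40, 45],
    "City": ["New York", "London", "Paris", "Berlin", "Tokyo"],
})

# Force 3 rows per row group (real files use tens of thousands to millions of rows).
pq.write_table(table, "people.parquet", row_group_size=3)

print(pq.ParquetFile("people.parquet").metadata.num_row_groups)  # 2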


2. Column Chunks (Columnar Storage)

Each row group is further divided into column chunks, where each column is stored separately.

Row Group 1’s Column Chunks:

ID   → [1, 2, 3]  
Name → ["Alice", "Bob", "Charlie"]  
Age  → [25, 30, 35]  
City → ["New York", "London", "Paris"]  
        

Row Group 2’s Column Chunks:

ID   → [4, 5]  
Name → ["David", "Eva"]  
Age  → [40, 45]  
City → ["Berlin", "Tokyo"]  
        

Why is this beneficial?

  • Reads only relevant columns → Reducing I/O
  • Better compression → Similar values stored together
  • Faster queries → Only necessary data is accessed
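
Because each row group carries its own column chunks, a reader can target one column of one row group. A PyArrow sketch (it assumes the people.parquet file written in the previous sketch):

import pyarrow.parquet as pq

pf = pq.ParquetFile("people.parquet")  # file from the row-group sketch above

# Read only the "Age" column chunk of the first row group --
# the other columns and the second row group are never touched.
print(pf.read_row_group(0, columns=["Age"]))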


3. Pages (Smallest Storage Units in Parquet)

Each column chunk is divided into pages, which are the fundamental storage blocks.

Parquet uses different page types for optimization:

  • Data Page → Stores actual column values
  • Dictionary Page → Stores unique values for dictionary encoding
  • Index Page → Stores min/max statistics for faster filtering

Example: The "Age" Column in Row Group 1

[25, 30, 35]
        

With Run-Length Encoding (RLE), this chunk would be stored as:

(25, 1), (30, 1), (35, 1)
        

Every run here has length 1, so RLE saves nothing for this particular chunk; a writer would typically bit-pack these small integers instead. RLE pays off when values repeat, as in the earlier [25, 25, 25, 30, 30, 35] example, which collapses to (25,3), (30,2), (35,1). Picking the right page encoding is what keeps files small and queries fast.
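
Page size is also a writer-side setting. A PyArrow sketch (the 64 KB value is only an illustration, and a column this small fits in a single page anyway; the offsets show where the dictionary and data pages start inside the column chunk):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"Age": [25, 30, 35]})

# data_page_size caps how large each data page may grow (in bytes);
# use_dictionary controls whether a dictionary page is written for the column.
pq.write_table(table, "ages.parquet", data_page_size=64 * 1024, use_dictionary=True)

col = pq.ParquetFile("ages.parquet").metadata.row_group(0).column(0)
print(col.dictionary_page_offset, col.data_page_offset)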


4. Parquet Metadata: The Brain of the File

Parquet stores rich metadata in the footer for better performance.

  • Schema → Defines column names and data types
  • Row Group Info → Number of rows per group
  • Column Stats → Min/Max values (used for predicate pushdown)
  • Encoding & Compression Details

Example Metadata Structure:

{
  "schema": {
    "fields": [
      {"name": "ID", "type": "int"},
      {"name": "Name", "type": "string"},
      {"name": "Age", "type": "int"},
      {"name": "City", "type": "string"}
    ]
  },
  "row_groups": [
    {"num_rows": 3, "columns": [
      {"name": "ID", "encoding": "RLE"},
      {"name": "Name", "encoding": "Dictionary"},
      {"name": "Age", "encoding": "Bit-Packing"},
      {"name": "City", "encoding": "Dictionary"}
    ]}
  ]
}
        

Metadata makes Parquet files self-descriptive, schema-aware, and highly optimized.
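
All of the above can be read back from the footer without touching the data pages. A PyArrow sketch (it reuses the hypothetical people.parquet file from the row-group sketch):

import pyarrow.parquet as pq

md = pq.ParquetFile("people.parquet").metadata  # file from the row-group sketch

print(md.schema)                       # column names and physical/logical types
print(md.num_rows, md.num_row_groups)  # row counts and row-group layout

age = md.row_group(0).column(2)        # the "Age" column chunk
print(age.compression, age.encodings)  # codec and encodings actually used
print(age.statistics.min, age.statistics.max)  # min/max used for predicate pushdown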


D. Conclusion

Apache Parquet is a highly optimized columnar file format that improves storage efficiency and query speed. With its columnar storage, encoding, compression, and metadata optimizations, Parquet is the best choice for big data analytics.

Key Benefits of Parquet

  • Faster queries (reads only relevant columns)
  • Better compression (dictionary encoding, RLE, bit-packing)
  • Schema evolution (can add new columns without breaking old data)
  • Efficient filtering (predicate pushdown skips unnecessary data)

