Apache Parquet – A Deep Dive into Internal Architecture & Advantages

Apache Parquet is a high-performance, columnar storage format that is widely used in big data analytics and distributed computing frameworks like Apache Spark. It is designed to optimize storage efficiency and query performance while overcoming the limitations of row-based formats like CSV and JSON.


A. Why Was Parquet Created?

Before understanding how Parquet works, let’s discuss why it was developed.

Traditional file formats like CSV and JSON store data row by row. While this works well for transactional workloads, it poses major challenges for analytical queries at scale.

1. Unnecessary I/O Reads

Imagine a dataset with 50 columns and 100 million rows. Now, consider this query:

SELECT name, age FROM customers;
        

Problem in row-based formats:

  • Even though we need only 2 columns, all 50 columns are read into memory.
  • This leads to wasted disk I/O, memory, and CPU cycles.


2. Inefficient Compression

  • Row-based storage mixes different data types (integers, strings, dates) together.
  • This prevents efficient compression, as different types require different compression strategies.
  • Columnar formats (like Parquet), on the other hand, store similar values together, leading to better compression and faster queries.


3. Slow Filtering (No Predicate Pushdown)

Consider this query:

SELECT * FROM customers WHERE age > 30;
        

Problem in row-based formats:

  • The query reads every row, even if only a few rows match the condition.
  • This leads to unnecessary scans, increasing query execution time.

Columnar formats like Parquet solve this issue using Predicate Pushdown: column statistics let the reader skip irrelevant row groups and pages up front, making queries much faster.


4. Schema Evolution is Hard

  • Adding a new column to a CSV file? Older software may fail to process it correctly.
  • Schema changes in row-based formats can break existing pipelines.
  • Parquet supports schema evolution, allowing new columns without breaking older queries.
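
If you want to see this in action, here is a minimal sketch using PyArrow (assumed to be installed; the file names v1.parquet and v2.parquet are invented for the example):

import pyarrow as pa
import pyarrow.parquet as pq

# Version 1 of the data: two columns.
pq.write_table(pa.table({"id": [1, 2], "name": ["Alice", "Bob"]}), "v1.parquet")

# Version 2 adds an "email" column later.
pq.write_table(pa.table({"id": [3], "name": ["Carol"], "email": ["carol@example.com"]}),
               "v2.parquet")

# A reader that only knows the original columns still works on the newer file,
# because the footer carries the schema and columns are read by name.
print(pq.read_table("v2.parquet", columns=["id", "name"]))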


B. Why Use Parquet?

Parquet was designed to overcome these limitations. It provides several advantages:

(1) Columnar Storage → Faster Queries & Less I/O

  • In Parquet, data is stored column-wise, not row-wise.
  • Example dataset:

ID | Name    | Age | Salary
1  | Alice   | 25  | 50000
2  | Bob     | 30  | 60000
3  | Charlie | 35  | 70000
4  | David   | 40  | 80000

  • Stored in Parquet format:

ID: [1, 2, 3, 4]
Name: ["Alice", "Bob", "Charlie", "David"]
Age: [25, 30, 35, 40]
Salary: [50000, 60000, 70000, 80000]
        

  • Now, if a query needs only Age & Salary, it skips "ID" and "Name".
  • This reduces disk I/O and memory usage significantly.

Real-World Example

A banking application storing millions of customer records in Parquet.

  • A query like: SELECT customer_name, account_balance FROM customers;
  • In CSV, the query reads ALL columns (even unused ones like address, phone number, email).
  • In Parquet, it reads only the necessary columns → 50-80% less disk I/O.
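
In code, column pruning is simply a matter of asking for the columns you need. A minimal PyArrow sketch (the file and column names are assumed, not from a real dataset):

import pyarrow.parquet as pq

# Only the "customer_name" and "account_balance" column chunks are read from disk;
# every other column in the file is never touched.
table = pq.read_table("customers.parquet",
                      columns=["customer_name", "account_balance"])
print(table.num_rows, table.column_names)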


(2) Efficient Compression → Smaller Storage Size

Parquet applies multiple compression techniques for better space efficiency.

(a) Dictionary Encoding

  • Instead of storing repeated values, it stores unique values in a dictionary.
  • Example for City column:

Original Data: ["New York", "London", "Paris", "London", "New York"]
Dictionary: {0 → "New York", 1 → "London", 2 → "Paris"}
Stored as: [0, 1, 2, 1, 0]  (uses less space)
        

(b) Run-Length Encoding (RLE)

  • Compresses repeated values efficiently.
  • Example: Original: [25, 25, 25, 30, 30, 35]
  • RLE Encoded: (25,3), (30,2), (35,1)
  • Used for integer and boolean columns.
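
You can check which encodings a writer actually chose by inspecting the column-chunk metadata. A sketch using PyArrow (the exact encoding names reported depend on the writer version):

import pyarrow as pa
import pyarrow.parquet as pq

cities = pa.table({"city": ["New York", "London", "Paris", "London", "New York"]})
pq.write_table(cities, "cities.parquet", use_dictionary=True)

meta = pq.ParquetFile("cities.parquet").metadata
# Encodings for the "city" column chunk in the first row group,
# e.g. a dictionary encoding plus RLE for the dictionary indices.
print(meta.row_group(0).column(0).encodings)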


Real-World Example

A telecom company storing call records.

  • CSV file size → 500 GB
  • Parquet file size → 80 GB
  • 80-85% storage savings!
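
The exact ratio depends on the data, but the comparison is easy to reproduce. A sketch assuming pandas with PyArrow installed, using synthetic, highly repetitive data:

import os
import pandas as pd

# A synthetic, highly repetitive dataset -- the kind of data Parquet compresses well.
df = pd.DataFrame({
    "caller_city": ["New York", "London", "Paris"] * 100_000,
    "duration_sec": [60, 120, 180] * 100_000,
})

df.to_csv("calls.csv", index=False)
df.to_parquet("calls.parquet", compression="snappy")  # pandas uses pyarrow under the hood

print("CSV bytes:    ", os.path.getsize("calls.csv"))
print("Parquet bytes:", os.path.getsize("calls.parquet"))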


(3) Predicate Pushdown → Faster Filtering

  • Parquet stores min/max values for each column.
  • Before scanning data, it skips irrelevant data blocks.

Example

Query:

SELECT * FROM customers WHERE Age > 30;
        

  • Parquet checks the footer metadata for the Age column: min=25, max=45
  • If a row group has (min=20, max=28) → Skipped!
  • Result: 50-90% faster filtering.
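
Query engines apply this automatically, but you can also pass the predicate yourself when reading with PyArrow (a sketch; customers.parquet and the Age column are assumed):

import pyarrow.parquet as pq

# Row groups whose Age statistics cannot satisfy Age > 30 are skipped
# before their data pages are ever decompressed.
table = pq.read_table("customers.parquet", filters=[("Age", ">", 30)])
print(table.num_rows)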


C. Internal Architecture of a Parquet File

Parquet organizes data into multiple layers for optimized storage and retrieval:

Parquet File Structure

+-------------------------------------------+
| File Header (Magic Number "PAR1")        |
+-------------------------------------------+
| Row Group 1                              |
|   ├── Column Chunk 1 (Compressed)        |
|   │     ├── Page 1                       |
|   │     ├── Page 2                       |
|   ├── Column Chunk 2 (Compressed)        |
|         ├── Page 1                       |
|         ├── Page 2                       |
| ...                                       |
+-------------------------------------------+
| Row Group 2                              |
|   ├── Column Chunk 1                     |
|   ├── Column Chunk 2                     |
| ...                                       |
+-------------------------------------------+
| File Footer (Metadata & Schema)          |
+-------------------------------------------+
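
All of these layers are visible from the footer alone, without reading any data pages. A PyArrow sketch (any_file.parquet is just a placeholder path):

import pyarrow.parquet as pq

pf = pq.ParquetFile("any_file.parquet")  # placeholder path
md = pf.metadata

print("row groups:", md.num_row_groups)
print("columns per row group:", md.num_columns)
# Each row group is made of one column chunk per column.
print("rows in first row group:", md.row_group(0).num_rows)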
        

How Is Data Stored in Parquet?

1. Row Groups (Logical division of data)

Parquet divides large datasets into row groups, where each row group contains multiple rows but is stored column-wise.

Example Dataset:

ID   | Name    | Age  | City      
1    | Alice   | 25   | New York  
2    | Bob     | 30   | London  
3    | Charlie | 35   | Paris  
4    | David   | 40   | Berlin  
5    | Eva     | 45   | Tokyo  
        

With a row group size of 3, data is split as:

Row Group 1:

1, Alice, 25, New York  
2, Bob, 30, London  
3, Charlie, 35, Paris  
        

Row Group 2:

4, David, 40, Berlin  
5, Eva, 45, Tokyo  
        

Row groups allow parallel processing, making queries significantly faster.
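
Row-group size is a writer-side setting. A PyArrow sketch mirroring the example above (a row group of 3 is unrealistically small and used only for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ID":   [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age":  [25, 30, 35, 40, 45],
    "City": ["New York", "London", "Paris", "Berlin", "Tokyo"],
})

# Force 3 rows per row group (real files use tens of thousands to millions of rows).
pq.write_table(table, "people.parquet", row_group_size=3)

print(pq.ParquetFile("people.parquet").metadata.num_row_groups)  # 2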


2. Column Chunks (Columnar Storage)

Each row group is further divided into column chunks, where each column is stored separately.

Row Group 1’s Column Chunks:

ID   → [1, 2, 3]  
Name → ["Alice", "Bob", "Charlie"]  
Age  → [25, 30, 35]  
City → ["New York", "London", "Paris"]  
        

Row Group 2’s Column Chunks:

ID   → [4, 5]  
Name → ["David", "Eva"]  
Age  → [40, 45]  
City → ["Berlin", "Tokyo"]  
        

Why is this beneficial?

  • Reads only relevant columns → Reducing I/O
  • Better compression → Similar values stored together
  • Faster queries → Only necessary data is accessed
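
Because each row group carries its own column chunks, a reader can target one column of one row group. A PyArrow sketch (it assumes the people.parquet file written in the previous sketch):

import pyarrow.parquet as pq

pf = pq.ParquetFile("people.parquet")  # file from the row-group sketch above

# Read only the "Age" column chunk of the first row group --
# the other columns and the second row group are never touched.
print(pf.read_row_group(0, columns=["Age"]))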


3. Pages (Smallest Storage Units in Parquet)

Each column chunk is divided into pages, which are the fundamental storage blocks.

Parquet uses different page types for optimization:

  • Data Page → Stores actual column values
  • Dictionary Page → Stores unique values for dictionary encoding
  • Index Page → Stores min/max statistics for faster filtering

Example: The "Age" Column in Row Group 1

[25, 30, 35]
        

With Run-Length Encoding (RLE), this chunk would be stored as:

(25, 1), (30, 1), (35, 1)
        

Every run here has length 1, so RLE saves nothing for this particular chunk; a writer would typically bit-pack these small integers instead. RLE pays off when values repeat, as in the earlier [25, 25, 25, 30, 30, 35] example, which collapses to (25,3), (30,2), (35,1). Picking the right page encoding is what keeps files small and queries fast.
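
Page size is also a writer-side setting. A PyArrow sketch (the 64 KB value is only an illustration, and a column this small fits in a single page anyway; the offsets show where the dictionary and data pages start inside the column chunk):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"Age": [25, 30, 35]})

# data_page_size caps how large each data page may grow (in bytes);
# use_dictionary controls whether a dictionary page is written for the column.
pq.write_table(table, "ages.parquet", data_page_size=64 * 1024, use_dictionary=True)

col = pq.ParquetFile("ages.parquet").metadata.row_group(0).column(0)
print(col.dictionary_page_offset, col.data_page_offset)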


4. Parquet Metadata: The Brain of the File

Parquet stores rich metadata in the footer for better performance.

  • Schema → Defines column names and data types
  • Row Group Info → Number of rows per group
  • Column Stats → Min/Max values (used for predicate pushdown)
  • Encoding & Compression Details

Example Metadata Structure:

{
  "schema": {
    "fields": [
      {"name": "ID", "type": "int"},
      {"name": "Name", "type": "string"},
      {"name": "Age", "type": "int"},
      {"name": "City", "type": "string"}
    ]
  },
  "row_groups": [
    {"num_rows": 3, "columns": [
      {"name": "ID", "encoding": "RLE"},
      {"name": "Name", "encoding": "Dictionary"},
      {"name": "Age", "encoding": "Bit-Packing"},
      {"name": "City", "encoding": "Dictionary"}
    ]}
  ]
}
        

Metadata makes Parquet files self-descriptive, schema-aware, and highly optimized.
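
All of the above can be read back from the footer without touching the data pages. A PyArrow sketch (it reuses the hypothetical people.parquet file from the row-group sketch):

import pyarrow.parquet as pq

md = pq.ParquetFile("people.parquet").metadata  # file from the row-group sketch

print(md.schema)                       # column names and physical/logical types
print(md.num_rows, md.num_row_groups)  # row counts and row-group layout

age = md.row_group(0).column(2)        # the "Age" column chunk
print(age.compression, age.encodings)  # codec and encodings actually used
print(age.statistics.min, age.statistics.max)  # min/max used for predicate pushdown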


D. Conclusion

Apache Parquet is a highly optimized columnar file format that improves storage efficiency and query speed. With its columnar storage, encoding, compression, and metadata optimizations, Parquet is the best choice for big data analytics.

Key Benefits of Parquet

  • Faster queries (reads only relevant columns)
  • Better compression (dictionary encoding, RLE, bit-packing)
  • Schema evolution (can add new columns without breaking old data)
  • Efficient filtering (predicate pushdown skips unnecessary data)

