Apache Parquet – A Deep Dive into Internal Architecture & Advantages
Apache Parquet is a high-performance, columnar storage format that is widely used in big data analytics and distributed computing frameworks like Apache Spark. It is designed to optimize storage efficiency and query performance while overcoming the limitations of row-based formats like CSV and JSON.
A. Why Was Parquet Created?
Before understanding how Parquet works, let’s discuss why it was developed.
Traditional file formats like CSV and JSON store data row by row. While this layout is efficient for transactional databases, it poses major challenges for analytical queries at scale.
1. Unnecessary I/O Reads
Imagine a dataset with 50 columns and 100 million rows. Now, consider this query:
SELECT name, age FROM customers;
Problem in row-based formats: the query needs only 2 of the 50 columns, yet the engine must read every row in its entirety, parsing all 50 columns and discarding 48 of them. At 100 million rows, that is an enormous amount of wasted I/O.
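As a rough sketch in PySpark (the file paths and the customers dataset are hypothetical), column pruning means Parquet reads only the requested columns from disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: Spark must read and parse all 50 columns of every row.
csv_df = spark.read.option("header", True).csv("customers.csv")
csv_df.select("name", "age").show()

# Parquet: only the "name" and "age" column chunks are read from disk.
parquet_df = spark.read.parquet("customers.parquet")
parquet_df.select("name", "age").show()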
2. Inefficient Compression
Row-based files interleave integers, strings, and dates within each row, so compression algorithms find few repeating patterns to exploit. Columnar formats group values of the same type together, which compresses far more effectively.
3. Slow Filtering (No Predicate Pushdown)
Consider this query:
SELECT * FROM customers WHERE age > 30;
Problem in row-based formats: there is no way to skip data. Every row must be read and parsed just to evaluate the Age > 30 condition, even when most rows fail the filter.
Columnar formats like Parquet solve this issue using Predicate Pushdown, where irrelevant rows are skipped upfront, making queries much faster.
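You can observe pushdown in Spark's physical plan; a minimal sketch, assuming a customers.parquet file with an age column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("customers.parquet")
df.filter(col("age") > 30).explain()
# The physical plan includes a line similar to:
#   PushedFilters: [IsNotNull(age), GreaterThan(age,30)]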
4. Schema Evolution is Hard
Adding, removing, or renaming a column in a CSV file usually means rewriting the entire file and coordinating every consumer. Parquet embeds the schema in each file's footer, so readers can reconcile files written with different (compatible) schema versions.
B. Why Use Parquet?
Parquet was designed to overcome these limitations. It provides several advantages:
(1) Columnar Storage → Faster Queries & Less I/O
Consider this sample table, stored row by row in a traditional format:

ID | Name | Age | Salary
1 | Alice | 25 | 50000
2 | Bob | 30 | 60000
3 | Charlie | 35 | 70000
4 | David | 40 | 80000

Parquet instead stores the values of each column together:
ID: [1, 2, 3, 4]
Name: ["Alice", "Bob", "Charlie", "David"]
Age: [25, 30, 35, 40]
Salary: [50000, 60000, 70000, 80000]
Real-World Example
A banking application storing millions of customer records in Parquet can answer a two-column report by reading only those two column chunks, skipping the rest of the file entirely.
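A minimal sketch with pandas (requires pyarrow installed; the file name is illustrative):

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 60000, 70000, 80000],
})
df.to_parquet("customers.parquet")

# Only the "Name" and "Age" column chunks are read back from disk.
subset = pd.read_parquet("customers.parquet", columns=["Name", "Age"])
print(subset)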
(2) Efficient Compression → Smaller Storage Size
Parquet applies multiple compression techniques for better space efficiency.
(a) Dictionary Encoding
Original Data: ["New York", "London", "Paris", "London", "New York"]
Dictionary: {0 → "New York", 1 → "London", 2 → "Paris"}
Stored as: [0, 1, 2, 1, 0] (uses less space)
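pyarrow exposes the same idea directly; a small sketch:

import pyarrow as pa

cities = pa.array(["New York", "London", "Paris", "London", "New York"])
encoded = cities.dictionary_encode()

print(encoded.dictionary)  # ["New York", "London", "Paris"]
print(encoded.indices)     # [0, 1, 2, 1, 0]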
(b) Run-Length Encoding (RLE)
RLE stores a repeated value once, together with how many times it repeats.
Original Data: ["London", "London", "London", "Paris", "Paris"]
Stored as: (London, 3), (Paris, 2)
Real-World Example
A telecom company storing call records benefits enormously: columns such as call status or network type contain only a handful of distinct, highly repetitive values, so dictionary encoding plus RLE shrinks them to a fraction of their raw size.
(3) Predicate Pushdown → Faster Filtering
Example
Query:
SELECT * FROM customers WHERE Age > 30;
Parquet stores min/max statistics for every column in every row group. If a row group's maximum Age is 30 or less, the entire row group is skipped without being read or decompressed, so the engine touches only data that can possibly match.
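You can inspect these statistics with pyarrow; a sketch against the customers.parquet file written above:

import pyarrow.parquet as pq

meta = pq.ParquetFile("customers.parquet").metadata
stats = meta.row_group(0).column(2).statistics  # column 2 is "Age" here
print(stats.min, stats.max)  # the reader compares these against the filter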
C. Internal Architecture of a Parquet File
Parquet organizes data into multiple layers for optimized storage and retrieval:
Parquet File Structure
+-------------------------------------------+
| File Header (Magic Number "PAR1") |
+-------------------------------------------+
| Row Group 1 |
| ├── Column Chunk 1 (Compressed) |
| │ ├── Page 1 |
| │ ├── Page 2 |
| ├── Column Chunk 2 (Compressed) |
| │ ├── Page 1 |
| │ ├── Page 2 |
| ... |
+-------------------------------------------+
| Row Group 2 |
| ├── Column Chunk 1 |
| ├── Column Chunk 2 |
| ... |
+-------------------------------------------+
| File Footer (Metadata & Schema) |
+-------------------------------------------+
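A Parquet file really does begin and end with the 4-byte magic number; a tiny check (the path is illustrative):

with open("customers.parquet", "rb") as f:
    header = f.read(4)
    f.seek(-4, 2)           # jump to 4 bytes before the end of the file
    footer_magic = f.read(4)

print(header, footer_magic)  # b'PAR1' b'PAR1'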
How Is Data Stored in Parquet?
1. Row Groups (Logical division of data)
Parquet divides large datasets into row groups, where each row group contains multiple rows but is stored column-wise.
Example Dataset:
ID | Name | Age | City
1 | Alice | 25 | New York
2 | Bob | 30 | London
3 | Charlie | 35 | Paris
4 | David | 40 | Berlin
5 | Eva | 45 | Tokyo
With a row group size of 3 rows (tiny here for illustration; real row groups typically hold many thousands of rows), data is split as:
Row Group 1:
1, Alice, 25, New York
2, Bob, 30, London
3, Charlie, 35, Paris
Row Group 2:
4, David, 40, Berlin
5, Eva, 45, Tokyo
Row groups allow parallel processing, making queries significantly faster.
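pyarrow lets you control the row group size directly; a sketch reproducing the split above (file name is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "London", "Paris", "Berlin", "Tokyo"],
})
pq.write_table(table, "people.parquet", row_group_size=3)

print(pq.ParquetFile("people.parquet").metadata.num_row_groups)  # 2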
2. Column Chunks (Columnar Storage)
Each row group is further divided into column chunks, where each column is stored separately.
Row Group 1’s Column Chunks:
ID → [1, 2, 3]
Name → ["Alice", "Bob", "Charlie"]
Age → [25, 30, 35]
City → ["New York", "London", "Paris"]
Row Group 2’s Column Chunks:
ID → [4, 5]
Name → ["David", "Eva"]
Age → [40, 45]
City → ["Berlin", "Tokyo"]
Why is this beneficial? A query that needs only the Age column reads only the Age chunks and ignores ID, Name, and City entirely. And because each chunk holds values of a single type, it compresses much better than mixed row data.
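With pyarrow you can pull a single column out of a single row group; a sketch reusing the people.parquet file from above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("people.parquet")
ages = pf.read_row_group(0, columns=["Age"])  # just Row Group 1's Age chunk
print(ages.column("Age"))  # [25, 30, 35]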
3. Pages (Smallest Storage Units in Parquet)
Each column chunk is divided into pages, which are the fundamental storage blocks.
Parquet uses different page types for optimization: data pages hold the encoded values themselves, dictionary pages hold the lookup dictionaries used by dictionary encoding, and index pages help locate data within a chunk.
Example: The "Age" Column in Row Group 1
[25, 30, 35]
Using Run-Length Encoding (RLE), each value is paired with its run length:
(25, 1), (30, 1), (35, 1)
All-distinct values like these gain little from RLE, so Parquet's hybrid encoder falls back to bit-packing; a repetitive column such as [30, 30, 30, 30] collapses to a single (30, 4) pair. Smaller pages mean less I/O and faster queries.
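Page-level details are visible in the column-chunk metadata; a sketch (the exact encodings reported depend on the writer):

import pyarrow.parquet as pq

age_chunk = pq.ParquetFile("people.parquet").metadata.row_group(0).column(2)
print(age_chunk.encodings)               # e.g. ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
print(age_chunk.data_page_offset)        # byte offset of the first data page
print(age_chunk.dictionary_page_offset)  # dictionary page offset, if present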
4. Parquet Metadata: The Brain of the File
Parquet stores rich metadata in the footer for better performance.
Example Metadata Structure:
{
  "schema": {
    "fields": [
      {"name": "ID", "type": "int"},
      {"name": "Name", "type": "string"},
      {"name": "Age", "type": "int"},
      {"name": "City", "type": "string"}
    ]
  },
  "row_groups": [
    {"num_rows": 3, "columns": [
      {"name": "ID", "encoding": "RLE"},
      {"name": "Name", "encoding": "Dictionary"},
      {"name": "Age", "encoding": "Bit-Packing"},
      {"name": "City", "encoding": "Dictionary"}
    ]}
  ]
}
Metadata makes Parquet files self-descriptive, schema-aware, and highly optimized.
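All of this is exposed programmatically; a short sketch reading the footer of the people.parquet file from earlier:

import pyarrow.parquet as pq

meta = pq.read_metadata("people.parquet")
print(meta.num_rows, meta.num_row_groups)  # 5 2
print(meta.schema)                         # column names and physical types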
D. Conclusion
Apache Parquet is a highly optimized columnar file format that improves both storage efficiency and query speed. With its columnar storage, encoding, compression, and metadata optimizations, Parquet is a strong default choice for big data analytics.
Key Benefits of Parquet
1. Columnar storage → read only the columns a query needs, drastically cutting I/O.
2. Dictionary encoding, RLE, and compression → much smaller files on disk.
3. Predicate pushdown via row-group statistics → irrelevant data is skipped before it is read.
4. Self-describing footer metadata → painless schema evolution and broad engine compatibility.