Deep dive into Apache Parquet

As data engineers, most of us have heard of the Parquet file format, which is widely used with Apache Spark and other big data tools. At a high level we know that Parquet is efficient for analytical reads and that it uses columnar storage.

While working with nested data, I started wondering how Parquet handles deeply nested structures and stores them for fast retrieval. That curiosity led me to spend a weekend digging into it.

Intro to Parquet

Parquet is a column-oriented data file format designed for efficient storage and retrieval. It also provides high-performance compression and encoding schemes for both simple and complex data types.

Building blocks of a Parquet file

  • Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset

  • Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file

  • Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk

A row group contains exactly one column chunk per column
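
As a quick way to poke at these building blocks, here is a small sketch using pyarrow (the file name and columns are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# write a tiny file so we have something to inspect
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "example.parquet")

# FileMetaData: number of row groups and columns in the file
meta = pq.ParquetFile("example.parquet").metadata
print(meta.num_row_groups, meta.num_columns)

# each row group has exactly one column chunk per column
rg = meta.row_group(0)
for i in range(rg.num_columns):
    chunk = rg.column(i)
    print(chunk.path_in_schema, chunk.num_values, chunk.total_compressed_size)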

Figure 1 - Parquet file layout (from the official Parquet documentation)

Figure 1 shows what the layout looks like inside a Parquet file. Parquet writes all the data first and only then writes the FileMetaData at the end of the file. Let's deep dive into some of these concepts.

Metadata

We will start with the metadata. There are three important kinds of metadata:

  • File metadata
  • Column chunk metadata
  • Page header metadata

Figure 2 - Metadata structures (from the official Parquet documentation)


From the figure we can see that FileMetaData holds references to all the sub-metadata structures: RowGroup, ColumnChunk, ColumnMetaData and SchemaElement. The docs do not describe every attribute in detail, but most of them can be figured out from their names.

DataPageHeader carries the definition and repetition levels used for nested data types; these are discussed below in the Data Pages section.
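
Most of this metadata is easy to inspect from Python. A minimal sketch with pyarrow (file name and column are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"DocId": [10, 20]}), "meta_example.parquet")
meta = pq.ParquetFile("meta_example.parquet").metadata

# FileMetaData level
print(meta.created_by, meta.num_rows, meta.num_row_groups)
print(meta.schema)                      # the SchemaElement tree

# ColumnChunkMetaData / ColumnMetaData level
chunk = meta.row_group(0).column(0)
print(chunk.compression, chunk.encodings, chunk.data_page_offset)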

Nested encoding is achieved with the Dremel algorithm. This is what brought us here, so let's deep dive into it.

Dremel algorithm

The Dremel paper describes how to store and retrieve complex nested data efficiently. It answers the following challenges:

  • Lossless representation of record structure in columnar format
  • Fast encoding
  • Efficient record assembly

Sample records

#record 1
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'https://A'
Name
  Url: 'https://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

#record 2
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'https://C'

Schema of the sample records

# Schema
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}
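
For reference, the same nested shape can be expressed with pyarrow types. This is only a rough equivalent (Arrow maps repeated fields to lists and groups to structs, and the Arrow-to-Parquet conversion adds list wrapper levels), but it shows how such a schema looks in code:

import pyarrow as pa

# repeated -> list, group -> struct, required -> nullable=False
document = pa.schema([
    pa.field("DocId", pa.int64(), nullable=False),
    pa.field("Links", pa.struct([
        pa.field("Backward", pa.list_(pa.int64())),
        pa.field("Forward", pa.list_(pa.int64())),
    ])),
    pa.field("Name", pa.list_(pa.struct([
        pa.field("Language", pa.list_(pa.struct([
            pa.field("Code", pa.string(), nullable=False),
            pa.field("Country", pa.string()),
        ]))),
        pa.field("Url", pa.string()),
    ]))),
])
print(document)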

Repetition levels

  • In the nested data world, values alone cannot convey the structure. If we have two values of a repeated field in the same record, we cannot tell from the values alone which occurred first or at which level the repetition happened
  • Another example is a missing optional field in a record

Rules

  • Non-repeated fields do not require a repetition level
  • The repetition level tells us at which repeated field in the field's path the value has repeated
  • Levels are assigned against the full schema, combining the structure of the schema and the sample records

Assigning repetition levels for record 1

(Figure: repetition levels assigned to record 1)
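
As a walk-through of these rules (my own working, which matches the example in the Dremel paper), take the column Name.Language.Code in record 1. The repeated fields in its path are Name (level 1) and Language (level 2):

value     r   why
'en-us'   0   first Code value in the record
'en'      2   Language repeated (the 2nd repeated field in the path)
NULL      1   the second Name has no Language; Name repeated (the 1st repeated field)
'en-gb'   1   Name repeated again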

Definition levels

Rules

  • Each value of a field p, and especially every NULL, has a definition level specifying how many fields in the path of p that could be undefined (because they are repeated or optional) are actually present in the record

Assigning the definition levels

(Figure: definition levels assigned to record 1)
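
Walking through Name.Language.Country for record 1 (again my own working, following the rule above): the maximum definition level for this path is 3, because Name, Language and Country can all be undefined.

value   d   why
'us'    3   Name, Language and Country are all present
NULL    2   Name and Language present, Country missing
NULL    1   the second Name has no Language at all
'gb'    3   all three present again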

Column striping algorithm

  • Arranging the data can be thought of as a depth-first traversal of a tree
  • Start from the root of the record and walk down until you reach a leaf. Children are either a different field or a different instance of a repeated field
  • When a leaf is reached, write out the value, the corresponding maximum definition level and the current repetition level
  • When an empty field is reached, write a NULL for every leaf below this node, with the definition level of the last defined ancestor
  • When walking back up the tree, the level where the last repetition happened becomes the current repetition level
  • When a field is not defined, its definition level will be smaller than the maximum definition level of that field (we saw this in the walk-through above; a small sketch of this striping loop follows the list)
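
To make the rules concrete, here is a minimal Python sketch that stripes the single column Name.Language.Country out of a record shaped like the samples above. It is hard-coded to this one path and is only an illustration of the rules, not Parquet's actual writer code:

def stripe_country(record):
    """Yield (repetition_level, definition_level, value) triples for the
    column Name.Language.Country of one record. Max definition level is 3."""
    out = []
    names = record.get("Name", [])
    if not names:
        out.append((0, 0, None))            # nothing defined below the root
        return out
    for i, name in enumerate(names):
        r_name = 0 if i == 0 else 1          # Name is the 1st repeated field in the path
        languages = name.get("Language", [])
        if not languages:
            out.append((r_name, 1, None))    # defined down to Name only
            continue
        for j, lang in enumerate(languages):
            r = r_name if j == 0 else 2      # Language is the 2nd repeated field
            if "Country" in lang:
                out.append((r, 3, lang["Country"]))
            else:
                out.append((r, 2, None))     # defined down to Language only
    return out

record1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}], "Url": "https://A"},
        {"Url": "https://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
print(stripe_country(record1))
# [(0, 3, 'us'), (2, 2, None), (1, 1, None), (1, 3, 'gb')]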

Record assembly

  • Let's say we need to reconstruct the record with only one field
  • Re-assembling the record means doing the same tree traversal over again. Pick the column for the field you need and reconstruct the record as follows: when a definition level is lower than the maximum, stop going down at that level; once a leaf has been reached, go back up to the repetition level of the next value and start going down again from there (a matching sketch follows below)
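
And a matching sketch for re-assembling the same Name.Language.Country column from its (repetition, definition, value) triples. Again this is only an illustration for this one path; it ignores the empty-record case to keep things short:

def assemble_country(levels):
    """Rebuild a partial record containing only Name.Language.Country.
    For this path: max definition level is 3; repetition level 0 starts
    a new record, 1 starts a new Name, 2 a new Language in the current Name.
    (The d == 0 case, a record with no Name at all, is ignored for brevity.)"""
    record = {"Name": []}
    for r, d, value in levels:
        if r <= 1:                      # new Name (r == 0 also starts the record)
            record["Name"].append({"Language": []})
        if d >= 2:                      # Language is present
            record["Name"][-1]["Language"].append({})
        if d == 3:                      # Country is present
            record["Name"][-1]["Language"][-1]["Country"] = value
    return record

print(assemble_country([(0, 3, 'us'), (2, 2, None), (1, 1, None), (1, 3, 'gb')]))
# {'Name': [{'Language': [{'Country': 'us'}, {}]},
#           {'Language': []},
#           {'Language': [{'Country': 'gb'}]}]}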

Phew, that's a lot to digest, but quite interesting. Those who want to read the complete paper can refer to the Dremel paper.

Let's continue with Parquet's other important features.

Data Pages

A data page holds three pieces of information:

  • repetition levels
  • definition levels
  • encoded values

Encoded values are always present. Repetition levels are only stored when the column is repeated (nested), and definition levels only when the column can be undefined (optional or nested).

Column chunks

Column chunks are composed of pages written back to back. The pages share a common header format, which readers can use to skip pages they are not interested in. These headers carry the compression and encoding details.

For example, take dictionary encoding: it builds a dictionary of the values encountered in a column, and this dictionary is stored in a dictionary page per column chunk.
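
A small pyarrow sketch showing the dictionary encoding in action (column name and file path are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# a low-cardinality column is a good dictionary-encoding candidate
table = pa.table({"country": ["us", "gb", "us", "us", "gb"] * 1000})
pq.write_table(table, "dict_example.parquet", use_dictionary=True)

chunk = pq.ParquetFile("dict_example.parquet").metadata.row_group(0).column(0)
print(chunk.encodings)   # the dictionary encoding shows up alongside RLE/PLAIN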

Bloom filter

This is a data structure and algorithm in its own right (maybe a topic for a different article).

We can use column statistics and dictionaries for predicate pushdown. Statistics store min and max values, which can be used to skip data outside the requested range. In some cases, such as high-cardinality columns, neither of these works well; that is where bloom filters are used.

A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. Its answer is either "possibly in the set" or "definitely not in the set"; there are no false negatives.
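
A toy sketch of the idea in Python (this is not Parquet's actual split-block bloom filter; the sizes and hash choices here are arbitrary):

import hashlib

class ToyBloomFilter:
    """A toy bloom filter: k hash functions set k bits per added element."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False -> definitely not in the set; True -> possibly in the set
        return all(self.bits[pos] for pos in self._positions(value))

bf = ToyBloomFilter()
bf.add("doc-10")
print(bf.might_contain("doc-10"))   # True  (possibly in the set)
print(bf.might_contain("doc-99"))   # almost certainly False (definitely not in the set)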

Page index

There are two structures added to the row group metadata (a small illustrative sketch follows the list):

  • ColumnIndex - lets the reader navigate the pages of a column using column values; it can locate the pages that contain the data requested by a scan predicate
  • OffsetIndex - allows navigation by row index and is used to retrieve the values for rows identified as matches via the ColumnIndex. Once rows of a column are skipped, the corresponding rows in the other columns have to be skipped too; hence the OffsetIndexes for all columns in a RowGroup are stored together
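
A conceptual sketch of how a reader could use per-page min/max information from a ColumnIndex to skip pages for a predicate. This is purely illustrative; the page list and values below are made up and this is not a real Parquet API:

# each entry is (page_number, min_value, max_value), as a ColumnIndex would expose it
column_index = [(0, 1, 100), (1, 101, 200), (2, 201, 300)]

def pages_for_predicate(column_index, lo, hi):
    """Return the pages whose [min, max] range overlaps the predicate range [lo, hi]."""
    return [page for page, mn, mx in column_index if mx >= lo and mn <= hi]

print(pages_for_predicate(column_index, 150, 160))   # [1] -> only page 1 needs to be read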

Some other topics, each of which is a huge topic in its own right:

  • Compression
  • Encryption

Conclusion

Often we use tools through the abstractions given to us, but after using a tool extensively we end up in situations where we need to squeeze out its full performance. To do that, we have to dig deep into the implementation. Happy learning!
