Deep dive into Apache Parquet
As data engineers, most of us have heard of the Parquet file format; it is widely used with Apache Spark and other big data tools. At a high level we know that Parquet is efficient for reading analytical data and that it uses columnar storage.
While working on nested data I started wondering how Parquet handles deeply nested data and stores it for fast retrieval. That curiosity led me to investigate it over a weekend.
Intro to Parquet
Parquet is a column-oriented data file format, designed mainly for storing and retrieving data efficiently. It also provides high-performance compression and encoding schemes for both simple and complex data types.
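To make this concrete, here is a minimal sketch using pyarrow (one of several libraries that read and write Parquet); the table, column names and file name are made up for illustration.

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table.
table = pa.table({
    "DocId": [10, 20],
    "Url": ["https://A", "https://C"],
})

# Write it out as Parquet with snappy compression.
pq.write_table(table, "docs.parquet", compression="snappy")

# Read back only one column -- columnar storage makes this cheap.
print(pq.read_table("docs.parquet", columns=["Url"]))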
Building blocks of a Parquet file
A Parquet file is split horizontally into row groups. A row group contains exactly one column chunk per column, and each column chunk is further divided into pages.
In figure 1 above we can see what the layout inside a Parquet file looks like. Parquet writes all the data first and only then writes the FileMetaData at the end of the file. We will deep dive into some of these concepts.
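As a small check of that layout, a Parquet file ends with the serialized footer, followed by a 4-byte footer length and the magic bytes PAR1, so the footer can be confirmed by reading the tail of the file. A sketch, reusing the hypothetical docs.parquet from above:

# Read the last 8 bytes: 4-byte footer length (little endian) + magic 'PAR1'.
with open("docs.parquet", "rb") as f:
    f.seek(-8, 2)
    tail = f.read()

footer_len = int.from_bytes(tail[:4], "little")
print(footer_len)      # size of the FileMetaData thrift structure
print(tail[4:])        # b'PAR1'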
Metadata
We will start with the metadata. There are three kinds of metadata: file metadata (FileMetaData), column (chunk) metadata and page header metadata.
From the above image we can see that FileMetaData holds references to the sub-metadata described by RowGroup, ColumnChunk, ColumnMetaData and SchemaElement. The docs do not share many details about all of these attributes, but we can figure them out from the field names.
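As a rough sketch of how these pieces relate, pyarrow (assumed here, with the hypothetical docs.parquet from earlier) exposes the footer as a FileMetaData object, with the row group and column chunk metadata hanging off it:

import pyarrow.parquet as pq

pf = pq.ParquetFile("docs.parquet")

meta = pf.metadata                       # FileMetaData
print(meta.num_rows, meta.num_row_groups, meta.created_by)

rg = meta.row_group(0)                   # RowGroup metadata
print(rg.num_rows, rg.total_byte_size)

col = rg.column(0)                       # ColumnChunk / ColumnMetaData
print(col.path_in_schema, col.compression, col.encodings, col.statistics)

print(pf.schema_arrow)                   # the schema (SchemaElement list) as an Arrow schema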
DataPageHeader carries the definition and repetition level information, which is used for nested data types; this is discussed below in the Data Pages section.
Nested encoding is achieved with the Dremel algorithm. This is what brought us here, so let's deep dive into it.
Dremel algorithm
The Dremel paper describes how to store and retrieve complex nested data efficiently. It answers two challenges in particular: how to split (stripe) nested records losslessly into columns, and how to reassemble records from those columns.
Sample records
#record 1
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'https://A'
Name
  Url: 'https://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'
#record 2
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'https://C'
Schema of the sample records
# Schema
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}
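As a side note, the same Document schema can be sketched with pyarrow's nested types; the mapping below (repeated becomes list_, optional becomes a nullable field, required becomes non-nullable, group becomes struct) is my own illustration, not something from the paper.

import pyarrow as pa

language = pa.struct([
    pa.field("Code", pa.string(), nullable=False),
    pa.field("Country", pa.string()),
])

name = pa.struct([
    pa.field("Language", pa.list_(language)),
    pa.field("Url", pa.string()),
])

links = pa.struct([
    pa.field("Backward", pa.list_(pa.int64())),
    pa.field("Forward", pa.list_(pa.int64())),
])

document = pa.schema([
    pa.field("DocId", pa.int64(), nullable=False),
    pa.field("Links", links),
    pa.field("Name", pa.list_(name)),
])
print(document)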
Repetition levels
Rules
The repetition level of a value tells us at which repeated field in the value's path the value has repeated; 0 means the value starts a new record.
Assigning repetition levels for record 1
Definition levels
Rules
The definition level of a value tells us how many fields in the value's path that could be undefined (because they are optional or repeated) are actually present.
Assigning the definition levels
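Putting the two together, the striped columns for Name.Language.Code and Name.Language.Country come out as the (value, repetition level, definition level) triples below; the values follow the worked example in the Dremel paper.

# Name.Language.Code: max repetition level 2 (Name and Language are repeated),
# max definition level 2 (Code itself is required).
name_language_code = [
    # record 1
    ("en-us", 0, 2),   # first value in the record, so r = 0
    ("en",    2, 2),   # repeats at the Language level, so r = 2
    (None,    1, 1),   # second Name has no Language; only Name is defined (d = 1)
    ("en-gb", 1, 2),   # a new Name entry, so r = 1
    # record 2
    (None,    0, 1),   # new record (r = 0); Name is present but has no Language
]

# Name.Language.Country: max definition level 3, since Country is optional.
name_language_country = [
    # record 1
    ("us", 0, 3),
    (None, 2, 2),      # Language exists but Country is missing (d = 2)
    (None, 1, 1),
    ("gb", 1, 3),
    # record 2
    (None, 0, 1),
]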
Column striping algorithm
Record assembly
Phew, that's a lot to digest, but quite interesting. Those who want to read the complete paper can refer to the Dremel paper here.
Let's continue with Parquet's other important features.
Data Pages
A data page holds three pieces of information, written back to back after the page header: repetition levels, definition levels and the encoded values.
Encoded values are always required; repetition and definition levels are stored only when the column is nested.
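A quick sketch with pyarrow (assumed here, reusing the hypothetical docs.parquet): each column's maximum repetition and definition levels are derived from the schema, and a flat required column has both at 0, so its pages carry only the encoded values.

import pyarrow.parquet as pq

pf = pq.ParquetFile("docs.parquet")
schema = pf.schema                       # the Parquet (not Arrow) schema
for i in range(pf.metadata.num_columns):
    col = schema.column(i)
    print(col.path, col.max_repetition_level, col.max_definition_level)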
Column chunks
Column chunks are composed of pages written back to back. The pages share a common header format, which readers can use to skip pages they do not need; these headers carry the compression and encoding details.
For example, take dictionary encoding: it builds a dictionary of the values encountered in a column, and that dictionary is stored in a dictionary page per column chunk.
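A hedged sketch of that with pyarrow (the file name and column are made up): write a low-cardinality column with dictionary encoding enabled, then check which encodings the column chunk metadata reports.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"country": ["us", "us", "gb", "us", "gb"] * 1000})
pq.write_table(table, "dict.parquet", use_dictionary=True)

col = pq.ParquetFile("dict.parquet").metadata.row_group(0).column(0)
print(col.encodings)   # typically includes a dictionary encoding such as PLAIN_DICTIONARY or RLE_DICTIONARY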
Bloom filter
The bloom filter is an algorithm in its own right (maybe a topic for a separate article).
We can use the column statistics and dictionaries for predicate pushdown. The statistics hold min and max values, which can be used to skip data whose range does not match the filter. In some cases, such as high-cardinality data, neither of these works well; that is when a bloom filter is used.
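Before moving on to bloom filters, here is a quick sketch (pyarrow assumed, file name hypothetical) of the min/max statistics a reader can use to skip whole row groups during predicate pushdown.

import pyarrow as pa
import pyarrow.parquet as pq

# row_group_size is chosen only so that the file ends up with several row groups.
table = pa.table({"DocId": list(range(100_000))})
pq.write_table(table, "stats.parquet", row_group_size=10_000)

meta = pq.ParquetFile("stats.parquet").metadata
for i in range(meta.num_row_groups):
    s = meta.row_group(i).column(0).statistics
    print(f"row group {i}: min={s.min} max={s.max}")

# A filter like DocId == 55_000 only needs the row group whose [min, max] contains it.
print(pq.read_table("stats.parquet", filters=[("DocId", "==", 55_000)]))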
A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. The answer is either “possibly in the set” or “definitely not in the set”; there are no false negatives when using a bloom filter.
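To illustrate the idea only, here is a toy bloom filter in Python; Parquet's actual implementation uses split-block bloom filters, which are more involved than this sketch.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from a seeded hash of the value.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # True means "possibly in the set", False means "definitely not in the set".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

bf = BloomFilter()
bf.add("en-us")
print(bf.might_contain("en-us"))   # True: possibly in the set
print(bf.might_contain("fr-fr"))   # almost certainly False: definitely not in the set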
Page index
Two structures are added to the row group metadata for this: a ColumnIndex, which holds per-page min/max values so a reader can locate the pages that match a filter, and an OffsetIndex, which holds the file offsets of those pages.
There are other topics that are each big enough to deserve a discussion of their own.
Conclusion
Sometimes we use tools through the abstractions given to us, and at some point of heavy use we end up in a situation where we need to squeeze the most performance out of them. To do that we have to dig deep into the implementation. Happy learning!