Deep dive into Apache Parquet

As data engineers, most of us have heard of the Parquet file format, which is widely used with Apache Spark and other big data tools. At a high level we know that Parquet is efficient for analytical reads and that it uses columnar storage.

While working with nested data, I started wondering how Parquet handles deeply nested structures and stores them for fast retrieval. That curiosity led me to spend a weekend digging into it.

Intro to Parquet

Parquet is a column-oriented data file format designed for efficient storage and retrieval. It also provides high-performance compression and encoding schemes for both simple and complex data types.

Building blocks of a Parquet file

  • Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset

  • Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file

  • Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk

A row group contains exactly one column chunk per column
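
As a quick way to poke at these building blocks, here is a small sketch using pyarrow (the file name and columns are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# write a tiny file so we have something to inspect
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "example.parquet")

# FileMetaData: number of row groups and columns in the file
meta = pq.ParquetFile("example.parquet").metadata
print(meta.num_row_groups, meta.num_columns)

# each row group has exactly one column chunk per column
rg = meta.row_group(0)
for i in range(rg.num_columns):
    chunk = rg.column(i)
    print(chunk.path_in_schema, chunk.num_values, chunk.total_compressed_size)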

Figure 1 - Parquet file layout (from the official Parquet documentation)

Figure 1 shows what the layout looks like inside a Parquet file. Parquet writes all the data first and only then writes the FileMetaData at the end of the file. Let's deep dive into some of these concepts.

Metadata

We will start with the metadata. There are three important kinds of metadata:

  • File metadata
  • Column chunk metadata
  • Page header metadata

Figure 2 - Metadata structures (from the official Parquet documentation)


From the figure we can see that FileMetaData holds references to all the sub-metadata structures: RowGroup, ColumnChunk, ColumnMetaData and SchemaElement. The docs do not describe every attribute in detail, but most of them can be figured out from their names.

DataPageHeader carries the definition and repetition levels used for nested data types; these are discussed below in the Data Pages section.
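
Most of this metadata is easy to inspect from Python. A minimal sketch with pyarrow (file name and column are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"DocId": [10, 20]}), "meta_example.parquet")
meta = pq.ParquetFile("meta_example.parquet").metadata

# FileMetaData level
print(meta.created_by, meta.num_rows, meta.num_row_groups)
print(meta.schema)                      # the SchemaElement tree

# ColumnChunkMetaData / ColumnMetaData level
chunk = meta.row_group(0).column(0)
print(chunk.compression, chunk.encodings, chunk.data_page_offset)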

Nested encoding is achieved with the Dremel algorithm. This is what brought us here, so let's deep dive into it.

Dremel algorithm

The Dremel paper describes how to store and retrieve complex nested data efficiently. It answers the following challenges:

  • Lossless representation of record structure in columnar format
  • Fast encoding
  • Efficient record assembly

Sample records

#record 1
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'https://A'
Name
  Url: 'https://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

#record 2
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'https://C'

Schema of the sample records

# Schema
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}
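
For reference, the same nested shape can be expressed with pyarrow types. This is only a rough equivalent (Arrow maps repeated fields to lists and groups to structs, and the Arrow-to-Parquet conversion adds list wrapper levels), but it shows how such a schema looks in code:

import pyarrow as pa

# repeated -> list, group -> struct, required -> nullable=False
document = pa.schema([
    pa.field("DocId", pa.int64(), nullable=False),
    pa.field("Links", pa.struct([
        pa.field("Backward", pa.list_(pa.int64())),
        pa.field("Forward", pa.list_(pa.int64())),
    ])),
    pa.field("Name", pa.list_(pa.struct([
        pa.field("Language", pa.list_(pa.struct([
            pa.field("Code", pa.string(), nullable=False),
            pa.field("Country", pa.string()),
        ]))),
        pa.field("Url", pa.string()),
    ]))),
])
print(document)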

Repetition levels

  • In the nested data world, values alone cannot convey the structure. If we have two values of a repeated field in the same record, we cannot tell from the values alone which occurred first or at which level the repetition happened
  • Another example is a missing optional field in a record

Rules

  • Non-repeated fields do not require a repetition level
  • The repetition level tells us at which repeated field in the field's path the value has repeated
  • Levels are assigned against the full schema, combining the structure of the schema and the sample records

Assigning repetition levels for record 1

(Figure: repetition levels assigned to record 1)
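
As a walk-through of these rules (my own working, which matches the example in the Dremel paper), take the column Name.Language.Code in record 1. The repeated fields in its path are Name (level 1) and Language (level 2):

value     r   why
'en-us'   0   first Code value in the record
'en'      2   Language repeated (the 2nd repeated field in the path)
NULL      1   the second Name has no Language; Name repeated (the 1st repeated field)
'en-gb'   1   Name repeated again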

Definition levels

Rules

  • Each value of a field p, and especially every NULL, has a definition level specifying how many fields in the path of p that could be undefined (because they are repeated or optional) are actually present in the record

Assigning the definition levels

(Figure: definition levels assigned to record 1)
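
Walking through Name.Language.Country for record 1 (again my own working, following the rule above): the maximum definition level for this path is 3, because Name, Language and Country can all be undefined.

value   d   why
'us'    3   Name, Language and Country are all present
NULL    2   Name and Language present, Country missing
NULL    1   the second Name has no Language at all
'gb'    3   all three present again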

Column striping algorithm

  • Arranging the data can be thought of as a depth-first traversal of a tree
  • Start from the root of the record and walk down until you reach a leaf. Children are either a different field or a different instance of a repeated field
  • When a leaf is reached, write out the value, the corresponding maximum definition level and the current repetition level
  • When an empty field is reached, write a NULL for every leaf below this node, with the definition level of the last defined ancestor
  • When walking back up the tree, the level where the last repetition happened becomes the current repetition level
  • When a field is not defined, its definition level will be smaller than the maximum definition level of that field (we saw this in the walk-through above; a small sketch of this striping loop follows the list)
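
To make the rules concrete, here is a minimal Python sketch that stripes the single column Name.Language.Country out of a record shaped like the samples above. It is hard-coded to this one path and is only an illustration of the rules, not Parquet's actual writer code:

def stripe_country(record):
    """Yield (repetition_level, definition_level, value) triples for the
    column Name.Language.Country of one record. Max definition level is 3."""
    out = []
    names = record.get("Name", [])
    if not names:
        out.append((0, 0, None))            # nothing defined below the root
        return out
    for i, name in enumerate(names):
        r_name = 0 if i == 0 else 1          # Name is the 1st repeated field in the path
        languages = name.get("Language", [])
        if not languages:
            out.append((r_name, 1, None))    # defined down to Name only
            continue
        for j, lang in enumerate(languages):
            r = r_name if j == 0 else 2      # Language is the 2nd repeated field
            if "Country" in lang:
                out.append((r, 3, lang["Country"]))
            else:
                out.append((r, 2, None))     # defined down to Language only
    return out

record1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}], "Url": "https://A"},
        {"Url": "https://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
print(stripe_country(record1))
# [(0, 3, 'us'), (2, 2, None), (1, 1, None), (1, 3, 'gb')]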

Record assembly

  • Let's say we need to reconstruct the record with only one field
  • Re-assembling the record means doing the same tree traversal over again. Pick the column for the field you need and reconstruct the record as follows: when a definition level is lower than the maximum, stop going down at that level; once a leaf has been reached, go back up to the repetition level of the next value and start going down again from there (a matching sketch follows below)
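
And a matching sketch for re-assembling the same Name.Language.Country column from its (repetition, definition, value) triples. Again this is only an illustration for this one path; it ignores the empty-record case to keep things short:

def assemble_country(levels):
    """Rebuild a partial record containing only Name.Language.Country.
    For this path: max definition level is 3; repetition level 0 starts
    a new record, 1 starts a new Name, 2 a new Language in the current Name.
    (The d == 0 case, a record with no Name at all, is ignored for brevity.)"""
    record = {"Name": []}
    for r, d, value in levels:
        if r <= 1:                      # new Name (r == 0 also starts the record)
            record["Name"].append({"Language": []})
        if d >= 2:                      # Language is present
            record["Name"][-1]["Language"].append({})
        if d == 3:                      # Country is present
            record["Name"][-1]["Language"][-1]["Country"] = value
    return record

print(assemble_country([(0, 3, 'us'), (2, 2, None), (1, 1, None), (1, 3, 'gb')]))
# {'Name': [{'Language': [{'Country': 'us'}, {}]},
#           {'Language': []},
#           {'Language': [{'Country': 'gb'}]}]}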

Phew, that's a lot to digest, but quite interesting. Those who want to read the complete paper can refer to the Dremel paper.

Let's continue with Parquet's other important features.

Data Pages

A data page holds three pieces of information:

  • repetition levels
  • definition levels
  • encoded values

Encoded values are always present. Repetition levels are only stored when the column is repeated (nested), and definition levels only when the column can be undefined (optional or nested).

Column chunks

Column chunks are composed of pages written back to back. The pages share a common header format, which readers can use to skip pages they are not interested in. These headers carry the compression and encoding details.

For example, take dictionary encoding: it builds a dictionary of the values encountered in a column, and this dictionary is stored in a dictionary page per column chunk.
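
A small pyarrow sketch showing the dictionary encoding in action (column name and file path are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# a low-cardinality column is a good dictionary-encoding candidate
table = pa.table({"country": ["us", "gb", "us", "us", "gb"] * 1000})
pq.write_table(table, "dict_example.parquet", use_dictionary=True)

chunk = pq.ParquetFile("dict_example.parquet").metadata.row_group(0).column(0)
print(chunk.encodings)   # the dictionary encoding shows up alongside RLE/PLAIN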

Bloom filter

This is a data structure and algorithm in its own right (maybe a topic for a different article).

We can use column statistics and dictionaries for predicate pushdown. Statistics store min and max values, which can be used to skip data outside the requested range. In some cases, such as high-cardinality columns, neither of these works well; that is where bloom filters are used.

A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. Its answer is either "possibly in the set" or "definitely not in the set"; there are no false negatives.
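
A toy sketch of the idea in Python (this is not Parquet's actual split-block bloom filter; the sizes and hash choices here are arbitrary):

import hashlib

class ToyBloomFilter:
    """A toy bloom filter: k hash functions set k bits per added element."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False -> definitely not in the set; True -> possibly in the set
        return all(self.bits[pos] for pos in self._positions(value))

bf = ToyBloomFilter()
bf.add("doc-10")
print(bf.might_contain("doc-10"))   # True  (possibly in the set)
print(bf.might_contain("doc-99"))   # almost certainly False (definitely not in the set)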

Page index

There are two structures added to the row group metadata (a small illustrative sketch follows the list):

  • ColumnIndex - lets the reader navigate the pages of a column using column values; it can locate the pages that contain the data requested by a scan predicate
  • OffsetIndex - allows navigation by row index and is used to retrieve the values for rows identified as matches via the ColumnIndex. Once rows of a column are skipped, the corresponding rows in the other columns have to be skipped too; hence the OffsetIndexes for all columns in a RowGroup are stored together
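
A conceptual sketch of how a reader could use per-page min/max information from a ColumnIndex to skip pages for a predicate. This is purely illustrative; the page list and values below are made up and this is not a real Parquet API:

# each entry is (page_number, min_value, max_value), as a ColumnIndex would expose it
column_index = [(0, 1, 100), (1, 101, 200), (2, 201, 300)]

def pages_for_predicate(column_index, lo, hi):
    """Return the pages whose [min, max] range overlaps the predicate range [lo, hi]."""
    return [page for page, mn, mx in column_index if mx >= lo and mn <= hi]

print(pages_for_predicate(column_index, 150, 160))   # [1] -> only page 1 needs to be read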

Some other topics, each of which is a huge topic in its own right:

  • Compression
  • Encryption

Conclusion

Often we use tools through the abstractions given to us, but after using a tool extensively we end up in situations where we need to squeeze out its full performance. To do that, we have to dig deep into the implementation. Happy learning!
