
All You Need To Know About Parquet File Structure In Depth

Rohan Karanjawala

Engagement Management and Technical Delivery | Big Data & Cloud at Capgemini | Data and Insights | Financial Services

Published: January 7, 2020

In my previous article, I explained the ORC file structure. It received a huge response, and that pushed me to write a new article on the Parquet file format.

In this article, I will explain the Parquet file structure. I hope that after reading it you will understand the Parquet file format and how data is stored in it.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to the other columnar storage file formats available in Hadoop, namely RCFile and ORC.

The Parquet file format consists of 2 parts –

1. Data

2. Metadata.

Data is written first in the file and the metadata is written at the end to allow for single-pass writing. Let's look at the Parquet file format first and then at the metadata.

File Format -

A sample parquet file format is as below -

At a high level, a Parquet file consists of a header, one or more blocks, and a footer.

The Parquet file format contains a 4-byte magic number (PAR1) in the header and again at the end of the footer. This magic number indicates that the file is in Parquet format. All the file metadata is stored in the footer section. Later in this article, I'll explain the advantage of having the metadata in the footer section.
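As a minimal sketch of that check, the snippet below reads the first and last four bytes of a file and compares them with PAR1. The function name `has_parquet_magic`, the helper `_write_temp`, and the demo bytes are hypothetical and only for illustration; they are not part of any Parquet library, and the demo files are not real Parquet files.

```python
import os
import tempfile

MAGIC = b"PAR1"  # the 4-byte magic number described above


def has_parquet_magic(path):
    """Return True if the file starts and ends with the PAR1 magic number."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        if f.tell() < 8:  # too short to hold both magic numbers
            return False
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == MAGIC and tail == MAGIC


def _write_temp(payload):
    """Hypothetical helper: write bytes to a temp file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return path


# Demo with stand-in bytes (not real Parquet content).
good_path = _write_temp(MAGIC + b"stand-in body" + MAGIC)
bad_path = _write_temp(b"not a parquet file")
good_result = has_parquet_magic(good_path)
bad_result = has_parquet_magic(bad_path)
os.remove(good_path)
os.remove(bad_path)
print(good_result, bad_result)
```

A matching magic number only suggests that the file is Parquet; a real reader would still go on to parse the footer metadata.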

Blocks in the parquet file are written in the form of a nested structure as below -

-Blocks

--Row Groups

---Column Chunks

----Pages

Each block in the Parquet file is stored in the form of row groups, so the data in a Parquet file is partitioned into multiple row groups. Each row group in turn consists of one or more column chunks, each of which corresponds to a column in the dataset. The data for each column chunk is then written in the form of pages. Each page contains values for a particular column only, so pages are very good candidates for compression because they contain similar values.
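The nesting described above can be sketched as a toy in-memory model. This is only an illustration of the hierarchy, not how a Parquet writer is implemented: the class names, the function `build_row_groups`, and the parameters `rows_per_group` and `values_per_page` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Page:
    values: List[Any]  # values of one column only


@dataclass
class ColumnChunk:
    column: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class RowGroup:
    num_rows: int
    chunks: Dict[str, ColumnChunk] = field(default_factory=dict)


def build_row_groups(rows, rows_per_group, values_per_page):
    """Partition rows into row groups, then each column into a chunk of pages."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        group = RowGroup(num_rows=len(batch))
        for column in batch[0]:
            # Gather this column's values for the rows in this group.
            values = [row[column] for row in batch]
            pages = [
                Page(values[i:i + values_per_page])
                for i in range(0, len(values), values_per_page)
            ]
            group.chunks[column] = ColumnChunk(column, pages)
        groups.append(group)
    return groups


rows = [{"id": i, "country": "IN" if i % 2 else "US"} for i in range(5)]
groups = build_row_groups(rows, rows_per_group=3, values_per_page=2)
for g in groups:
    for chunk in g.chunks.values():
        print(g.num_rows, chunk.column, [p.values for p in chunk.pages])
```

In this model every page holds values of a single column, which is the property that makes pages compress well.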

As we have seen above, the file metadata is stored in the footer. The footer's metadata includes the version of the format, the schema, any extra key-value pairs, and metadata for the columns in the file. The column metadata includes the type, path, encoding, number of values, compressed size, etc.

Apart from the file metadata, the footer also has a 4-byte field encoding the length of the footer metadata, and a 4-byte magic number (PAR1).
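To make the footer layout concrete, here is a toy sketch that writes a header, some blocks, and then a footer in a single forward pass. The function `write_toy_file` and the JSON metadata are assumptions for illustration: real Parquet serializes its file metadata with Thrift, not JSON, and the block contents here are stand-in bytes. The 4-byte length is written little-endian, as in the Parquet format.

```python
import io
import json
import struct

MAGIC = b"PAR1"


def write_toy_file(blocks):
    """Write header, blocks, then footer in one forward pass (toy layout)."""
    out = io.BytesIO()
    out.write(MAGIC)  # header magic number
    boundaries = []
    for block in blocks:
        # Record where each block starts while writing it.
        boundaries.append({"offset": out.tell(), "length": len(block)})
        out.write(block)  # data is written before any metadata
    # Stand-in metadata: JSON here, whereas real Parquet uses Thrift.
    metadata = json.dumps({"version": 1, "blocks": boundaries}).encode("utf-8")
    out.write(metadata)
    out.write(struct.pack("<I", len(metadata)))  # 4-byte little-endian length
    out.write(MAGIC)  # trailing magic number
    return out.getvalue()


data = write_toy_file([b"block-one", b"block-two!"])
footer_len = struct.unpack("<I", data[-8:-4])[0]
footer = json.loads(data[-8 - footer_len:-8])
print(footer_len, footer["blocks"])
```

Because the offsets are only known once the blocks have been written, placing the metadata at the end means the writer never has to go back and patch earlier bytes.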

In Parquet files, metadata is written after the data has been written, to allow for single-pass writing. Since the metadata is stored in the footer, reading a Parquet file starts with an initial seek to 8 bytes before the end of the file to read the footer metadata length, followed by a backward seek to read the footer metadata itself.
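The two seeks can be sketched as follows. This is a minimal illustration on an in-memory stand-in file: the function name `read_footer_metadata` is hypothetical, and the "metadata" here is placeholder bytes, whereas a real reader would decode the Thrift-encoded file metadata it finds at that position.

```python
import io
import os
import struct

MAGIC = b"PAR1"


def read_footer_metadata(f):
    """Return the raw footer metadata bytes from a seekable binary file."""
    f.seek(-8, os.SEEK_END)  # initial seek: 4-byte length + 4-byte magic
    tail = f.read(8)
    (footer_len,) = struct.unpack("<I", tail[:4])  # little-endian length
    if tail[4:] != MAGIC:
        raise ValueError("trailing magic number not found")
    f.seek(-(8 + footer_len), os.SEEK_END)  # backward seek to metadata start
    return f.read(footer_len)


# Stand-in file: header magic, fake data, fake metadata, length, magic.
fake_metadata = b"schema+column-metadata"
stream = io.BytesIO(
    MAGIC + b"row group bytes" + fake_metadata
    + struct.pack("<I", len(fake_metadata)) + MAGIC
)
metadata = read_footer_metadata(stream)
print(metadata)
```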

In other file formats like SequenceFile and Avro, metadata is stored in the header and sync markers are used to separate blocks, whereas in Parquet, block boundaries are stored directly in the footer metadata. This is possible because the metadata is written after all the blocks have been written. Therefore, Parquet files are splittable: the block boundaries can be read from the footer metadata, and blocks can be easily located and processed in parallel.

I hope this provides a good overview of the Parquet file structure.
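As a rough sketch of why recorded boundaries make a file splittable, the example below hands each (offset, length) pair to a separate worker. The boundaries list, the function `process_split`, and the block contents are hypothetical stand-ins; in a real engine the boundaries would come from the footer metadata and the work would be spread across tasks on a cluster rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in file body: three blocks laid back to back after a 4-byte header.
blocks = [b"aaaa", b"bbbbbb", b"cc"]
file_bytes = b"PAR1" + b"".join(blocks)

# Hypothetical boundaries as a footer might record them: (offset, length).
boundaries = []
offset = 4
for block in blocks:
    boundaries.append((offset, len(block)))
    offset += len(block)


def process_split(boundary):
    """Read one block by its recorded boundary and return a toy result."""
    start, length = boundary
    chunk = file_bytes[start:start + length]
    return len(chunk), chunk.decode("ascii")


# Each worker jumps straight to its block; no scanning for sync markers.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_split, boundaries))
print(results)
```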

About Me:

(I work with Ellicium solutions pvt ltd as an Architect and technical manager looking after projects in big data and analytics and helping clients to stay ahead in the competition and more importantly to serve their customers well)

T S Kishorkrishna

Senior Data Engineer at Publicis Sapient | Databricks Certified - Data Engineer Professional

5 months ago

Should each and every row group contain all columns?

Akash Mittal

3 years ago

How can we define that we want to store 100 rows in one part file?

Vishal Gupta

Senior Data Engineer || Google Cloud Certified Professional || Driven By Personalized Customer Experience

3 years ago

Thanks for sharing.

Sudipta Saha

Lead Software Engineer @ Optum | Data Engineering

3 years ago

I notice that Parquet files internally have multiple part files like Part-0000/Part-0001 etc. Are these row groups?

Kamlesh Patil

Associate Director at UBS

5 years ago

Very well written.

