File Formats in Big Data World - Part 1
Ankur Ranjan
Building DatosYard | YouTube (100k) - The Big Data Show | Software Engineer by heart, Data Engineer by mind
One of the most fundamental decisions in the Data Engineering world is choosing the proper file formats for the different zones of a Big Data pipeline. The right choice helps the team fetch data faster and lowers the cost of the project. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workload, particularly in a Big Data pipeline.
Now the question is: why do we need different file formats? This mainly comes down to three reasons, which are as follows.
With the increasing volume of data, managing cost becomes a crucial task for a Data Engineer. Over time, many file formats have evolved in the big data world, but in my view most of them have tried to follow some basic design principles, which are as follows.
Achieving all of the above in a single file format is almost impossible. Some formats are built to make reads faster, while others are built to support write-heavy jobs. Some support schema evolution very well, some support it partially, and some don't support it at all.
But the good thing about Data Engineering is that, most of the time, we don't need all of the above. A data pipeline consists of different zones, such as the landing zone, raw zone, cleaned zone, and curated/consumption zone, and each zone demands different kinds of support.
Let's try to understand it by looking at a typical Data Pipeline example illustration.
Let's try to understand the different zones first.
So, by looking through the different zones, we have understood that out of those five supports, only a few are needed to do our work. We should also understand that some file formats are designed for very specific use cases, while others are more general purpose.
In the Big Data world, generally, files are divided into two categories.
Let's try to understand the basic difference between these two with a simple illustration.
The above illustration shows a high-level picture of storage at the disk level. One can see that in a column-oriented format the data of a column stays together, which makes it very easy to fetch a few columns out of many.
Let me try to make it simpler.
Row-oriented storage is suitable for situations where the entire row of data needs to be processed together, whereas a column-oriented format makes it possible to skip unneeded columns when reading data and suits situations where only a few columns out of many are read. When querying columnar storage, you can skip over the non-relevant data very quickly. As a result, aggregation queries are less time-consuming compared to row-oriented storage.
Most column-oriented file formats are good for reading, whereas row-oriented formats support writes better. Compression also tends to be a little better for column-oriented formats, which makes them more suitable for analytical and storage purposes.
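To make this concrete, here is a tiny Python sketch (purely illustrative, with made-up data) of how the same three rows would sit on disk in a row-oriented versus a column-oriented layout, and why an aggregation over one column touches much less data in the columnar case.

# A tiny table: three rows with columns id, name and amount.
rows = [
    (1, "apple", 10.5),
    (2, "orange", 7.25),
    (3, "banana", 3.0),
]

# Row-oriented layout: the values of one row sit next to each other.
row_layout = [value for row in rows for value in row]
# -> [1, 'apple', 10.5, 2, 'orange', 7.25, 3, 'banana', 3.0]

# Column-oriented layout: the values of one column sit next to each other.
column_layout = {
    "id":     [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

# Summing a single column only needs one contiguous block in the columnar
# layout, while the row layout forces us to walk through every row.
print(sum(column_layout["amount"]))  # 20.75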
Now let's go deeper and try to understand the basic features of Parquet, ORC and Avro. This will help us understand the concepts better, and then we will be able to visualize the differences between all these file formats.
Apache Parquet
Let's start with a very general-purpose column-oriented file format.
Now the first question that comes to mind is: what is Apache Parquet?
Let me make it simple for you with a short intro, and then we will dive deep into its internals.
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
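Before going into the internals, here is a minimal, hedged example of writing and reading a Parquet file (assuming pandas with the pyarrow engine is installed; the file name sales.parquet and the columns are made up for illustration).

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "fruit": ["apple", "orange", "apple"],
    "amount": [10.5, 7.25, 3.0],
})

# Write a Parquet file (pandas delegates to pyarrow here).
df.to_parquet("sales.parquet", engine="pyarrow")

# Read back only the columns we need -- the columnar layout makes this cheap.
subset = pd.read_parquet("sales.parquet", columns=["fruit", "amount"])
print(subset)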
Some of the characteristics of this file format are as follows.
Now let's dive deeper into the specifics of the Parquet format: its representation on disk, the physical data organization (row groups, column chunks and pages) and its encoding schemes.
Let me share one secret with you.
Parquet is referred to as a columnar format in many books, but internally it is a hybrid format, i.e. a combination of row and columnar layouts.
No way, what are you saying, Ankur? Are you kidding?
I am not kidding. It is actually a hybrid file format. Even ORC follows this layout.
Instead of wondering and panicking, let's see what a hybrid file format means.
In a hybrid columnar layout, columns still follow each other, but each column is split into chunks. A hybrid approach enables both horizontal and vertical partitioning and hence fast filtering and scanning of data. It keeps the advantages of homogeneous data and the ability to apply specific encoding and compression techniques efficiently.
Here A<Int>, B<Int> and C<Int> are three different columns.
This is actually what Parquet does:
Let's now look at one illustration of Apache Parquet data organization.
As I have said, it has both horizontal and vertical partitioning. The horizontal partitions are called row groups; a single Parquet file holds multiple row groups, with a default size of 128 MB. Inside each row group the data is partitioned vertically, and the values of a column lie together in a columnar fashion. This is called a column chunk. Within the column chunk, the actual data is stored in data pages.
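A small sketch using pyarrow can make this structure visible (the file name orders.parquet and the data are hypothetical; I force a small row-group size here only so that several row groups appear).

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data; small row groups so the structure is easy to see.
table = pa.table({
    "id": list(range(10_000)),
    "amount": [float(i) for i in range(10_000)],
})
pq.write_table(table, "orders.parquet", row_group_size=2_500)

meta = pq.ParquetFile("orders.parquet").metadata
print(meta.num_row_groups)   # 4 row groups (horizontal partitions)
print(meta.num_columns)      # 2 column chunks per row group (vertical partitions)
print(meta.row_group(0).column(1).total_compressed_size)  # bytes in one column chunk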
Data pages carry metadata such as min, max and count. This allows Apache Parquet to provide predicate pushdown support, which means it scans only those pages whose metadata (for example, the min and max values) overlaps with the filters applied while querying the data. One thing to note here is that metadata is also stored at the row-group level, in the file footer.
So when a filter is applied, the reader first looks at the header, which contains the magic number PAR1 identifying the file as Parquet, and then goes to the footer, where the metadata lives. Using that metadata, it identifies which row groups it has to look into. Within a row group, it uses the metadata of the column chunks and data pages to scan only the required pages.
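Continuing with the same hypothetical orders.parquet file from the sketch above, here is how the footer statistics and predicate pushdown look from pyarrow (the filter value is arbitrary).

import pyarrow.parquet as pq

meta = pq.ParquetFile("orders.parquet").metadata

# Footer statistics of the 'amount' column in the first row group.
stats = meta.row_group(0).column(1).statistics
print(stats.min, stats.max)   # the min/max range used for predicate pushdown

# Row groups whose min-max range cannot satisfy the filter are skipped
# instead of being decoded.
filtered = pq.read_table("orders.parquet", filters=[("amount", ">", 9_000.0)])
print(filtered.num_rows)      # 999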
Isn't this methodology brilliant?
Let's look into how Parquet stores data.
When writing a Parquet file, most implementations will use dictionary encoding to compress a column until the dictionary itself reaches a certain size threshold, usually around 1 megabyte. At this point, the column writer will "fall back" to PLAIN encoding, where values are written end-to-end in "data pages" and then usually compressed with Snappy or Gzip. See the following rough diagram:
These concepts are a little hard to understand, but let me try to explain them in simpler terms.
Let's look at one example of dictionary encoding in Parquet files.
Let's suppose our column has the following data, which contains many duplicate values.
['apple', 'orange', 'apple', NULL, 'orange', 'orange']
Parquet will store this in dictionary-encoded form:
dictionary: ['apple', 'orange']
indices: [0, 1, 0, NULL, 1, 1]
One can see that this always saves storage and lets us fit more data in the same space.
Parquet also uses bit packing, which means it uses an optimized number of bits to store the data. For example, let's suppose our column has data like
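If you want to verify this yourself, a small pyarrow sketch (throwaway file name; the exact encoding names may differ between pyarrow and Parquet versions) shows that the writer really does use a dictionary encoding for such a column.

import pyarrow as pa
import pyarrow.parquet as pq

# A column with many repeated values -- a good fit for dictionary encoding.
table = pa.table({"fruit": ["apple", "orange", "apple", None, "orange", "orange"]})
pq.write_table(table, "fruits.parquet")

# The column chunk metadata lists the encodings the writer actually used.
meta = pq.ParquetFile("fruits.parquet").metadata
print(meta.row_group(0).column(0).encodings)
# Typically includes a dictionary encoding such as 'PLAIN_DICTIONARY'
# or 'RLE_DICTIONARY', depending on the version.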
[0, 2, 6, 7, 9, 0, 1]
So internally Parquet will store these values in just a few bits each instead of full-size integers, which saves storage space. The writer works out the right bit width automatically.
Apache Parquet uses Snappy compression by default when used with computing engines like Apache Spark. This ensures the right balance between compression ratio and speed.
I think we now have enough reasons why Apache Parquet is one of the best file formats in the analytical world, where we apply filters and aggregations and select a few columns out of many.
I hope you have enjoyed this article and learned something from it. Let's connect in the next article and learn more about the ORC and Avro file formats.
You can also connect with me on my YouTube channel, The Big Data Show, where I try to help aspiring Data Engineers through tutorials, podcasts, mock interviews, etc.
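Here is a toy Python sketch of the bit-packing idea, using the same values as above. This is purely illustrative and not Parquet's actual implementation, which layers run-length encoding and other details on top.

values = [0, 2, 6, 7, 9, 0, 1]

# Smallest number of bits that can represent the largest value (9 -> 4 bits).
bit_width = max(values).bit_length()

# Pack all values into a single integer, bit_width bits per value.
packed = 0
for i, v in enumerate(values):
    packed |= v << (i * bit_width)

print(bit_width)              # 4
print(packed.bit_length())    # far fewer bits than 7 * 32 = 224 as plain 32-bit ints

# Unpack to verify the round trip.
mask = (1 << bit_width) - 1
unpacked = [(packed >> (i * bit_width)) & mask for i in range(len(values))]
assert unpacked == values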
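For completeness, here is a hedged sketch of choosing the codec explicitly, either through pyarrow's compression argument or through Spark's spark.sql.parquet.compression.codec setting (paths and data are placeholders).

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"order_id": [1, 2, 3], "amount": [10.5, 7.25, 3.0]})

# Snappy: a good balance between compression ratio and speed.
pq.write_table(table, "sales_snappy.parquet", compression="snappy")

# Gzip: usually smaller files, but slower to write and read.
pq.write_table(table, "sales_gzip.parquet", compression="gzip")

# In PySpark the codec is a session-level config, e.g.:
# spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
# df.write.parquet("s3://bucket/path/")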