Demystifying different compression codecs in big data
When working with big data file formats like Parquet, ORC, and Avro, you will often come across different compression codecs such as Snappy, LZO, Gzip, and Bzip2. In this article, we will try to understand some of these compression codecs and discuss the fundamental differences between them.
Before starting anything, let's try to understand the benefits of compressing big data files. File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to and from disk.
When dealing with large volumes of data, both of these savings can be significant. Compression and decompression do come with a cost in terms of the time they take, but when we compare that cost with the I/O gain, the additional time is usually negligible.
When dealing with any big data or distributed system, it is very important to select the right compression codec. The first question that comes to mind is: what does "codec" actually mean?
Codec is short for compressor/decompressor. It refers to software, hardware, or a combination of the two that applies a compression/decompression algorithm to data. A codec has two components: an encoder to compress the files and a decoder to decompress them.
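To make the encoder/decoder split concrete, here is a minimal sketch using the gzip and bzip2 codecs that ship with Python's standard library (Snappy and LZO would need third-party packages, so they are not shown here); the sample payload is made up:

# Minimal sketch of a codec's two halves: an encoder (compress) and a
# decoder (decompress), using Python's built-in gzip and bz2 modules.
import bz2
import gzip

data = b"some repetitive big data payload " * 1000

# Encoder: turn raw bytes into compressed bytes.
gz_bytes = gzip.compress(data)
bz_bytes = bz2.compress(data)

# Decoder: restore the original bytes exactly.
assert gzip.decompress(gz_bytes) == data
assert bz2.decompress(bz_bytes) == data

print(f"original={len(data)} bytes, gzip={len(gz_bytes)}, bzip2={len(bz_bytes)}")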
The type of codec we can use depends on the data and the file type we are trying to compress. It also depends on whether we need the compressed file to be splittable: splittable files can be processed in parallel by different processors.
Which compression format should we use?
Which compression format we should use depends on our application. The two most important factors that determine our choice are the compression ratio (how much storage we save) and the speed of compression and decompression.
Some compression codecs are optimized more for storage while others are optimized more for speed. Do we want to maximize the speed of the application, or are we more concerned about keeping storage costs down? There is essentially a trade-off between storage and speed: if we want a higher compression ratio, we have to spend more time compressing, whereas if we want better speed, we generally accept a lower compression ratio.
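As a rough illustration of this trade-off, the sketch below times gzip at its fastest and strongest settings against bzip2, using only Python's standard library (Snappy and LZO would need third-party packages such as python-snappy, so they are left out). The data is synthetic and the exact numbers will vary, but the faster setting consistently produces larger output:

# Rough sketch of the speed-vs-ratio trade-off using only the standard library.
import bz2
import gzip
import time

# Synthetic, highly repetitive "log" data; real results depend on your data.
data = b"timestamp,user_id,event,payload\n" * 500_000

candidates = [
    ("gzip level 1 (fastest)", lambda d: gzip.compress(d, compresslevel=1)),
    ("gzip level 9 (smallest)", lambda d: gzip.compress(d, compresslevel=9)),
    ("bzip2", bz2.compress),
]

for name, compress in candidates:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: ratio={len(data) / len(out):.1f}x, time={elapsed:.3f}s")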
In general, one should try different strategies for the application and benchmark them with representative datasets to find the best approach. There are a few more parameters that matter when choosing a compression codec.
For large files, one should not use a compression format that does not support splitting of the whole file, since losing data locality can make MapReduce applications, or any distributed computing engine like Spark, very inefficient.
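To see why splittability matters in practice, here is a hedged PySpark sketch (the file paths are hypothetical and the files are assumed to span several HDFS blocks). Hadoop's input formats cannot split a gzip-compressed text file, so Spark reads it as a single partition no matter how large it is, while the same data compressed with bzip2 can be split across several partitions and processed in parallel:

# Sketch: how many partitions Spark creates for compressed text files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittability-check").getOrCreate()
sc = spark.sparkContext

gz_rdd = sc.textFile("/data/big_logs.txt.gz")   # gzip: not splittable
bz_rdd = sc.textFile("/data/big_logs.txt.bz2")  # bzip2: splittable

# Expect a single partition for the .gz file, several for the .bz2 file.
print("gzip partitions:", gz_rdd.getNumPartitions())
print("bzip2 partitions:", bz_rdd.getNumPartitions())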
Different compression codecs in big data files.
There are many compression codecs available for big data files. Our goal is not to understand every codec available, but to understand the basic thinking behind choosing between the available options. So let's discuss some of the following codecs that are useful for compressing big data.
I have created an illustration for these. Let's look at the picture below once, and then we will discuss each of these compression codecs individually.
Snappy
LZO
Gzip
Bzip2
To conclude all of the above points, we can divide these compression codecs into two broad categories. If we are working with a big data system that holds cold data, meaning data that is not accessed very often, and we want to save on storage cost, then we can opt for a codec like Gzip, which is suited more for storage than for speed. Whereas, if we are dealing with hot data, such as a real-time or batch big data pipeline where the data is accessed often, then we should opt for a codec like Snappy, which gives a better balance between speed and storage.
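As a concrete, hedged sketch of this conclusion, here is how one might pick the Parquet compression codec when writing from Spark; the paths and the source DataFrame are hypothetical placeholders:

# Sketch: choosing a Parquet compression codec in Spark based on access pattern.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-choice").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical source dataset

# Hot data, read frequently: favour speed with Snappy (Spark's default for Parquet).
df.write.option("compression", "snappy").parquet("/warehouse/events_hot")

# Cold data, rarely accessed: favour storage savings with Gzip.
df.write.option("compression", "gzip").parquet("/warehouse/events_cold")

The same choice can also be made cluster-wide through the spark.sql.parquet.compression.codec configuration instead of per write.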
I have also recorded a YouTube video covering the above content. One can follow this link.