Demystifying different compression codecs in big data

When working with big data file formats like Parquet, ORC, and Avro, you will mostly come across different compression codecs like Snappy, LZO, gzip, and bzip2. In this article, we will try to understand some of these compression codecs and discuss the fundamental differences between them.

Before starting anything, let's try to understand the benefits of compressing big data files. File compression brings two major benefits:

  • It reduces the space needed to store files.
  • It speeds up data transfer across the network, or to or from disk. Hence, it reduces the I/O cost.

When dealing with large volumes of data, both of these savings can be significant. Compression and decompression do come with some cost in terms of the time taken to compress and decompress the data, but when we compare this cost with the I/O gain, the additional time is usually negligible.
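
To make the storage saving concrete, here is a minimal Python sketch (not from the original article) that compresses a block of repetitive text with gzip and compares the sizes; the sample data is an illustrative assumption.

```python
import gzip

# Illustrative sample: repetitive data compresses very well,
# which is typical of log lines and columnar data.
raw = ("2023-01-01 INFO request served in 12ms\n" * 100_000).encode("utf-8")

compressed = gzip.compress(raw)  # the encoder side of the codec

print(f"raw size:          {len(raw):>12,} bytes")
print(f"compressed size:   {len(compressed):>12,} bytes")
print(f"compression ratio: {len(raw) / len(compressed):.1f}x")
```

Smaller files mean less data to store and less data to move over the network or read from disk, which is exactly where the I/O gain comes from.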

When dealing with any big data or distributed system, it is very important to select the right compression codec. But the first question that comes to mind is: what does "codec" mean?

Codec is short for compressor/decompressor. It refers to software, hardware, or a combination of the two that applies compression/decompression algorithms to data. A codec has two components: an encoder to compress the files and a decoder to decompress them.
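
As a small, hedged illustration of the encoder/decoder pair, Python's built-in zlib module exposes the two components as separate streaming objects; other codec libraries follow the same shape. The message below is made up for the example.

```python
import zlib

message = b"the same bytes go in one side and come out the other " * 1000

# Encoder: turns raw bytes into compressed bytes.
encoder = zlib.compressobj(6)
compressed = encoder.compress(message) + encoder.flush()

# Decoder: turns compressed bytes back into the original bytes.
decoder = zlib.decompressobj()
restored = decoder.decompress(compressed) + decoder.flush()

assert restored == message
print(f"{len(message)} bytes -> {len(compressed)} bytes and back")
```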

The type of codec that we can use depends on the data and file type we are trying to compress. It also depends on whether we need our compressed file to be splittable. Splittable files can be processed in parallel by different processors.

Which compression format should we use?

Which compression format we should use depends on our application. Two factors largely determine the choice:

  1. Storage
  2. Speed

Some compression codecs are more optimized for storage, while others are more optimized for speed. Do we want to maximize the speed of the application, or are we more concerned about keeping storage costs down? There is a trade-off between the two: if we want a higher compression ratio, we have to spend more time compressing, whereas if we want better speed, we generally accept a lower compression ratio.

In general, one should try different strategies for the application and benchmark them with representative datasets to find the best approach. Two more parameters are helpful while choosing a compression codec:

  • Compression ratio - How much is the data compressed, i.e. how good is the compression itself from source to destination?
  • Throughput, compression speed, decompression speed - How quickly can the algorithm compress and decompress the data? Throughput is mostly measured in MB/s (see the benchmarking sketch after this list).
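
Here is a minimal benchmarking sketch along those lines, using Python's standard gzip and bz2 modules, plus the third-party python-snappy package if it happens to be installed; the payload is an assumption for illustration, and real benchmarks should use your own representative data.

```python
import bz2
import gzip
import time

# Illustrative payload: repetitive CSV-like rows.
payload = ("user_id,event,timestamp\n"
           + "42,click,2023-01-01T00:00:00\n" * 200_000).encode("utf-8")

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

try:
    import snappy  # optional: python-snappy
    codecs["snappy"] = (snappy.compress, snappy.decompress)
except ImportError:
    pass

for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    blob = compress(payload)
    c_time = time.perf_counter() - start

    start = time.perf_counter()
    decompress(blob)
    d_time = time.perf_counter() - start

    ratio = len(payload) / len(blob)
    print(f"{name:>6}: ratio {ratio:5.1f}x, "
          f"compress {len(payload) / c_time / 1e6:6.1f} MB/s, "
          f"decompress {len(payload) / d_time / 1e6:6.1f} MB/s")
```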

For large files, one should not use a compression format that does not support splitting of the whole file, since that loses data locality and makes MapReduce applications, or any distributed computing engine like Spark, very inefficient.
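
To see why this matters, here is a hedged PySpark sketch (the file paths are hypothetical): a large gzip-compressed text file cannot be split, so Spark typically has to read it in a single partition, while a bzip2-compressed copy of the same data can be split across many partitions and read in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittability-demo").getOrCreate()

# Hypothetical paths to the same large dataset stored two ways.
gz_partitions = spark.read.text("/data/events/big.log.gz").rdd.getNumPartitions()
bz2_partitions = spark.read.text("/data/events/big.log.bz2").rdd.getNumPartitions()

# gzip is not splittable, so the whole file usually lands in one partition;
# bzip2 is splittable, so Spark can fan the read out across many tasks.
print(f"gzip partitions:  {gz_partitions}")
print(f"bzip2 partitions: {bz2_partitions}")
```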

Different compression codecs in big data files

There are many compression codecs available for big data files. Our goal is not to understand every codec available, but to understand the basic thinking behind choosing between the different options. So let's discuss the following codecs that are useful for compressing big data:

  • Snappy
  • LZO
  • Gzip
  • Bzip2

I have created an illustration for these. Let's look at the picture below once, and then we will discuss each of these compression codecs individually.


Snappy

  • Snappy is a very fast compression codec; however, its compression ratio is not very good.
  • In most projects and distributed systems, it is the default choice of compression codec because it gives a good balance between compression and speed.
  • It is more optimized for speed than for storage.
  • Snappy is not inherently splittable, but it is mostly used with container-based file formats like Parquet, Avro, and ORC, which take care of splittability themselves (see the sketch below).
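
As a minimal sketch of that common pairing, the following hedged PySpark snippet writes a Parquet file with Snappy compression; the DataFrame contents and output path are made up for illustration. Splitting is then handled at the Parquet row-group level rather than by the codec itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-parquet-demo").getOrCreate()

# Toy DataFrame just for the example.
df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

# Snappy is the usual default for Parquet; we state it explicitly for clarity.
(df.write
   .option("compression", "snappy")
   .mode("overwrite")
   .parquet("/tmp/users_snappy.parquet"))
```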

LZO

  • LZO is also optimized for speed like Snappy, but unlike Snappy, it is inherently splittable.
  • It is also more optimized for speed than for storage.

Gzip

  • It is more optimized for storage.
  • In terms of processing speed, it is slow.
  • Gzip is also not inherently splittable, so one should use it with container-based file formats like Parquet or ORC.
  • It uses the DEFLATE algorithm for compression (see the sketch below).
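
As a quick, hedged illustration of that last point, Python's gzip module is a thin wrapper around the same DEFLATE implementation exposed by zlib, and its compresslevel knob shows the storage-versus-speed trade-off directly; the sample data below is an assumption.

```python
import gzip

# Illustrative CSV-like payload.
data = ("col_a,col_b,col_c\n" + "1,foo,2023-01-01\n" * 500_000).encode("utf-8")

fast = gzip.compress(data, compresslevel=1)   # favors speed
small = gzip.compress(data, compresslevel=9)  # favors storage

print(f"level 1: {len(fast):,} bytes")
print(f"level 9: {len(small):,} bytes")
```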

Bzip2

  • It is very much optimized for storage, but it is slow in terms of speed.
  • It is inherently splittable (see the sketch below).
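
Because bzip2 is splittable on its own, it is one of the few codecs that is reasonable for plain text output that downstream jobs will re-read in parallel. Below is a hedged PySpark sketch that writes bzip2-compressed text output using Hadoop's built-in BZip2Codec; the RDD contents and output path are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bzip2-text-demo").getOrCreate()
sc = spark.sparkContext

# Toy data just for the example.
lines = sc.parallelize([f"event-{i}" for i in range(100_000)])

# Hadoop's built-in bzip2 codec; the resulting .bz2 part files stay splittable.
lines.saveAsTextFile(
    "/tmp/events_bz2",
    compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec",
)
```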

To conclude all of the above points, we can divide these compression codecs into two broad categories. If we are working with a big data system that holds cold data, meaning data that is not accessed very often, and we want to save storage cost, then we can opt for a codec like Gzip, which is more suited for storage than for speed. Whereas if we are dealing with hot data, like a real-time or batch big data pipeline where we access the data often, then we should opt for a codec like Snappy, which gives a better balance between speed and storage.
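
As a final hedged sketch, here is one way this choice often shows up in practice with Spark and Parquet: setting the session-level codec depending on whether the table is treated as hot or cold. The config key shown is a standard Spark SQL setting, but the table paths and the hot/cold flag are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-choice-demo").getOrCreate()

def write_events(df, path, is_cold_storage):
    # Cold, rarely-read data: favor storage with gzip.
    # Hot, frequently-read data: favor speed with snappy.
    codec = "gzip" if is_cold_storage else "snappy"
    spark.conf.set("spark.sql.parquet.compression.codec", codec)
    df.write.mode("overwrite").parquet(path)

df = spark.range(1_000_000)
write_events(df, "/warehouse/events_archive", is_cold_storage=True)
write_events(df, "/warehouse/events_recent", is_cold_storage=False)
```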

I have also recorded a YouTube video covering the above content; one can follow this link.
