Demystifying different compression codecs in big data

When working with big data file formats like Parquet, ORC, and Avro, you will mostly come across different compression codecs like Snappy, LZO, gzip, and bzip2. In this article, we will try to understand some of these compression codecs and discuss the fundamental differences between them.

Before starting anything, let's try to understand the benefits of compressing big data files. File compression brings two major benefits:

  • It reduces the space needed to store files.
  • It speeds up data transfer across the network, or to or from disk. Hence, it reduces the I/O cost.

When dealing with large volumes of data, both of these savings can be significant. Compression and decompression do come with some cost in terms of the time taken to compress and decompress the data, but when we compare this cost with the I/O gain, the additional time is usually negligible.
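
To make the storage saving concrete, here is a minimal Python sketch (not from the original article) that compresses a block of repetitive text with gzip and compares the sizes; the sample data is an illustrative assumption.

```python
import gzip

# Illustrative sample: repetitive data compresses very well,
# which is typical of log lines and columnar data.
raw = ("2023-01-01 INFO request served in 12ms\n" * 100_000).encode("utf-8")

compressed = gzip.compress(raw)  # the encoder side of the codec

print(f"raw size:          {len(raw):>12,} bytes")
print(f"compressed size:   {len(compressed):>12,} bytes")
print(f"compression ratio: {len(raw) / len(compressed):.1f}x")
```

Smaller files mean less data to store and less data to move over the network or read from disk, which is exactly where the I/O gain comes from.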

When dealing with any big data or distributed system, it is very important to select the right compression codec. But the first question that comes to mind is: what does "codec" mean?

Codec is short for compressor/decompressor. It refers to software, hardware, or a combination of the two that applies compression/decompression algorithms to data. A codec has two components: an encoder to compress the files and a decoder to decompress them.
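
As a small, hedged illustration of the encoder/decoder pair, Python's built-in zlib module exposes the two components as separate streaming objects; other codec libraries follow the same shape. The message below is made up for the example.

```python
import zlib

message = b"the same bytes go in one side and come out the other " * 1000

# Encoder: turns raw bytes into compressed bytes.
encoder = zlib.compressobj(6)
compressed = encoder.compress(message) + encoder.flush()

# Decoder: turns compressed bytes back into the original bytes.
decoder = zlib.decompressobj()
restored = decoder.decompress(compressed) + decoder.flush()

assert restored == message
print(f"{len(message)} bytes -> {len(compressed)} bytes and back")
```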

The type of codec that we can use depends on the data and file type we are trying to compress. It also depends on whether we need our compressed file to be splittable. Splittable files can be processed in parallel by different processors.

Which compression format should we use?

Which compression format we should use depends on our application. Two factors largely determine the choice:

  1. Storage
  2. Speed

Some compression codecs are more optimized for storage, while others are more optimized for speed. Do we want to maximize the speed of the application, or are we more concerned about keeping storage costs down? There is a trade-off between the two: if we want a higher compression ratio, we have to spend more time compressing, whereas if we want better speed, we generally accept a lower compression ratio.

In general, one should try different strategies for the application and benchmark them with representative datasets to find the best approach. Two more parameters are helpful while choosing a compression codec:

  • Compression ratio - How much is the data compressed, i.e. how good is the compression itself from source to destination?
  • Throughput, compression speed, decompression speed - How quickly can the algorithm compress and decompress the data? Throughput is mostly measured in MB/s (see the benchmarking sketch after this list).
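
Here is a minimal benchmarking sketch along those lines, using Python's standard gzip and bz2 modules, plus the third-party python-snappy package if it happens to be installed; the payload is an assumption for illustration, and real benchmarks should use your own representative data.

```python
import bz2
import gzip
import time

# Illustrative payload: repetitive CSV-like rows.
payload = ("user_id,event,timestamp\n"
           + "42,click,2023-01-01T00:00:00\n" * 200_000).encode("utf-8")

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

try:
    import snappy  # optional: python-snappy
    codecs["snappy"] = (snappy.compress, snappy.decompress)
except ImportError:
    pass

for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    blob = compress(payload)
    c_time = time.perf_counter() - start

    start = time.perf_counter()
    decompress(blob)
    d_time = time.perf_counter() - start

    ratio = len(payload) / len(blob)
    print(f"{name:>6}: ratio {ratio:5.1f}x, "
          f"compress {len(payload) / c_time / 1e6:6.1f} MB/s, "
          f"decompress {len(payload) / d_time / 1e6:6.1f} MB/s")
```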

For large files, one should not use a compression format that does not support splitting of the whole file, since that loses data locality and makes MapReduce applications, or any distributed computing engine like Spark, very inefficient.
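
To see why this matters, here is a hedged PySpark sketch (the file paths are hypothetical): a large gzip-compressed text file cannot be split, so Spark typically has to read it in a single partition, while a bzip2-compressed copy of the same data can be split across many partitions and read in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittability-demo").getOrCreate()

# Hypothetical paths to the same large dataset stored two ways.
gz_partitions = spark.read.text("/data/events/big.log.gz").rdd.getNumPartitions()
bz2_partitions = spark.read.text("/data/events/big.log.bz2").rdd.getNumPartitions()

# gzip is not splittable, so the whole file usually lands in one partition;
# bzip2 is splittable, so Spark can fan the read out across many tasks.
print(f"gzip partitions:  {gz_partitions}")
print(f"bzip2 partitions: {bz2_partitions}")
```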

Different compression codecs in big data files

There are many compression codecs available for big data files. Our goal is not to understand every codec available, but to understand the basic thinking behind choosing between the different options. So let's discuss the following codecs that are useful for compressing big data:

  • Snappy
  • LZO
  • Gzip
  • Bzip2

I have created an illustration for these. Let's look at the picture below once, and then we will discuss each of these compression codecs individually.


Snappy

  • Snappy is a very fast compression codec; however, its compression ratio is not very good.
  • In most projects and distributed systems, it is the default choice of compression codec because it gives a good balance between compression and speed.
  • It is more optimized for speed than for storage.
  • Snappy is not inherently splittable, but it is mostly used with container-based file formats like Parquet, Avro, and ORC, which take care of splittability themselves (see the sketch below).
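
As a minimal sketch of that common pairing, the following hedged PySpark snippet writes a Parquet file with Snappy compression; the DataFrame contents and output path are made up for illustration. Splitting is then handled at the Parquet row-group level rather than by the codec itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-parquet-demo").getOrCreate()

# Toy DataFrame just for the example.
df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

# Snappy is the usual default for Parquet; we state it explicitly for clarity.
(df.write
   .option("compression", "snappy")
   .mode("overwrite")
   .parquet("/tmp/users_snappy.parquet"))
```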

LZO

  • LZO is also optimized for speed like Snappy, but unlike Snappy, it is inherently splittable.
  • It is also more optimized for speed than for storage.

Gzip

  • It is more optimized for storage.
  • In terms of processing speed, it is slow.
  • Gzip is also not inherently splittable, so one should use it with container-based file formats like Parquet or ORC.
  • It uses the DEFLATE algorithm for compression (see the sketch below).
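
As a quick, hedged illustration of that last point, Python's gzip module is a thin wrapper around the same DEFLATE implementation exposed by zlib, and its compresslevel knob shows the storage-versus-speed trade-off directly; the sample data below is an assumption.

```python
import gzip

# Illustrative CSV-like payload.
data = ("col_a,col_b,col_c\n" + "1,foo,2023-01-01\n" * 500_000).encode("utf-8")

fast = gzip.compress(data, compresslevel=1)   # favors speed
small = gzip.compress(data, compresslevel=9)  # favors storage

print(f"level 1: {len(fast):,} bytes")
print(f"level 9: {len(small):,} bytes")
```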

Bzip2

  • It is very much optimized for storage, but it is slow in terms of speed.
  • It is inherently splittable (see the sketch below).
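
Because bzip2 is splittable on its own, it is one of the few codecs that is reasonable for plain text output that downstream jobs will re-read in parallel. Below is a hedged PySpark sketch that writes bzip2-compressed text output using Hadoop's built-in BZip2Codec; the RDD contents and output path are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bzip2-text-demo").getOrCreate()
sc = spark.sparkContext

# Toy data just for the example.
lines = sc.parallelize([f"event-{i}" for i in range(100_000)])

# Hadoop's built-in bzip2 codec; the resulting .bz2 part files stay splittable.
lines.saveAsTextFile(
    "/tmp/events_bz2",
    compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec",
)
```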

To conclude all of the above points, we can divide these compression codecs into two broad categories. If we are working with a big data system that holds cold data, meaning data that is not accessed very often, and we want to save storage cost, then we can opt for a codec like Gzip, which is more suited for storage than for speed. Whereas if we are dealing with hot data, like a real-time or batch big data pipeline where we access the data often, then we should opt for a codec like Snappy, which gives a better balance between speed and storage.
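
As a final hedged sketch, here is one way this choice often shows up in practice with Spark and Parquet: setting the session-level codec depending on whether the table is treated as hot or cold. The config key shown is a standard Spark SQL setting, but the table paths and the hot/cold flag are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-choice-demo").getOrCreate()

def write_events(df, path, is_cold_storage):
    # Cold, rarely-read data: favor storage with gzip.
    # Hot, frequently-read data: favor speed with snappy.
    codec = "gzip" if is_cold_storage else "snappy"
    spark.conf.set("spark.sql.parquet.compression.codec", codec)
    df.write.mode("overwrite").parquet(path)

df = spark.range(1_000_000)
write_events(df, "/warehouse/events_archive", is_cold_storage=True)
write_events(df, "/warehouse/events_recent", is_cold_storage=False)
```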

I have also recorded a YouTube video covering the above content; one can follow this link.
