Solving the Small File Problem with Apache Hudi
One of the biggest pains for Data Engineers is the small file problem.
Let me tell you a short story and explain how one very efficient tool solves this problem.
A few days ago, while UPSERTING data into my Apache Hudi table, I was watching the pipeline run and noticed that my small files were being compacted into larger files.
I sat in awe for a few minutes, until I discovered the magical capability of Apache Hudi behind it.
One of the best features that Apache Hudi provides is its ability to overcome the dreaded small file problem.
For those unfamiliar with Apache Hudi, here's a brief definition and an overview of how it is used.
Apache Hudi is an open table format that can be used to efficiently enable a Data Lakehouse. It provides many building blocks for a Data Lakehouse, a few of which stand out to me as the most impactful.
To give better context: Hudi is not just an open table format; it has many other features that amaze me every day.
But this post is not about what Apache Hudi is or where to use it. It is about one of its features that I have fallen in love with recently: its ability to deal with the small file problem.
One design decision in Hudi was to avoid creating small files in the first place by always writing properly sized files. There are two ways in which Hudi solves this issue.
Auto Size During Ingestion: For COW
Automatically managing file sizes during ingestion adds slight write latency, but it ensures efficient read queries immediately after a write is committed. Failing to manage file sizes during writing results in slow queries until a resizing cleanup is completed.
There are two important parameters that Apache Hudi uses during this process.
The first, the max file size, sets the target size for the files Hudi writes. The second, the small file limit, sets the threshold below which a storage file becomes a candidate to be selected as a small file. By default, any file <= 100 MB is treated as a small file.
Let's try to understand this with an example for a COW (Copy on Write) Hudi table.
Suppose our configuration is set to the following.
hoodie.parquet.max.file.size: 120 MB
hoodie.parquet.small.file.limit: 100 MB
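To make this concrete, here is a minimal PySpark sketch of how these two options could be passed on an upsert into a COW table. It assumes the Hudi Spark bundle is on the classpath; the table name, paths, and key fields are hypothetical:

```python
# Minimal sketch: upserting into a COW Hudi table with explicit file-sizing
# options. Table name, input/output paths, and key fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-file-sizing-demo").getOrCreate()

df = spark.read.json("s3://my-bucket/incoming/")  # hypothetical input path

hudi_options = {
    "hoodie.table.name": "trips_cow",                      # hypothetical
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "trip_id",  # hypothetical key
    "hoodie.datasource.write.precombine.field": "ts",      # hypothetical field
    # Writers try to pack data files up to ~120 MB.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Files at or below ~100 MB are "small" and become candidates to
    # receive new inserts until they approach the max file size.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/tables/trips_cow"))  # hypothetical base path
```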
Now let's see how Apache Hudi solves the small file problem, step by step, for the COW table.
Sizing Up Small Files with Hudi
Hudi helps your data find the right fit, just like sorting puzzle pieces to complete a picture!
Let's walk through one illustration.
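Since the idea boils down to bin-packing, here is a toy Python sketch of it under our example configs. The file names, sizes, and the fixed average record size are all assumptions; the real writer estimates record sizes from previous commits and works per partition, so treat this only as a mental model, not Hudi's actual implementation:

```python
# Toy model of Hudi's COW small-file handling (NOT the actual implementation):
# incoming inserts are first routed to existing "small" files until each one
# approaches the max file size; leftover records spill into new files.

MAX_FILE_SIZE = 120 * 1024 * 1024     # hoodie.parquet.max.file.size
SMALL_FILE_LIMIT = 100 * 1024 * 1024  # hoodie.parquet.small.file.limit
AVG_RECORD_SIZE = 1024                # assumed; Hudi estimates this from past commits

# Hypothetical existing data files in one partition (name -> size in bytes).
existing_files = {
    "f1.parquet": 40 * 1024 * 1024,   # below the limit -> small file
    "f2.parquet": 110 * 1024 * 1024,  # above the limit -> left untouched
}

def assign_inserts(num_records: int) -> dict:
    """Greedily pack incoming records into small files, then into new files."""
    assignments = {}
    for name, size in existing_files.items():
        if num_records == 0:
            break
        if size <= SMALL_FILE_LIMIT:  # candidate small file
            capacity = (MAX_FILE_SIZE - size) // AVG_RECORD_SIZE
            take = min(capacity, num_records)
            assignments[name] = take
            num_records -= take
    # Remaining records go into brand-new, properly sized files.
    per_new_file = MAX_FILE_SIZE // AVG_RECORD_SIZE
    new_file_idx = 0
    while num_records > 0:
        take = min(per_new_file, num_records)
        assignments[f"new_{new_file_idx}.parquet"] = take
        num_records -= take
        new_file_idx += 1
    return assignments

print(assign_inserts(200_000))
# f1.parquet (40 MB) absorbs inserts until it approaches 120 MB;
# f2.parquet is above the 100 MB small-file limit, so it gets nothing;
# the rest of the records land in a new, properly sized file.
```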
NOTE: In either case, small files will be auto-sized only if there is no PENDING compaction or associated log file for that particular file slice. For example, if you had a log file and a compaction C1 was scheduled to convert that log file to parquet, no more inserts can go into that log file.
This algorithm might seem a little involved :) but trust me, Apache Hudi is magical when it comes to handling the small file problem.
We have discussed a lot of concepts in this article, and it is getting long. I hope some readers who understand the file sizing algorithm for the Apache Hudi MOR (Merge On Read) table will post their explanation in the comments. Feel free to subscribe to my YouTube channel, The Big Data Show; I might upload a more detailed discussion of these concepts in the coming days.
More than anything, thank you for the most precious gift you can give a writer: your time.