Solve Small File Problem using Apache Hudi


One of the biggest pains for Data Engineers is the small file problem.


Let me tell you a short story and explain how one efficient tool solves this problem.


A few days ago, while UPSERTING data into my Apache Hudi table, I was observing the pipeline output and noticed that my small files were being compacted into larger files.


I was in awe for a few minutes, until I discovered that this is one of the magical capabilities of Apache Hudi.


One of the best features Apache Hudi provides is its ability to overcome the dreaded small file problem.

For those unfamiliar with Apache Hudi, here's a brief definition and an overview of its usage.

Apache Hudi is an open table format that can be used to build a Data Lakehouse efficiently. It provides many capabilities for building a Data Lakehouse, but in my opinion, the three most impactful are the following.


  1. Efficient ingestion: support for mutability, i.e. row-level updates and deletes (a minimal upsert sketch follows this list).
  2. Efficient read/write performance: support for multiple index types to make writes and reads faster, support for MoR tables, and an improved file layout and timeline.
  3. Concurrency control and ACID guarantees.
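
To make the first point concrete, here is a minimal PySpark sketch of an upsert into a Hudi table. It is only an illustration, assuming the Hudi Spark bundle is on the classpath; the table name, schema, record key, and base path are hypothetical placeholders, not anything prescribed by this post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Hypothetical incoming batch containing both updates and new rows.
incoming = spark.createDataFrame(
    [(1, "Alice", "2024-01-02"), (2, "Bob", "2024-01-02")],
    ["id", "name", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "customers",                          # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",           # row identity used for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version of a key wins
    "hoodie.datasource.write.operation": "upsert",             # row-level update + insert
}

(
    incoming.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/customers")  # hypothetical base path
)
```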


To give better context: I feel Hudi is not just an open table format; it has many other features that amaze me every day.

But this post is not about what Apache Hudi is and where to use it. It is about one of the features I have fallen in love with recently, i.e. its ability to deal with the small file problem.


One design decision in Hudi was to avoid creating small files by always writing properly sized files. There are two ways in which Hudi solves this issue.

  • Auto-Size During Ingestion (for both CoW and MoR tables)
  • Auto-Size With Clustering

Auto-Size During Ingestion: For CoW

Automatically managing file sizes during ingestion adds a little write latency, but it ensures that read queries are efficient immediately after a write is committed. If file sizes are not managed at write time, queries stay slow until a resizing cleanup (such as clustering) is completed.


There are two important parameters that Apache Hudi uses during this process.


  1. hoodie.parquet.max.file.size: Target size in bytes for parquet files produced by Hudi write phases.
  2. hoodie.parquet.small.file.limit: During an upsert operation, Hudi opportunistically expands existing small files on storage, instead of writing new files, to keep the number of files optimal.


This second config sets the file size limit below which a storage file becomes a candidate to be treated as a small file. By default, any file <= 100 MB is treated as a small file.

Let's try to understand this by taking an example for a CoW (Copy on Write) Hudi table.

So let's suppose our configuration is set to the following values (a short write sketch using them follows).

hoodie.parquet.max.file.size: 120 MB

hoodie.parquet.small.file.limit: 100 MB
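
As a rough sketch of how these two values could be supplied on a Spark write (again assuming the Hudi Spark bundle is available; the table name, fields, and path are hypothetical), note that both configs take sizes in bytes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-file-sizing-sketch").getOrCreate()

# Hypothetical batch of events; in practice this would be your real DataFrame.
df = spark.createDataFrame(
    [(1, "click", 1000), (2, "view", 1001)],
    ["event_id", "event_type", "ts"],
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Target size for base/parquet files written by Hudi (bytes): 120 MB.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Files at or below this size (bytes) are treated as "small" and padded
    # with incoming records before new files are created: 100 MB.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/events")  # hypothetical base path
)
```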

Now let's see, step by step, how Apache Hudi solves the small file problem for a CoW table.

Sizing Up Small Files with Hudi

  1. File Size Controls: Think of it like adjusting the size of a picture! Hudi lets you set the maximum size for base/parquet files (hoodie.parquet.max.file.size). You can also decide when a file is "small" based on a soft limit (hoodie.parquet.small.file.limit).
  2. Smart Start: When you're starting a Hudi table, Hudi estimates the size of each record, like fitting puzzle pieces. The aim is to pack your data records neatly into a parquet file so they're arranged well and take up less storage space.
  3. Memory Lane: As you keep writing, Hudi remembers the average record size from previous commits. This helps it write and organize data better.
  4. Writing Magic: Imagine Hudi as a clever organizer. It adds more records to small files as you write, kind of like filling up a box. The goal? To reach the maximum size you set. For example, if an existing file is 40 MB, the small file limit is set to 100 MB, and the max file size is set to 120 MB, then during the next set of inserts Hudi will try to add 80 MB of data to bring that file up to 120 MB.
  5. The Perfect Fit: Let's say you set a compactionSmallFileSize of 100 MB and a limitFileSize of 120 MB. Hudi looks at files smaller than 100 MB and adds data until they're a comfy 120 MB.

Hudi helps your data find the right fit, just like sorting puzzle pieces to complete a picture! A tiny arithmetic sketch of steps 4 and 5 follows.
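
Here is a tiny illustrative Python sketch (not Hudi's actual code) of the single-file arithmetic: a file qualifies as small if it is at or below the soft limit, and the amount of new data it can absorb is the gap between its current size and the target size.

```python
# Illustrative sketch only -- not Hudi's implementation.
MB = 1024 * 1024

def room_to_fill(current_size, small_file_limit=100 * MB, max_file_size=120 * MB):
    """How many bytes of incoming inserts a base file can absorb."""
    if current_size > small_file_limit:
        return 0                      # not a small file; leave it alone
    return max_file_size - current_size

# Step 4's example: a 40 MB file can take roughly 80 MB more to reach 120 MB.
print(room_to_fill(40 * MB) // MB)    # -> 80
```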


Let's understand this with an illustration.

  • Let's suppose that the first time you write to Hudi using Spark, it creates 5 files.

  • During the second insert, it will identify File 1, File 2, and File 3 as small files because we have set hoodie.parquet.small.file.limit: 100 MB, so incoming inserts are assigned to them first. For example, Hudi will try to assign 90 MB more to File 1, 100 MB to File 2, and 40 MB to File 3. It does this because we have set hoodie.parquet.max.file.size: 120 MB and want to fill each file to capacity.

Incoming records are bin-packed into existing small files.


  • Now let's suppose that after this we are still left with remaining records to be inserted; Hudi will then create new files. For example, if you are left with 300 MB worth of remaining records, it might create 3 more files of 120 MB, 120 MB, and 60 MB respectively.
  • During the next run of inserts, it will again try to add 60 MB of data to that last 60 MB file first, bringing it to 120 MB.
  • This file-sizing algorithm keeps running on every write to maintain the optimal size of your files. It ensures that you don't have a small file problem in your cloud storage. A small simulation of this whole run is sketched below.
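
The following is a small, self-contained simulation of the run described above (illustrative only, not Hudi's actual code; the existing file sizes are hypothetical values chosen to be consistent with the numbers in the illustration). Incoming inserts are first bin-packed into files at or below the 100 MB soft limit, and whatever is left over spills into new files of at most 120 MB.

```python
# Illustrative simulation only -- not Hudi's implementation. Sizes are in MB.
SMALL_FILE_LIMIT = 100
MAX_FILE_SIZE = 120

def plan_inserts(existing_files, incoming_mb):
    """Return (per-file top-ups, sizes of newly created files) for one insert batch."""
    top_ups, remaining = {}, incoming_mb
    for name, size in existing_files.items():
        if remaining <= 0:
            break
        if size <= SMALL_FILE_LIMIT:                 # candidate small file
            take = min(MAX_FILE_SIZE - size, remaining)
            top_ups[name] = take
            remaining -= take
    new_files = []
    while remaining > 0:                             # leftover records go to new files
        new_files.append(min(MAX_FILE_SIZE, remaining))
        remaining -= new_files[-1]
    return top_ups, new_files

# Hypothetical sizes: File 1-3 are below the 100 MB limit, File 4-5 are not.
existing = {"File 1": 30, "File 2": 20, "File 3": 80, "File 4": 105, "File 5": 118}

top_ups, new_files = plan_inserts(existing, incoming_mb=530)
print(top_ups)    # {'File 1': 90, 'File 2': 100, 'File 3': 40}
print(new_files)  # [120, 120, 60]
```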

NOTE: In either case, small files will be auto-sized only if there is no PENDING compaction or associated log file for that particular file slice. For example, if you had a log file and a compaction C1 was scheduled to convert that log file to parquet, no more inserts can go into that log file.


This algorithm might seem a little involved :) but trust me, Apache Hudi is magical when it comes to handling the small file problem.


We have discussed a lot of concepts in this article, and it is becoming long. I hope some of my readers who understand the file-sizing algorithm for the Apache Hudi MoR (Merge On Read) table will post their explanation in the comment section. Also, feel free to subscribe to my YouTube channel, The Big Data Show; I might upload a more detailed discussion of the above concepts in the coming days.

Above all, thank you for giving me the most precious gift a writer can receive, i.e. your time.
