Solve Small File Problem using Apache Hudi


One of the biggest pains for Data Engineers is the small file problem.


Let me tell you a short story and explain how one efficient tool solves this problem.


A few days ago, while UPSERTING data into my Apache Hudi table, I was observing the pipeline output and noticed that my small files were being compacted into larger files.


I was in awe for a few minutes, until I discovered that this is one of the magical capabilities of Apache Hudi.


One of the best features Apache Hudi provides is its ability to overcome the dreaded small file problem.

For those unfamiliar with Apache Hudi, here's a brief definition and an overview of its usage.

Apache Hudi is an open table format that can be used to build a Data Lakehouse efficiently. It provides many capabilities for building a Data Lakehouse, but in my opinion, the three most impactful are the following.


  1. Efficient ingestion: support for mutability, i.e. row-level updates and deletes (a minimal upsert sketch follows this list).
  2. Efficient read/write performance: support for multiple index types to make writes and reads faster, support for MoR tables, and an improved file layout and timeline.
  3. Concurrency control and ACID guarantees.
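
To make the first point concrete, here is a minimal PySpark sketch of an upsert into a Hudi table. It is only an illustration, assuming the Hudi Spark bundle is on the classpath; the table name, schema, record key, and base path are hypothetical placeholders, not anything prescribed by this post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Hypothetical incoming batch containing both updates and new rows.
incoming = spark.createDataFrame(
    [(1, "Alice", "2024-01-02"), (2, "Bob", "2024-01-02")],
    ["id", "name", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "customers",                          # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",           # row identity used for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version of a key wins
    "hoodie.datasource.write.operation": "upsert",             # row-level update + insert
}

(
    incoming.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/customers")  # hypothetical base path
)
```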


To give better context: I feel Hudi is not just an open table format; it has many other features that amaze me every day.

But this post is not about what Apache Hudi is and where to use it. It is about one of the features I have fallen in love with recently, i.e. its ability to deal with the small file problem.


One design decision in Hudi was to avoid creating small files by always writing properly sized files. There are two ways in which Hudi solves this issue.

  • Auto-Size During Ingestion (for both CoW and MoR tables)
  • Auto-Size With Clustering

Auto-Size During Ingestion: For CoW

Automatically managing file sizes during ingestion adds a little write latency, but it ensures that read queries are efficient immediately after a write is committed. If file sizes are not managed at write time, queries stay slow until a resizing cleanup (such as clustering) is completed.


There are two important parameters that Apache Hudi uses during this process.


  1. hoodie.parquet.max.file.size: Target size in bytes for parquet files produced by Hudi write phases.
  2. hoodie.parquet.small.file.limit: During an upsert operation, Hudi opportunistically expands existing small files on storage, instead of writing new files, to keep the number of files optimal.


This second config sets the file size limit below which a storage file becomes a candidate to be treated as a small file. By default, any file <= 100 MB is treated as a small file.

Let's try to understand this by taking an example for a CoW (Copy on Write) Hudi table.

So let's suppose our configuration is set to the following values (a short write sketch using them follows).

hoodie.parquet.max.file.size: 120 MB

hoodie.parquet.small.file.limit: 100 MB
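
As a rough sketch of how these two values could be supplied on a Spark write (again assuming the Hudi Spark bundle is available; the table name, fields, and path are hypothetical), note that both configs take sizes in bytes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-file-sizing-sketch").getOrCreate()

# Hypothetical batch of events; in practice this would be your real DataFrame.
df = spark.createDataFrame(
    [(1, "click", 1000), (2, "view", 1001)],
    ["event_id", "event_type", "ts"],
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Target size for base/parquet files written by Hudi (bytes): 120 MB.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Files at or below this size (bytes) are treated as "small" and padded
    # with incoming records before new files are created: 100 MB.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/events")  # hypothetical base path
)
```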

Now let's see, step by step, how Apache Hudi solves the small file problem for a CoW table.

Sizing Up Small Files with Hudi

  1. File Size Controls: Think of it like adjusting the size of a picture! Hudi lets you set the maximum size for base/parquet files (hoodie.parquet.max.file.size). You can also decide when a file is "small" based on a soft limit (hoodie.parquet.small.file.limit).
  2. Smart Start: When you're starting a Hudi table, Hudi estimates the size of each record, like fitting puzzle pieces. The aim is to pack your data records neatly into a parquet file so they're arranged well and take up less storage space.
  3. Memory Lane: As you keep writing, Hudi remembers the average record size from previous commits. This helps it write and organize data better.
  4. Writing Magic: Imagine Hudi as a clever organizer. It adds more records to small files as you write, kind of like filling up a box. The goal? To reach the maximum size you set. For example, if an existing file is 40 MB, the small file limit is set to 100 MB, and the max file size is set to 120 MB, then during the next set of inserts Hudi will try to add 80 MB of data to bring that file up to 120 MB.
  5. The Perfect Fit: Let's say you set a compactionSmallFileSize of 100 MB and a limitFileSize of 120 MB. Hudi looks at files smaller than 100 MB and adds data until they're a comfy 120 MB.

Hudi helps your data find the right fit, just like sorting puzzle pieces to complete a picture! A tiny arithmetic sketch of steps 4 and 5 follows.
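
Here is a tiny illustrative Python sketch (not Hudi's actual code) of the single-file arithmetic: a file qualifies as small if it is at or below the soft limit, and the amount of new data it can absorb is the gap between its current size and the target size.

```python
# Illustrative sketch only -- not Hudi's implementation.
MB = 1024 * 1024

def room_to_fill(current_size, small_file_limit=100 * MB, max_file_size=120 * MB):
    """How many bytes of incoming inserts a base file can absorb."""
    if current_size > small_file_limit:
        return 0                      # not a small file; leave it alone
    return max_file_size - current_size

# Step 4's example: a 40 MB file can take roughly 80 MB more to reach 120 MB.
print(room_to_fill(40 * MB) // MB)    # -> 80
```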


Let's understand this with an illustration.

  • Let's suppose that the first time you write to Hudi using Spark, it creates 5 files.

  • During the second insert, it will identify File 1, File 2, and File 3 as small files because we have set hoodie.parquet.small.file.limit: 100 MB, so incoming inserts are assigned to them first. For example, Hudi will try to assign 90 MB more to File 1, 100 MB to File 2, and 40 MB to File 3. It does this because we have set hoodie.parquet.max.file.size: 120 MB and want to fill each file to capacity.

Incoming records are bin-packed into existing small files.


  • Now let's suppose that after this we are still left with remaining records to be inserted; Hudi will then create new files. For example, if you are left with 300 MB worth of remaining records, it might create 3 more files of 120 MB, 120 MB, and 60 MB respectively.
  • During the next run of inserts, it will again try to add 60 MB of data to that last 60 MB file first, bringing it to 120 MB.
  • This file-sizing algorithm keeps running on every write to maintain the optimal size of your files. It ensures that you don't have a small file problem in your cloud storage. A small simulation of this whole run is sketched below.
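
The following is a small, self-contained simulation of the run described above (illustrative only, not Hudi's actual code; the existing file sizes are hypothetical values chosen to be consistent with the numbers in the illustration). Incoming inserts are first bin-packed into files at or below the 100 MB soft limit, and whatever is left over spills into new files of at most 120 MB.

```python
# Illustrative simulation only -- not Hudi's implementation. Sizes are in MB.
SMALL_FILE_LIMIT = 100
MAX_FILE_SIZE = 120

def plan_inserts(existing_files, incoming_mb):
    """Return (per-file top-ups, sizes of newly created files) for one insert batch."""
    top_ups, remaining = {}, incoming_mb
    for name, size in existing_files.items():
        if remaining <= 0:
            break
        if size <= SMALL_FILE_LIMIT:                 # candidate small file
            take = min(MAX_FILE_SIZE - size, remaining)
            top_ups[name] = take
            remaining -= take
    new_files = []
    while remaining > 0:                             # leftover records go to new files
        new_files.append(min(MAX_FILE_SIZE, remaining))
        remaining -= new_files[-1]
    return top_ups, new_files

# Hypothetical sizes: File 1-3 are below the 100 MB limit, File 4-5 are not.
existing = {"File 1": 30, "File 2": 20, "File 3": 80, "File 4": 105, "File 5": 118}

top_ups, new_files = plan_inserts(existing, incoming_mb=530)
print(top_ups)    # {'File 1': 90, 'File 2': 100, 'File 3': 40}
print(new_files)  # [120, 120, 60]
```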

NOTE: In either case, small files will be auto-sized only if there is no PENDING compaction or associated log file for that particular file slice. For example, if you had a log file and a compaction C1 was scheduled to convert that log file to parquet, no more inserts can go into that log file.


This algorithm might seem a little involved :) but trust me, Apache Hudi is magical when it comes to handling the small file problem.


We have discussed a lot of concepts in this article, and it is becoming long. I hope some of my readers who understand the file-sizing algorithm for the Apache Hudi MoR (Merge On Read) table will post their explanation in the comment section. Also, feel free to subscribe to my YouTube channel, The Big Data Show; I might upload a more detailed discussion of the above concepts in the coming days.

Above all, thank you for giving me the most precious gift a writer can receive, i.e. your time.
