3 Solutions for Big Data’s Small Files Problem!
Kumar Chinnakali
In this blog we will discuss efficient solutions to the “small files” problem. And what is a small file in a Big Data Hadoop environment? In the Hadoop world, a small file is a file whose size is much smaller than the HDFS block size. The default HDFS block size is 64 MB, so, for example, a 2 MB, 3 MB, 5 MB, or 7 MB file is considered a small file. However, the block size is configurable; it is defined by a parameter called dfs.block.size.
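As a quick, hedged illustration (a minimal sketch assuming a standard Hadoop Configuration on the classpath; newer Hadoop releases rename the property to dfs.blocksize), the block size can be read or overridden programmatically:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Older Hadoop releases use dfs.block.size; newer ones use dfs.blocksize.
        long blockSize = conf.getLong("dfs.block.size", 64 * 1024 * 1024L);
        System.out.println("HDFS block size (bytes): " + blockSize);
        // Per-client override (normally this is set cluster-wide in hdfs-site.xml).
        conf.setLong("dfs.block.size", 128 * 1024 * 1024L);
    }
}
```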
In general Hadoop handles big files very well, but when the files are small it passes each small file to a map() function, which is not very efficient because it creates a large number of mappers. For example, 1,000 files of 2 to 3 MB each will need 1,000 mappers, which is very inefficient. Having too many small files can therefore be problematic in Hadoop. To solve this problem, we should merge many of these small files into one and then process them; note that Hadoop is mainly designed for batch-processing a large volume of data rather than processing many small files. The main purpose of solving the small files problem is to speed up the execution of a Hadoop program by combining small files into bigger files. Solving the small files problem shrinks the number of map() functions executed and hence improves the overall performance of a Hadoop job.
- Solution 1: using a custom merge of small files
- Solution 2: using a custom implementation of CombineFileInputFormat<K,V>
- Solution 3: using the filecrush tool
Solution 1: using a custom merge of small files:
This solution merges small files into big files on the client side. Let us assume that we have to process 20,000 small files, each smaller than 64 MB, and we want to process them efficiently in the Big Data Hadoop environment. If we just send these files as input via FileInputFormat.addInputPath(Job, Path), then each input file will be sent to a mapper and we will end up with 20,000 mappers, which is very inefficient. Let dfs.block.size be 64 MB, assume that the size of these files is between 2 and 3 MB, and further assume that we have M (such as 100, 200, 300, …) mappers available to us. The following multithreaded algorithm solves the small files problem. Since our small files occupy about 2.5 MB on average, we can put 25 (25 × 2.5 ≈ 64 MB) small files into one HDFS block, which we call a bucket. Now we need only 800 (20,000 ÷ 25 = 800) mappers, which is very efficient compared to 20,000. Our algorithm puts N files into each bucket and then concurrently merges these small files into one file whose size is close to dfs.block.size.
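The bucketing step is easy to sketch. Below is a minimal, hypothetical helper (the BucketPlanner class and its plan() method are not the book’s code) that groups a list of small-file paths so that each bucket stays at or below dfs.block.size:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BucketPlanner {
    // Groups small files into buckets whose total size stays <= maxBucketBytes.
    public static List<List<Path>> plan(List<Path> smallFiles, long maxBucketBytes, Configuration conf)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        List<List<Path>> buckets = new ArrayList<>();
        List<Path> current = new ArrayList<>();
        long currentBytes = 0;
        for (Path p : smallFiles) {
            long len = fs.getFileStatus(p).getLen();
            if (!current.isEmpty() && currentBytes + len > maxBucketBytes) {
                buckets.add(current);          // this bucket is full; start a new one
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(p);
            currentBytes += len;
        }
        if (!current.isEmpty()) {
            buckets.add(current);              // last, partially filled bucket
        }
        return buckets;
    }
}
```

With the numbers above (roughly 2.5 MB per file and a 64 MB block), this yields about 25 files per bucket and therefore about 800 buckets, and hence about 800 mappers, for 20,000 files.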
Before submitting our small files to MapReduce/Hadoop, we merge them into big ones and then submit these to the MapReduce driver program. The major classes are the SmallFilesConsolidator class, which accepts a set of small Hadoop files and merges them into larger Hadoop files whose size is less than or equal to dfs.block.size, and the BucketThread class, which lets us concatenate small files into one big file whose size is smaller than the HDFS block size.
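The book’s SmallFilesConsolidator and BucketThread sources are not reproduced here; the sketch below is a hypothetical stand-in (the BucketMerger class name and its constructor are assumptions) showing how one bucket can be concatenated into a single HDFS file on its own thread:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical stand-in for the BucketThread idea: each instance concatenates
// one bucket of small files into a single merged HDFS file.
public class BucketMerger implements Runnable {
    private final Configuration conf;
    private final List<Path> bucket;
    private final Path mergedFile;

    public BucketMerger(Configuration conf, List<Path> bucket, Path mergedFile) {
        this.conf = conf;
        this.bucket = bucket;
        this.mergedFile = mergedFile;
    }

    @Override
    public void run() {
        try {
            // FileSystem.get returns a cached, shared instance; do not close it here.
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(mergedFile)) {
                for (Path p : bucket) {
                    try (FSDataInputStream in = fs.open(p)) {
                        IOUtils.copyBytes(in, out, 4096, false); // append this small file
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to merge bucket into " + mergedFile, e);
        }
    }
}
```

Once every merge thread has been joined, the driver registers the merged files, rather than the 20,000 originals, with FileInputFormat.addInputPath(job, mergedFile).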
Solution 2: using a custom implementation of CombineFileInputFormat<K,V>:
This solution uses the Hadoop API (the abstract class CombineFileInputFormat) to solve the small files problem. The idea behind CombineFileInputFormat is to combine small files into Hadoop splits (or chunks) by using a custom InputFormat. To use the abstract class CombineFileInputFormat, we have to provide/implement three custom classes: CustomCFIF, which extends CombineFileInputFormat; PairOfStringLong, a Writable key class; and CustomRecordReader, a custom RecordReader.
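A minimal sketch of such an input format is shown below. For brevity it assumes plain text input with LongWritable/Text keys and values and delegates to Hadoop’s LineRecordReader, rather than using the book’s PairOfStringLong key; the class names CustomCFIF and CustomRecordReader follow the description above.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Packs many small text files into combined splits capped at 64 MB.
public class CustomCFIF extends CombineFileInputFormat<LongWritable, Text> {

    public CustomCFIF() {
        // Cap each combined split at the block size used in this post (64 MB).
        setMaxSplitSize(64 * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // One CustomRecordReader instance is created per small file in the split.
        return new CombineFileRecordReader<>(
                (CombineFileSplit) split, context, CustomRecordReader.class);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Each small file is read whole; never split it further.
        return false;
    }
}

// Reads one small file out of a CombineFileSplit; the Integer index says which one.
class CustomRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final CombineFileSplit split;
    private final int index;

    public CustomRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
        this.split = split;
        this.index = index;
    }

    @Override
    public void initialize(InputSplit ignored, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Re-wrap the single file at 'index' as a FileSplit for the line reader.
        FileSplit fileSplit = new FileSplit(
                split.getPath(index), split.getOffset(index), split.getLength(index), null);
        delegate.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException { return delegate.nextKeyValue(); }

    @Override
    public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return delegate.getCurrentValue(); }

    @Override
    public float getProgress() throws IOException { return delegate.getProgress(); }

    @Override
    public void close() throws IOException { delegate.close(); }
}
```

In the driver, the format is enabled with job.setInputFormatClass(CustomCFIF.class), after which many small files are packed into each 64 MB split and handled by far fewer mappers.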
Solution 3: using the filecrush tool:
This method turns many small files into fewer larger ones, and it can also convert from text to sequence files and apply other compression options in one pass. Crush consumes directories containing many small files with the same key and value types and creates fewer, larger files containing the same data.
Crush gives us the control to:
- Name the output files
- Ignore files that are “big enough”
- Limit the size of each output file
- Control the output compression codec
- Swap smaller files with generated large files in-place
- Avoid the long-running task problem
The Hadoop filecrush tool can be used as a MapReduce job or as a standalone program. It navigates an entire file tree (or just a single folder), decides which files are below a threshold, and combines those into bigger files. The filecrush tool works with sequence or text files, and it can handle any type of sequence file regardless of key or value type.
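The exact filecrush command-line options are documented in the tool’s own README and are not repeated here. Instead, the hypothetical sketch below (the DirectoryToSequenceFile class is not part of filecrush) illustrates the underlying idea of crushing a directory of small files into a single sequence file keyed by the original file names:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustration only (not the filecrush tool itself): packs every small file in
// a directory into one SequenceFile keyed by the original file name.
public class DirectoryToSequenceFile {
    public static void main(String[] args) throws IOException {
        Path inputDir = new Path(args[0]);   // directory full of small files
        Path outputFile = new Path(args[1]); // single sequence file to create
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] contents = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        in.readFully(0, contents); // a small file fits in memory
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }
}
```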
Reference – Data Algorithms, Mahmoud Parsian.
Interesting? Please subscribe to our blogs at www.dataottam.com to keep yourself up to date on Big Data, Analytics, and IoT.
And as always, please feel free to suggest or comment at [email protected].