3 Solutions for Big Data’s Small Files Problem!
Kumar Chinnakali
In this blog we will discuss efficient solutions to the “small files” problem. And what is a small file in a Big Data Hadoop environment? In the Hadoop world, a small file is a file whose size is much smaller than the HDFS block size. The default HDFS block size is 64 MB, so, for example, a 2 MB, 3 MB, 5 MB, or 7 MB file is considered a small file. However, the block size is configurable; it is defined by a parameter called dfs.block.size.
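As a quick, hedged illustration (a minimal sketch assuming a standard Hadoop Configuration on the classpath; newer Hadoop releases rename the property to dfs.blocksize), the block size can be read or overridden programmatically:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Older Hadoop releases use dfs.block.size; newer ones use dfs.blocksize.
        long blockSize = conf.getLong("dfs.block.size", 64 * 1024 * 1024L);
        System.out.println("HDFS block size (bytes): " + blockSize);
        // Per-client override (normally this is set cluster-wide in hdfs-site.xml).
        conf.setLong("dfs.block.size", 128 * 1024 * 1024L);
    }
}
```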
In general Hadoop handles big files very well, but when the files are small it passes each small file to a map() function, which is not very efficient because it creates a large number of mappers. For example, 1,000 files of 2 to 3 MB each will need 1,000 mappers, which is very inefficient. Having too many small files can therefore be problematic in Hadoop. To solve this problem, we should merge many of these small files into one and then process them; note that Hadoop is mainly designed for batch-processing a large volume of data rather than processing many small files. The main purpose of solving the small files problem is to speed up the execution of a Hadoop program by combining small files into bigger files. Solving the small files problem shrinks the number of map() functions executed and hence improves the overall performance of a Hadoop job.
- Solution 1: using a custom merge of small files
- Solution 2: using a custom implementation of CombineFileInputFormat<K,V>
- Solution 3: using the filecrush tool
Solution 1: using a custom merge of small files:
This solution merges small files into big files on the client side. Let us assume that we have to process 20,000 small files, each smaller than 64 MB, and we want to process them efficiently in the Big Data Hadoop environment. If we just send these files as input via FileInputFormat.addInputPath(Job, Path), then each input file will be sent to a mapper and we will end up with 20,000 mappers, which is very inefficient. Let dfs.block.size be 64 MB, assume that the size of these files is between 2 and 3 MB, and further assume that we have M (such as 100, 200, 300, …) mappers available to us. The following multithreaded algorithm solves the small files problem. Since our small files occupy about 2.5 MB on average, we can put 25 (25 × 2.5 ≈ 64 MB) small files into one HDFS block, which we call a bucket. Now we need only 800 (20,000 ÷ 25 = 800) mappers, which is very efficient compared to 20,000. Our algorithm puts N files into each bucket and then concurrently merges these small files into one file whose size is close to dfs.block.size.
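The bucketing step is easy to sketch. Below is a minimal, hypothetical helper (the BucketPlanner class and its plan() method are not the book’s code) that groups a list of small-file paths so that each bucket stays at or below dfs.block.size:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BucketPlanner {
    // Groups small files into buckets whose total size stays <= maxBucketBytes.
    public static List<List<Path>> plan(List<Path> smallFiles, long maxBucketBytes, Configuration conf)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        List<List<Path>> buckets = new ArrayList<>();
        List<Path> current = new ArrayList<>();
        long currentBytes = 0;
        for (Path p : smallFiles) {
            long len = fs.getFileStatus(p).getLen();
            if (!current.isEmpty() && currentBytes + len > maxBucketBytes) {
                buckets.add(current);          // this bucket is full; start a new one
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(p);
            currentBytes += len;
        }
        if (!current.isEmpty()) {
            buckets.add(current);              // last, partially filled bucket
        }
        return buckets;
    }
}
```

With the numbers above (roughly 2.5 MB per file and a 64 MB block), this yields about 25 files per bucket and therefore about 800 buckets, and hence about 800 mappers, for 20,000 files.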
Before submitting our small files to MapReduce/Hadoop, we merge them into big ones and then submit these to the MapReduce driver program. The major classes are the SmallFilesConsolidator class, which accepts a set of small Hadoop files and merges them into larger Hadoop files whose size is less than or equal to dfs.block.size, and the BucketThread class, which lets us concatenate small files into one big file whose size is smaller than the HDFS block size.
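The book’s SmallFilesConsolidator and BucketThread sources are not reproduced here; the sketch below is a hypothetical stand-in (the BucketMerger class name and its constructor are assumptions) showing how one bucket can be concatenated into a single HDFS file on its own thread:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical stand-in for the BucketThread idea: each instance concatenates
// one bucket of small files into a single merged HDFS file.
public class BucketMerger implements Runnable {
    private final Configuration conf;
    private final List<Path> bucket;
    private final Path mergedFile;

    public BucketMerger(Configuration conf, List<Path> bucket, Path mergedFile) {
        this.conf = conf;
        this.bucket = bucket;
        this.mergedFile = mergedFile;
    }

    @Override
    public void run() {
        try {
            // FileSystem.get returns a cached, shared instance; do not close it here.
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(mergedFile)) {
                for (Path p : bucket) {
                    try (FSDataInputStream in = fs.open(p)) {
                        IOUtils.copyBytes(in, out, 4096, false); // append this small file
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to merge bucket into " + mergedFile, e);
        }
    }
}
```

Once every merge thread has been joined, the driver registers the merged files, rather than the 20,000 originals, with FileInputFormat.addInputPath(job, mergedFile).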
Solution 2: using a custom implementation of CombineFileInputFormat<K,V>:
This solution uses the Hadoop API (the abstract class CombineFileInputFormat) to solve the small files problem. The idea behind CombineFileInputFormat is to combine small files into Hadoop splits (or chunks) by using a custom InputFormat. To use the abstract class CombineFileInputFormat, we have to provide/implement three custom classes: CustomCFIF, which extends CombineFileInputFormat; PairOfStringLong, a Writable key class; and CustomRecordReader, a custom RecordReader.
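A minimal sketch of such an input format is shown below. For brevity it assumes plain text input with LongWritable/Text keys and values and delegates to Hadoop’s LineRecordReader, rather than using the book’s PairOfStringLong key; the class names CustomCFIF and CustomRecordReader follow the description above.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Packs many small text files into combined splits capped at 64 MB.
public class CustomCFIF extends CombineFileInputFormat<LongWritable, Text> {

    public CustomCFIF() {
        // Cap each combined split at the block size used in this post (64 MB).
        setMaxSplitSize(64 * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // One CustomRecordReader instance is created per small file in the split.
        return new CombineFileRecordReader<>(
                (CombineFileSplit) split, context, CustomRecordReader.class);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Each small file is read whole; never split it further.
        return false;
    }
}

// Reads one small file out of a CombineFileSplit; the Integer index says which one.
class CustomRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final CombineFileSplit split;
    private final int index;

    public CustomRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
        this.split = split;
        this.index = index;
    }

    @Override
    public void initialize(InputSplit ignored, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Re-wrap the single file at 'index' as a FileSplit for the line reader.
        FileSplit fileSplit = new FileSplit(
                split.getPath(index), split.getOffset(index), split.getLength(index), null);
        delegate.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException { return delegate.nextKeyValue(); }

    @Override
    public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return delegate.getCurrentValue(); }

    @Override
    public float getProgress() throws IOException { return delegate.getProgress(); }

    @Override
    public void close() throws IOException { delegate.close(); }
}
```

In the driver, the format is enabled with job.setInputFormatClass(CustomCFIF.class), after which many small files are packed into each 64 MB split and handled by far fewer mappers.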
Solution 3: using the filecrush tool:
This method turns many small files into fewer larger ones, and it can also convert from text to sequence files and apply other compression options in one pass. Crush consumes directories containing many small files with the same key and value types and creates fewer, larger files containing the same data.
Crush gives us the control to:
- Name the output files
- Ignore files that are “big enough”
- Limit the size of each output file
- Control the output compression codec
- Swap smaller files with generated large files in-place
- Avoid the long-running task problem
The Hadoop filecrush tool can be used as a MapReduce job or as a standalone program. It navigates an entire file tree (or just a single folder), decides which files are below a threshold, and combines those into bigger files. The filecrush tool works with sequence or text files, and it can handle any type of sequence file regardless of key or value type.
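The exact filecrush command-line options are documented in the tool’s own README and are not repeated here. Instead, the hypothetical sketch below (the DirectoryToSequenceFile class is not part of filecrush) illustrates the underlying idea of crushing a directory of small files into a single sequence file keyed by the original file names:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustration only (not the filecrush tool itself): packs every small file in
// a directory into one SequenceFile keyed by the original file name.
public class DirectoryToSequenceFile {
    public static void main(String[] args) throws IOException {
        Path inputDir = new Path(args[0]);   // directory full of small files
        Path outputFile = new Path(args[1]); // single sequence file to create
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] contents = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        in.readFully(0, contents); // a small file fits in memory
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }
}
```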
Reference – Data Algorithms, Mahmoud Parsian.
Interesting? Please subscribe to our blogs at www.dataottam.com to keep yourself up to date on Big Data, Analytics, and IoT.
And as always, please feel free to suggest or comment at [email protected].