Hadoop: Does using CombineFileInputFormat for small files give a performance improvement?


Question


I am new to Hadoop and performing some tests on my local machine.

There are many solutions for dealing with large numbers of small files. I am using a CombinedInputFormat, which extends CombineFileInputFormat.
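The question does not include the source of CombinedInputFormat, but a minimal sketch of the usual pattern under Hadoop 2.x's MapReduce API looks like the following. The class names and the choice of delegating to the standard line reader are assumptions for illustration, not the poster's actual code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Packs many small text files into each input split, so one mapper
    // processes several files instead of exactly one.
    public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            // CombineFileRecordReader iterates over the files in the combined
            // split, instantiating one wrapper per file via reflection.
            return new CombineFileRecordReader<>(
                    (CombineFileSplit) split, context, TextRecordReaderWrapper.class);
        }

        // Adapts the ordinary TextInputFormat line reader to a single file
        // inside a combined split.
        public static class TextRecordReaderWrapper
                extends CombineFileRecordReaderWrapper<LongWritable, Text> {
            public TextRecordReaderWrapper(CombineFileSplit split,
                    TaskAttemptContext context, Integer idx)
                    throws IOException, InterruptedException {
                super(new TextInputFormat(), split, context, idx);
            }
        }
    }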

I see that the number of mappers has dropped from 100 to 25 with CombinedInputFormat. Should I also expect a performance gain, since the number of mappers has been reduced?

I performed the map-reduce job on many small files without CombinedInputFormat: 100 mappers took 10 minutes.

But when the map-reduce job was executed with CombinedInputFormat: 25 mappers took 33 minutes.

Any help will be appreciated.


Answer 1:


Hadoop performs better with a small number of large files than with a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "number" means ranging into the thousands.)

That means that if you have 1000 files of 1 MB each, a map-reduce job based on the normal TextInputFormat will create 1000 map tasks, and each of these map tasks requires a certain amount of time to start and finish. This latency in task creation can reduce the performance of the job.
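To make the contrast concrete, here is a hedged driver sketch using CombineTextInputFormat, the combined text format shipped with Hadoop 2.x. The class name SmallFilesDriver, the 64 MB split cap, and the argument handling are illustrative assumptions; with 1000 files of 1 MB each, capping combined splits at 64 MB would yield roughly 16 map tasks instead of 1000:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(SmallFilesDriver.class);

            // Pack many small files into each split instead of one split per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 64 MB; this bounds how many small
            // files are grouped into a single map task.
            CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // job.setMapperClass(...); job.setReducerClass(...); etc.

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that if no maximum split size is set, CombineFileInputFormat may pack an entire node's worth of files into one split, which can leave mappers long-running and under-parallelized.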

In a multi-tenant cluster with resource limitations, acquiring a large number of map slots can also be difficult.

Please refer to this link for more details and benchmark results.



Source: https://stackoverflow.com/questions/36107504/hadoop-does-using-combinefileinputformat-for-small-files-gives-performance-impr
