Block Size in Hadoop


Question


I am currently working on a four-node cluster. Can anyone suggest an appropriate block size for working on a 22 GB input file? Thanks in advance.

Here are my performance results:

64 MB: 32 min
128 MB: 19.4 min
256 MB: 15 min

Now, should I consider making it much larger, say 1 GB or 2 GB? Kindly explain whether there would be any issues if I did so.

Edit: Also, if performance increases with increasing block size for a 20 GB input file, why is the default block size 64 MB or 128 MB? Kindly answer the similar question over here as well.


Answer 1:


What is the split size that you are going to use for processing this file? If it's slightly more than the default block size, then I'd suggest changing the block size to the split size value. This should increase the chances of data locality for mappers, thereby improving the job throughput.
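If you load the file into HDFS yourself, the block size can be set per file at write time instead of changing the cluster-wide default. A minimal sketch (the path, replication factor, and sizes here are placeholders, not from the question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            long blockSize = 256L * 1024 * 1024;   // e.g. 256 MB, matched to the split size
            short replication = 3;                 // typical default replication
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            // The explicit blockSize argument affects only this file, not the cluster default.
            Path out = new Path("/user/test/input-256m.txt");   // hypothetical path
            try (FSDataOutputStream stream =
                     fs.create(out, true, bufferSize, replication, blockSize)) {
                stream.writeBytes("...");          // write the data here
            }
        }
    }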

The split size is computed by the input format:

    protected long computeSplitSize(long blockSize, long minSize,
                                    long maxSize) {
        // Clamp the block size between the configured minimum and maximum split sizes.
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

minSize and maxSize can be adjusted using the configuration parameters below:

mapreduce.input.fileinputformat.split.minsize

mapreduce.input.fileinputformat.split.maxsize

You can find the detailed data flow in the FileInputFormat class.
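As a rough sketch of the driver-side setup (the job name and class name here are assumed, not taken from the question), the same two properties can be set through the FileInputFormat helper methods:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");   // hypothetical job name

            // These helpers write mapreduce.input.fileinputformat.split.minsize/.maxsize.
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);   // 128 MB floor
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);   // 256 MB ceiling

            // ... set mapper, reducer, and input/output paths, then job.waitForCompletion(true)
        }
    }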




Answer 2:


How heavy is the per-line processing? If it's simply a kind of "grep", then you should be fine increasing the block size up to 1 GB. Why not simply try it out? Your performance numbers already indicate that increasing the block size helps.

The case for smaller block sizes would be if each line required significant ancillary processing, but that seems doubtful given the performance trend you have already established.
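If you want to benchmark a 1 GB block size without touching the cluster-wide dfs.blocksize, one way (a sketch with made-up paths) is to re-copy the input with the block size overridden in the client configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RecopyWithLargerBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Applies only to files written by this client, not to existing data.
            conf.setLong("dfs.blocksize", 1024L * 1024 * 1024);   // 1 GB

            FileSystem fs = FileSystem.get(conf);
            // Hypothetical paths: a local copy of the 22 GB input and its HDFS destination.
            fs.copyFromLocalFile(new Path("/data/input-22g.txt"),
                                 new Path("/user/test/input-1g-blocks.txt"));
        }
    }

Then run the same job against the new copy and compare it with your 256 MB numbers.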



Source: https://stackoverflow.com/questions/28134288/block-size-in-hadoop
