Hadoop input split size vs block size

爱一瞬间的悲伤 2020-12-01 01:27

I am going through Hadoop: The Definitive Guide, which clearly explains input splits. It goes like

Input splits doesn’t contain actual data, rath

7 answers

春和景丽 (original poster)

2020-12-01 02:16

A block is the physical representation of data. A split is the logical representation of the data present in blocks.

Both block size and split size can be changed through configuration properties.

A map reads data from blocks through splits, i.e. the split acts as a broker between the block and the mapper.

Consider two blocks:

Block 1

aa bb cc dd ee ff gg hh ii jj

Block 2

ww ee yy uu oo ii oo pp kk ll nn

Now a map reads block 1 from aa to jj and doesn't know how to read block 2, i.e. a block knows nothing about the data held in a different block. This is where a split comes in: it forms a logical grouping of block 1 and block 2 as a single unit, then the InputFormat and RecordReader turn it into offset (key) and line (value) pairs and send them to the map for further processing.
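To make the "logical grouping" idea concrete, here is a minimal Python sketch of how a line-oriented reader can hand out (offset, line) pairs per split even when a line straddles a boundary. It is a simplified model, not Hadoop's code: the function name read_split and the sample data are made up for illustration, and the rule it assumes (a split that does not start at offset 0 skips its first line, and every split keeps reading until it has finished the line that crosses its end) is only meant to resemble how line record readers typically behave.

```python
def read_split(data: bytes, start: int, length: int):
    """Return (offset, line) records for one split of `data` (hypothetical helper)."""
    end = start + length
    pos = start
    if start != 0:
        # The first (possibly partial) line belongs to the previous split.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # Read every line that begins at or before the split's end,
    # even if the line itself runs past the boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        records.append((pos, data[pos:line_end].decode()))
        pos = line_end + 1
    return records


# Three 9-byte lines, cut into 12-byte splits so the second line crosses a boundary.
data = b"aa bb cc\ndd ee ff\ngg hh ii\n"
splits = [(0, 12), (12, 12), (24, 3)]
for start, length in splits:
    print(start, read_split(data, start, length))
```

Each line comes out exactly once, with its byte offset as the key, regardless of where the boundaries fall.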

If your resources are limited and you want to limit the number of maps, you can increase the split size. For example, if a 640 MB file is stored as 10 blocks of 64 MB each and resources are limited, you can set the split size to 128 MB; logical groupings of 128 MB are then formed, and only 5 maps are executed, each processing 128 MB.
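The arithmetic behind that example can be sketched in Python. This mirrors the rule used by Hadoop's FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)) and the last chunk is allowed to be up to 10% larger than a split; the minimum and maximum come from the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties. The Python function names below are illustrative only, and the sketch ignores details such as multiple files and non-splittable inputs.

```python
MB = 1024 * 1024


def compute_split_size(block_size: int, min_size: int = 1, max_size: int = 2**63 - 1) -> int:
    """Split size as max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size: int, split_size: int, slop: float = 1.1) -> int:
    """Number of splits for one splittable file, with a 10% slop on the last one."""
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


# 640 MB file, 64 MB blocks, minimum split size raised to 128 MB.
split_size = compute_split_size(block_size=64 * MB, min_size=128 * MB)
print(split_size // MB, count_splits(640 * MB, split_size))
```

With the default minimum, the split size falls back to the block size and the same file yields 10 splits, one per block.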

If splitting is turned off (for example, the input format's isSplitable returns false), the whole file forms one input split and is processed by a single map, which takes more time when the file is big.
