Hadoop input split size vs block size

爱一瞬间的悲伤 2020-12-01 01:27

I am going through Hadoop: The Definitive Guide, where it clearly explains input splits. It goes like this:

Input splits don't contain actual data; rather, they hold references (storage locations) to the data.
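To make that quote concrete, here is a minimal sketch (not from the book) that lists the splits an input format would compute for a job. The input path is made up, and it assumes the standard new-API FileInputFormat/TextInputFormat classes; the point is that each split carries only a path, an offset, a length and preferred host names, never the file's bytes.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-inspector");
        // Hypothetical input path, purely for illustration.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Ask the input format for the splits it would hand to map tasks.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            FileSplit fs = (FileSplit) split;
            // A split is only (path, start offset, length, preferred hosts);
            // it does not contain any of the file's data.
            System.out.printf("%s offset=%d length=%d hosts=%s%n",
                    fs.getPath(), fs.getStart(), fs.getLength(),
                    String.join(",", fs.getLocations()));
        }
    }
}
```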

7 Answers
  •  青春惊慌失措
    2020-12-01 02:18

    To 1) and 2): I'm not 100% sure, but if a task cannot complete, for whatever reason, including something being wrong with its input split, it is killed and another attempt is started in its place. Each map task gets exactly one split carrying the file information (you can quickly check this by debugging against a local cluster and inspecting what the InputSplit object holds; I seem to recall it is just the one location).
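    A quick way to see this for yourself, roughly as described above, is to log the split from inside the mapper. This is only a sketch: the class name and the pass-through map logic are made up, but context.getInputSplit() and FileSplit are the standard new-API types for file-based input formats.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative mapper that logs the single split it was given.
public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Each map task is handed exactly one InputSplit; for file-based
        // input formats it is a FileSplit with path, offset, length and hosts.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.err.println("Processing " + split.getPath()
                + " [" + split.getStart() + ", "
                + (split.getStart() + split.getLength()) + ")"
                + " preferred hosts: " + String.join(",", split.getLocations()));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass records through unchanged; the point here is only the setup() log.
        context.write(value, key);
    }
}
```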

    To 3): if the file format is splittable, Hadoop will try to cut the file into chunks of the input-split size; if not, it is one map task per file, regardless of the file size. If you raise the minimum split size, you can stop an excessive number of mapper tasks from being spawned when each input file would otherwise be divided at the block size, but combining several small inputs into one split requires some extra work with CombineFileInputFormat (I think that's what it's called).
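    As a rough sketch of the knobs mentioned above (the 256 MB figures and the driver class name are arbitrary, and CombineTextInputFormat is only one way to pack small files into a single split):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        // Hypothetical input directory, purely illustrative.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Raise the minimum split size to 256 MB so a splittable file yields
        // fewer, larger splits than the default of one split per HDFS block.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // To pack many small files into one split, swap the input format for
        // CombineTextInputFormat and cap the combined split size.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```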
