Hadoop input split size vs block size

爱一瞬间的悲伤 2020-12-01 01:27

I am going through Hadoop: The Definitive Guide, where it clearly explains input splits. It goes like this:

Input splits don't contain actual data; rather, they hold references (storage locations) to the data.
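To make that quote concrete, here is a minimal sketch (not from the book) that lists the splits an input format would compute for a job. The input path is made up, and it assumes the standard new-API FileInputFormat/TextInputFormat classes; the point is that each split carries only a path, an offset, a length and preferred host names, never the file's bytes.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-inspector");
        // Hypothetical input path, purely for illustration.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Ask the input format for the splits it would hand to map tasks.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            FileSplit fs = (FileSplit) split;
            // A split is only (path, start offset, length, preferred hosts);
            // it does not contain any of the file's data.
            System.out.printf("%s offset=%d length=%d hosts=%s%n",
                    fs.getPath(), fs.getStart(), fs.getLength(),
                    String.join(",", fs.getLocations()));
        }
    }
}
```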

7 Answers
  •  青春惊慌失措
    2020-12-01 02:18

    To 1) and 2): I'm not 100% sure, but if a task cannot complete, for whatever reason, including something being wrong with its input split, it is killed and another attempt is started in its place. Each map task gets exactly one split carrying the file information (you can quickly check this by debugging against a local cluster and inspecting what the InputSplit object holds; I seem to recall it is just the one location).
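    A quick way to see this for yourself, roughly as described above, is to log the split from inside the mapper. This is only a sketch: the class name and the pass-through map logic are made up, but context.getInputSplit() and FileSplit are the standard new-API types for file-based input formats.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative mapper that logs the single split it was given.
public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Each map task is handed exactly one InputSplit; for file-based
        // input formats it is a FileSplit with path, offset, length and hosts.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.err.println("Processing " + split.getPath()
                + " [" + split.getStart() + ", "
                + (split.getStart() + split.getLength()) + ")"
                + " preferred hosts: " + String.join(",", split.getLocations()));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass records through unchanged; the point here is only the setup() log.
        context.write(value, key);
    }
}
```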

    To 3): if the file format is splittable, Hadoop will try to cut the file into chunks of the input-split size; if not, it is one map task per file, regardless of the file size. If you raise the minimum split size, you can stop an excessive number of mapper tasks from being spawned when each input file would otherwise be divided at the block size, but combining several small inputs into one split requires some extra work with CombineFileInputFormat (I think that's what it's called).
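    As a rough sketch of the knobs mentioned above (the 256 MB figures and the driver class name are arbitrary, and CombineTextInputFormat is only one way to pack small files into a single split):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        // Hypothetical input directory, purely illustrative.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Raise the minimum split size to 256 MB so a splittable file yields
        // fewer, larger splits than the default of one split per HDFS block.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // To pack many small files into one split, swap the input format for
        // CombineTextInputFormat and cap the combined split size.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```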
