What is the default size that each Hadoop mapper will read?

轮回少年 2021-01-01 05:55

Is it the block size of 64 MB for HDFS? Is there any configuration parameter that I can use to change it?

For a mapper reading gzip files, is it true that the number

1 answer
  •  梦毁少年i
    2021-01-01 06:59

    This depends on several things:

    • Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything extended from FileInputFormat will use the block boundaries as guides.
    • File block size - individual files don't need to have the same block size as the default block size. This is set when the file is uploaded into HDFS - if not explicitly set, then the default block size (at the time of upload) is applied. Any change to the default / system block size after the file is uploaded has no effect on the already uploaded file.
    • Split size limits - the two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE, but if these are overridden in your system configuration, or in your job, then this will change the amount of data processed by each mapper, and the number of mapper tasks spawned.
    • Non-splittable compression - a file compressed with a codec such as gzip cannot be processed by more than a single mapper, so you'll get 1 mapper per gzip file (unless you're using something like CombineFileInputFormat, CompositeInputFormat).
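
To make the interaction between block size and the two split-size properties concrete, here is a minimal sketch of the clamping rule that, as far as I know, the newer FileInputFormat applies: the block size is bounded below by the minimum and above by the maximum. The class name SplitSizeSketch and the constant MB are made up for illustration; this is plain Java with no Hadoop dependency, not Hadoop's own code.

```java
// Illustrative sketch only: SplitSizeSketch and MB are hypothetical names.
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Clamp the file's block size between the configured min and max split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64 * MB;

        // Usual defaults (min = 1, max = Long.MAX_VALUE): one split per block.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) / MB);

        // Minimum raised above the block size: splits grow to the minimum.
        System.out.println(computeSplitSize(block, 128 * MB, Long.MAX_VALUE) / MB);

        // Maximum lowered below the block size: splits shrink to the maximum.
        System.out.println(computeSplitSize(block, 1L, 32 * MB) / MB);
    }
}
```

With a 64 MB block, the three calls give split sizes of 64 MB, 128 MB and 32 MB respectively, which is why the defaults leave each mapper reading roughly one block.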

    So if you have a file with a block size of 64 MB, but want to process either more or less than this per map task, then you should just be able to set the following job configuration properties:

    • mapred.min.split.size - larger than the default, if you want to use fewer mappers, at the expense of (potentially) losing data locality (the data processed by a single map task may now be spread over 2 or more data nodes)
    • mapred.max.split.size - smaller than the default, if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
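
As a rough illustration of how the chosen split size turns into a number of map tasks for one non-empty file, the sketch below counts splits for a given file length. The names MapperCountSketch and countSplits are hypothetical, and the 10% tolerance on the last split (SPLIT_SLOP) reflects my understanding of FileInputFormat's behaviour rather than something stated in this answer, so treat it as an assumption.

```java
// Illustrative sketch only: MapperCountSketch and countSplits are hypothetical names.
class MapperCountSketch {
    static final long MB = 1024L * 1024L;

    // Assumption: a trailing remainder of up to 10% over the split size
    // is folded into the last split instead of getting its own mapper.
    static final double SPLIT_SLOP = 1.1;

    static long countSplits(long fileLength, long splitSize, boolean splittable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        // Non-splittable input (for example a gzip file) goes to a single mapper.
        if (!splittable) {
            return 1;
        }
        long splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        return splits + 1;
    }

    public static void main(String[] args) {
        long file = 1024 * MB; // a 1 GB file
        System.out.println(countSplits(file, 64 * MB, true));   // split size = block size
        System.out.println(countSplits(file, 128 * MB, true));  // larger min split size
        System.out.println(countSplits(file, 32 * MB, true));   // smaller max split size
        System.out.println(countSplits(file, 64 * MB, false));  // gzip-style input
    }
}
```

For the 1 GB file this gives 16, 8, 32 and 1 map tasks, matching the trade-off described above: a larger minimum means fewer mappers, a smaller maximum means more.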

    If you're using MR2 / YARN then the above properties are deprecated and replaced by:

    • mapreduce.input.fileinputformat.split.minsize
    • mapreduce.input.fileinputformat.split.maxsize
