Hadoop's input splitting: how does it work?

长情又很酷 2021-01-16 20:15

I know a little about Hadoop.

I am curious to know how it works.

To be precise, I want to know how exactly it divides/splits the input file.

Does it

2 Answers
  •  [愿得一人]
    2021-01-16 20:42

    When you submit a MapReduce job (or a Pig/Hive job), Hadoop first calculates the input splits; the size of each input split generally equals the HDFS block size. For example, a 1 GB file yields 16 input splits if the block size is 64 MB. However, the split size can be configured to be smaller or larger than the HDFS block size. The calculation of input splits is done by FileInputFormat. For each of these input splits, a map task is started.
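    As a rough illustration of that arithmetic, here is a minimal sketch in plain Java (not Hadoop's actual code; the 10% slack factor mirrors the SPLIT_SLOP constant in FileInputFormat, which allows the last split to run slightly over the split size):

    public class SplitCountDemo {
        // Hadoop lets the final split be up to 10% larger than splitSize
        // (the SPLIT_SLOP constant in FileInputFormat).
        private static final double SPLIT_SLOP = 1.1;

        static int countSplits(long fileSize, long splitSize) {
            int splits = 0;
            long bytesRemaining = fileSize;
            while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
                splits++;
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining > 0) {
                splits++; // the final, possibly smaller, split
            }
            return splits;
        }

        public static void main(String[] args) {
            long oneGB = 1024L * 1024 * 1024;
            long blockSize = 64L * 1024 * 1024; // 64 MB
            System.out.println(countSplits(oneGB, blockSize)); // prints 16
        }
    }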

    But you can change the size of the input splits by configuring the following properties:

    mapred.min.split.size: The minimum size, in bytes, of a chunk that map input should be split into.
    mapred.max.split.size: The largest valid size, in bytes, for a file split.
    dfs.block.size: The default block size for new files.


    And the formula for the input split size is:

    splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));

    where minSplitSize and maxSplitSize come from the two properties above.
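    With the newer org.apache.hadoop.mapreduce API, a minimal sketch of overriding the split size in a job driver could look like this (the 128 MB target is just an illustrative value; setMinInputSplitSize/setMaxInputSplitSize set the same min/max split-size properties listed above, under their newer names):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfigDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Force splits of roughly 128 MB regardless of the HDFS block size.
            long targetSplit = 128L * 1024 * 1024;
            FileInputFormat.setMinInputSplitSize(job, targetSplit);
            FileInputFormat.setMaxInputSplitSize(job, targetSplit);

            // With a 64 MB block size, the formula above gives:
            // splitSize = max(128MB, min(128MB, 64MB)) = 128MB
        }
    }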

