Split size vs Block size in Hadoop

隐瞒了意图╮ 2020-12-01 03:00

What is the relationship between split size and block size in Hadoop? As I read in this, split size must be n times the block size (n is an integer and n > 0). Is this correct?

3 answers
  •  暖寄归人
    2020-12-01 03:58

    • Assume we have a 400MB file that consists of 4 records (e.g., a 400MB CSV file with 4 rows of 100MB each).

    • If the HDFS block size is configured as 128MB, the 4 records will not be distributed evenly among the blocks. It will look like this:

    • Block 1 contains the entire first record and a 28MB chunk of the second record.
    • If a mapper were run on Block 1 alone, it could not process the second record, since the block does not contain the entire record.

    • This is exactly the problem that input splits solve. Input splits respect logical record boundaries.

    • Let's assume the input split size is 200MB.

    • Input split 1 therefore contains both record 1 and record 2. Input split 2 does not start with record 2, since record 2 has already been assigned to input split 1; it starts with record 3.

    • This is why an input split is only a logical chunk of data: it points to start and end locations within blocks.

    • If the input split size is n times the block size, one input split spans multiple blocks, so fewer mappers are needed for the whole job and there is less parallelism. (The number of mappers equals the number of input splits.)

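    • Input split size = block size is usually the ideal configuration, because each mapper can then read its split from a single block, preferably one stored on the same node.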
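
The arithmetic in the example above can be checked with a small sketch. It is a simplified model, not Hadoop code: the helper names (`block_layout`, `split_size_mb`, `mapper_count`) are made up for illustration, sizes are whole megabytes, and the mapper count is a plain ceiling division. The `max(min, min(max, block))` formula is, as far as I know, how Hadoop's `FileInputFormat` derives the split size from the block size and the configured minimum and maximum split sizes, but treat that as an assumption and check it against your Hadoop version.

```python
import math


def block_layout(record_sizes_mb, block_size_mb):
    """For each block, list (record_number, MB of that record stored in the block)."""
    blocks, current, free = [], [], block_size_mb
    for number, size in enumerate(record_sizes_mb, start=1):
        remaining = size
        while remaining > 0:
            piece = min(remaining, free)
            current.append((number, piece))
            remaining -= piece
            free -= piece
            if free == 0:  # block is full, start a new one
                blocks.append(current)
                current, free = [], block_size_mb
    if current:  # last, partially filled block
        blocks.append(current)
    return blocks


def split_size_mb(block_size_mb, min_size_mb, max_size_mb):
    """Assumed split-size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size_mb, min(max_size_mb, block_size_mb))


def mapper_count(file_size_mb, split_mb):
    """Simplified: one mapper per split, splits counted by ceiling division."""
    return math.ceil(file_size_mb / split_mb)


layout = block_layout([100, 100, 100, 100], 128)
for i, block in enumerate(layout, start=1):
    print(f"Block {i}: {block}")

default_split = split_size_mb(128, min_size_mb=1, max_size_mb=10**6)
larger_split = split_size_mb(128, min_size_mb=200, max_size_mb=10**6)
print(default_split, mapper_count(400, default_split))
print(larger_split, mapper_count(400, larger_split))
```

In this model, Block 1 holds all of record 1 plus 28MB of record 2, which matches the layout described above. Raising the minimum split size to 200MB yields a 200MB split and halves the number of mappers for the 400MB file, so the split size does not have to be an integer multiple of the block size.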

    Hope this helps.
