Hadoop input split size vs block size

爱一瞬间的悲伤 2020-12-01 01:27

I am going through Hadoop: The Definitive Guide, which clearly explains input splits. It goes like this:

Input splits don't contain actual data, rath…

7 Answers
  忘掉有多难
    2020-12-01 02:17

    • The answer by @user1668782 is a great explanation of the question, and I'll try to give a graphical depiction of it.

    • Assume we have a 400 MB file that consists of 4 records (e.g. a 400 MB CSV file with 4 rows of 100 MB each).

    • If the HDFS block size is configured as 128 MB, the 4 records will not align with the block boundaries: the file is stored as four blocks (128 MB, 128 MB, 128 MB, 16 MB), so record boundaries fall in the middle of blocks.

    • Block 1 contains the entire first record and a 28 MB chunk of the second record.
    • If a mapper were run directly on Block 1, it could not process the second record, since the block does not contain that record in full.
    • This is the exact problem that input splits solve: input splits respect logical record boundaries.

    • Let's assume the input split size is 200 MB.

    • Input split 1 therefore contains both record 1 and record 2. Input split 2 does not start with record 2, since record 2 has already been assigned to input split 1; it starts with record 3.

    • This is why an input split is only a logical chunk of data: it points to start and end locations within blocks.
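    The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop code; it assumes the fixed 100 MB records and the 128 MB block / 200 MB split sizes from the example, and the simplified rule that a record belongs to the split in which it starts (Hadoop's record readers read past the split end to finish the last record and skip a partial record at the start).

```python
RECORD_MB = 100   # each record is 100 MB, as in the example
FILE_MB = 400     # 4 records total
BLOCK_MB = 128    # HDFS block size
SPLIT_MB = 200    # input split size

# Byte (here: MB) offsets where each record starts: [0, 100, 200, 300]
record_starts = list(range(0, FILE_MB, RECORD_MB))

def blocks_spanned(rec):
    """1-based list of HDFS blocks that record `rec` (1-based) overlaps."""
    start = record_starts[rec - 1]
    end = start + RECORD_MB
    n_blocks = -(-FILE_MB // BLOCK_MB)  # ceiling division -> 4 blocks
    return [b + 1 for b in range(n_blocks)
            if start < (b + 1) * BLOCK_MB and end > b * BLOCK_MB]

def records_in_split(split_start, split_end):
    """Records assigned to a split: those whose start offset lies in it."""
    return [i + 1 for i, s in enumerate(record_starts)
            if split_start <= s < split_end]

if __name__ == "__main__":
    for rec in range(1, 5):
        print(f"record {rec} spans blocks {blocks_spanned(rec)}")
    splits = [(s, min(s + SPLIT_MB, FILE_MB))
              for s in range(0, FILE_MB, SPLIT_MB)]
    for n, (start, end) in enumerate(splits, 1):
        print(f"split {n} ({start}-{end} MB): records {records_in_split(start, end)}")
```

    Running it shows that record 2 physically straddles blocks 1 and 2, yet logically belongs entirely to split 1, while split 2 begins cleanly at record 3.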

    Hope this helps.
