Split size vs Block size in Hadoop


隐瞒了意图╮ 2020-12-01 03:00

What is the relationship between split size and block size in Hadoop? As I read in this, the split size must be n times the block size (n is an integer and n > 0). Is this correct?

3 Answers

暖寄归人 (original poster)

2020-12-01 03:58
- Assume we have a 400MB file that consists of 4 records (e.g. a 400MB CSV file with 4 rows of 100MB each).
- If the HDFS block size is configured as 128MB, the 4 records will not be distributed evenly among the blocks.
- For example, Block 1 contains the entire first record and a 28MB chunk of the second record.
- If a mapper were run on Block 1 alone, it could not process the second record, since it would not have the entire record.
- This is exactly the problem that input splits solve. Input splits respect logical record boundaries.
- Let's assume the input split size is 200MB.
- Input split 1 should then contain both record 1 and record 2. Input split 2 will not start with record 2, since record 2 has already been assigned to input split 1. Input split 2 will start with record 3.
- This is why an input split is only a logical chunk of data. It points to start and end locations within blocks.
- If the input split size is n times the block size, one input split spans multiple blocks, so fewer mappers are needed for the whole job and there is less parallelism. (The number of mappers equals the number of input splits.)
- Input split size = block size is the ideal configuration in most cases, since each split then corresponds to a single block that a mapper can usually read locally.
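To make the numbers above concrete, here is a rough Python sketch of the same example. It is a simplified model, not Hadoop code: `block_layout`, `assign_records_to_splits`, and `num_splits` are hypothetical helper names, and records are assigned to the split that contains their first byte, whereas Hadoop's real record readers handle boundary cases in their own way. `compute_split_size` mirrors, as far as I know, the `max(minSize, min(maxSize, blockSize))` rule in `FileInputFormat.computeSplitSize`, where the min and max come from `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`.

```python
MB = 1024 * 1024


def block_layout(record_sizes, block_size):
    """For each physical block, list (record_number, bytes_of_record_in_block)."""
    total = sum(record_sizes)
    layout = []
    block_start = 0
    while block_start < total:
        block_end = min(block_start + block_size, total)
        pieces = []
        rec_start = 0
        for number, size in enumerate(record_sizes, start=1):
            rec_end = rec_start + size
            # Bytes of this record that physically sit inside this block.
            overlap = min(block_end, rec_end) - max(block_start, rec_start)
            if overlap > 0:
                pieces.append((number, overlap))
            rec_start = rec_end
        layout.append(pieces)
        block_start = block_end
    return layout


def assign_records_to_splits(record_sizes, split_size):
    """Simplified model: a record belongs to the split holding its first byte."""
    splits = {}
    offset = 0
    for number, size in enumerate(record_sizes, start=1):
        splits.setdefault(offset // split_size, []).append(number)
        offset += size
    return [splits[index] for index in sorted(splits)]


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def num_splits(file_size, split_size):
    """Approximate split (and mapper) count; ignores small-tail handling."""
    return -(-file_size // split_size)  # ceiling division


if __name__ == "__main__":
    records = [100 * MB] * 4  # the 400MB file with four 100MB records
    for i, pieces in enumerate(block_layout(records, 128 * MB), start=1):
        print("Block", i, [(n, b // MB) for n, b in pieces])
    print("Splits at 200MB:", assign_records_to_splits(records, 200 * MB))
    print("Mappers at 128MB:", num_splits(400 * MB, 128 * MB))
    print("Mappers at 200MB:", num_splits(400 * MB, 200 * MB))
```

Note that in this model the split size does not have to be a multiple of the block size. With the default min and max, the split size simply comes out equal to the block size, and raising the minimum above the block size is what produces larger splits and fewer mappers.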
Hope this helps.