Hadoop input split size vs block size

爱一瞬间的悲伤 2020-12-01 01:27

I am going through Hadoop: The Definitive Guide, which clearly explains input splits. It goes like this:

Input splits don't contain actual data, rath…

7 Answers
  忘掉有多难
    2020-12-01 02:17

    • The answer by @user1668782 is a great explanation of the question, and I'll try to give a graphical depiction of it.

    • Assume we have a 400 MB file that consists of 4 records (e.g. a 400 MB CSV file with 4 rows of 100 MB each).

    • If the HDFS block size is configured as 128 MB, the 4 records will not align with the block boundaries: the file is stored as four blocks (128 MB, 128 MB, 128 MB, 16 MB), so record boundaries fall in the middle of blocks.

    • Block 1 contains the entire first record and a 28 MB chunk of the second record.
    • If a mapper were run directly on Block 1, it could not process the second record, since the block does not contain that record in full.
    • This is the exact problem that input splits solve: input splits respect logical record boundaries.

    • Let's assume the input split size is 200 MB.

    • Input split 1 therefore contains both record 1 and record 2. Input split 2 does not start with record 2, since record 2 has already been assigned to input split 1; it starts with record 3.

    • This is why an input split is only a logical chunk of data: it points to start and end locations within blocks.
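    The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop code; it assumes the fixed 100 MB records and the 128 MB block / 200 MB split sizes from the example, and the simplified rule that a record belongs to the split in which it starts (Hadoop's record readers read past the split end to finish the last record and skip a partial record at the start).

```python
RECORD_MB = 100   # each record is 100 MB, as in the example
FILE_MB = 400     # 4 records total
BLOCK_MB = 128    # HDFS block size
SPLIT_MB = 200    # input split size

# Byte (here: MB) offsets where each record starts: [0, 100, 200, 300]
record_starts = list(range(0, FILE_MB, RECORD_MB))

def blocks_spanned(rec):
    """1-based list of HDFS blocks that record `rec` (1-based) overlaps."""
    start = record_starts[rec - 1]
    end = start + RECORD_MB
    n_blocks = -(-FILE_MB // BLOCK_MB)  # ceiling division -> 4 blocks
    return [b + 1 for b in range(n_blocks)
            if start < (b + 1) * BLOCK_MB and end > b * BLOCK_MB]

def records_in_split(split_start, split_end):
    """Records assigned to a split: those whose start offset lies in it."""
    return [i + 1 for i, s in enumerate(record_starts)
            if split_start <= s < split_end]

if __name__ == "__main__":
    for rec in range(1, 5):
        print(f"record {rec} spans blocks {blocks_spanned(rec)}")
    splits = [(s, min(s + SPLIT_MB, FILE_MB))
              for s in range(0, FILE_MB, SPLIT_MB)]
    for n, (start, end) in enumerate(splits, 1):
        print(f"split {n} ({start}-{end} MB): records {records_in_split(start, end)}")
```

    Running it shows that record 2 physically straddles blocks 1 and 2, yet logically belongs entirely to split 1, while split 2 begins cleanly at record 3.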

    Hope this helps.
