Behavior of the parameter “mapred.min.split.size” in HDFS

轮回少年 2020-12-02 18:05

Does the parameter "mapred.min.split.size" change the size of the block in which the file was written earlier? Assuming a situation where I, when starting my JOB, pass the par

2 answers
  •  小蘑菇 (original poster)
    2020-12-02 18:25

    Assume that the minimum split size is set to 128 MB and the HDFS block size is 64 MB.

    NOTE: By default, HDFS replicates each block to 3 different datanodes, and by default each map task operates on a single block.

    The parameter does not change how the file is stored: the blocks written earlier stay 64 MB. It only changes how those blocks are grouped into input splits. With a 128 MB minimum split size, 2 blocks are treated as a single split, and a single map task is created for it, running on a single datanode. This comes at the cost of data locality: a block that does not reside on the datanode where the map task runs has to be fetched over the network from another datanode before it can be processed, which lowers performance.

    However, if we consider a file of size 128 MB with a block size of 64 MB and a minimum split size that does not exceed the block size (64 MB here), then, as normally happens, two map tasks are created, one for each 64 MB block.
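
The two scenarios above can be checked with a small sketch. It assumes the split size rule used by Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)), and ignores the small slack Hadoop allows on the last split. The function names `compute_split_size` and `plan_splits` are made up for this illustration and are not part of any Hadoop API.

```python
MB = 1024 * 1024
NO_LIMIT = 2**63 - 1  # stands in for an unset maximum split size


def compute_split_size(block_size, min_size=1, max_size=NO_LIMIT):
    """Clamp the block size between the minimum and maximum split size."""
    return max(min_size, min(max_size, block_size))


def plan_splits(file_size, block_size, min_size=1, max_size=NO_LIMIT):
    """Return one (offset, length, block_indices) tuple per split (one map task each).

    Simplification (assumption): the last-split slack used by Hadoop is ignored.
    """
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        first_block = offset // block_size
        last_block = (offset + length - 1) // block_size
        splits.append((offset, length, list(range(first_block, last_block + 1))))
        offset += length
    return splits


if __name__ == "__main__":
    # 128 MB minimum split size: one split spanning both 64 MB blocks.
    for split in plan_splits(128 * MB, 64 * MB, min_size=128 * MB):
        print("min split 128 MB:", split)
    # Minimum split size of 64 MB: one split per block.
    for split in plan_splits(128 * MB, 64 * MB, min_size=64 * MB):
        print("min split 64 MB:", split)
```

In the first case the single split lists two block indices, so the map task may have to read one of them from a remote datanode. In the second case each split maps to exactly one block, so each task can be scheduled next to its data.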
