About MR inputsplit

瘦欲@ 提交于 2019-12-10 10:55:03

问题


As i understood that File splitting at the time of copying the file into HDFS and input splits on file for mapper input are entirely two different approaches.

Here it is my Question--

Suppose if my File1 size is 128MB which was split ted into two blocks and stored in two different data nodes (Node1,Node2) in hadoop cluster. I want to run MR job on this file and got two input splits of the sizes are 70MB and 58 MB respectively. First mapper will run on node1 by taking the input split data (Of size 70 MB) but Node1 has 64MB data only and remaining 6 MB data presented in Node2.

To complete Map task on Node1, Does hadoop transfer 6MB of data from Node2 to Node1? if yes, what if Node1 do not have enough storage to store 6MB data from Node2.

My apologies if my concern is awkward.


回答1:


64 MB of data will be written in Node 1 and 6 MB of data will be written in Node 2.

Map Reduce algorithm does not work on physical blocks of the file. It works on logical input splits. Input split depends on where the record was written. A record may span two Mappers.

In your example, assume that record start after 63 KB of data and length of record is 2 MB. In that case, 1 MB is part of Node 1 and other 1 MB is part of Node 2. Other 1 MB of data will be transferred from Node 2 to Node 1 during Map operation.

Have a look at below picture for better understanding of logical split Vs physical blocks

Have a look at some SE questions :

How does Hadoop process records split across block boundaries?

About Hadoop/HDFS file splitting

MapReduce data processing is driven by this concept of input splits. The number of input splits that are calculated for a specific application determines the number of mapper tasks.

Each of these mapper tasks is assigned, where possible, to a slave node where the input split is stored. The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits are processed locally.

If data locality can't be achieved due to input splits crossing boundaries of data nodes, some data will be transferred from one Data node to other Data node.



来源:https://stackoverflow.com/questions/33763772/about-mr-inputsplit

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!