Hadoop: setting the maximum number of simultaneous map/reduce tasks does not work in pseudo-distributed mode

Question

I configured Hadoop 2.4.1 on a single (4-core) machine in pseudo-distributed mode, and I am able to run my map/reduce program on HDFS input files via the hadoop shell command.

But I noticed that the map and reduce tasks still appear to run in a single thread. So I tried hard-coding the properties mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum, both to 4. (This is just for testing; I know it is not an ideal setting.) But I still see the map and reduce tasks running serially.

I configured this by modifying etc/hadoop/mapred-site.xml to include the following:

<configuration>
    <property>
        <name>mapreduce.tasktracker.map.tasks.maximum</name>
        <value>4</value>
    </property>

    <property>
        <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
    </property>
</configuration>

Then I restarted the TaskTracker using the commands:

sbin/hadoop-daemon.sh stop tasktracker
sbin/hadoop-daemon.sh start tasktracker

This follows the article here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W265aa64a4f21_43ee_b236_c42a1c875961/page/Tuning%20number%20of%20map%20and%20reduce%20slots%20on%20a%20TaskTracker%20node

I concluded that it still runs single-threaded by overriding the constructors of my mapper and reducer classes to print a message whenever an object is constructed. The output shows the mappers being constructed one at a time, spread evenly over the period in which the mappers run, and likewise the reducers being constructed one at a time over the period in which the reducers run.

What am I missing here?

Answer 1:

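I figured out that starting and stopping a TaskTracker is no longer supported in the version of Hadoop I am using: Hadoop 2.x replaces the JobTracker and TaskTracker daemons with YARN's ResourceManager and NodeManager. There is too much confusing information around for different versions, and it all gets mixed up.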

After I configured and started YARN (following https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/SingleCluster.html), the map and reduce tasks do appear to run with some concurrency. On a larger data set (about 2 minutes of running time), allowing a maximum of 2 concurrent map tasks and 2 concurrent reduce tasks saves about 10 seconds, which makes some sense.
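For reference, the YARN setup described in that single-node guide comes down to two small configuration changes. The fragment below is a sketch of those two files as I understand them from the 2.4.1 documentation, not a copy of my exact files; the two <configuration> elements belong in separate files, as the comments indicate. As far as I can tell, when mapreduce.framework.name is left unset it defaults to local, in which case the job runs inside a single local JVM rather than on YARN.

```xml
<!-- etc/hadoop/mapred-site.xml: submit MapReduce jobs to YARN
     instead of the default local runner -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

<!-- etc/hadoop/yarn-site.xml (a separate file): enable the shuffle
     service that reducers use to fetch map output -->
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```

The ResourceManager and NodeManager are then started with sbin/start-yarn.sh and stopped with sbin/stop-yarn.sh, which takes the place of the tasktracker commands from the question.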

It also looks to me as though the two parameters mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum no longer take effect, though I have not found any documentation confirming that.

Instead, YARN takes full control of resource management: the concept of a slot is gone, replaced by containers, vcores, and so on. The combined settings described at the link below determine how much concurrency a node can support.

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_yarn_tuning.html
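To make the container idea concrete, here is a sketch of the kind of settings that, as I understand it, bound how many tasks can run at once on a node. The property names are standard YARN and MapReduce properties; the values are purely illustrative assumptions for a small 4-core machine, not recommendations and not the values I actually used. Again, the two <configuration> elements go in separate files.

```xml
<!-- etc/hadoop/yarn-site.xml: resources the NodeManager offers to containers
     (illustrative values only) -->
<configuration>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
    </property>
</configuration>

<!-- etc/hadoop/mapred-site.xml (a separate file): resources requested
     per map or reduce task container (illustrative values only) -->
<configuration>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>mapreduce.map.cpu.vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>mapreduce.reduce.cpu.vcores</name>
        <value>1</value>
    </property>
</configuration>
```

With numbers like these, memory alone would allow at most about 8192 / 1024 = 8 task containers on the node, less whatever the job's ApplicationMaster container takes. Whether the vcore settings also limit concurrency seems to depend on how the scheduler is configured, so treat that part as unconfirmed.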

This is only my own understanding so far, and it needs further confirmation.

Source: https://stackoverflow.com/questions/33130749/hadoop-setting-maxium-simultaneous-map-reduce-task-does-not-work-in-psedue-mode

Tags

java

multithreading

Hadoop

MapReduce