Question
Reading the syslog generated by Hadoop, I can see lines similar to this one:
2013-05-06 16:32:45,118 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 84
Does anyone know how this value is computed? And how can I get this value in my program?
Answer 1:
I grepped the source code of Hadoop and did not find the string Setting default number of map tasks based on cluster size to at all (whereas I do find other strings that are printed when running MR jobs). Furthermore, this string is not printed anywhere in my local installation. A Google search for it turns up problems on AWS with EMR.
As you confirmed, you're in fact using Amazon Elastic MapReduce. I believe EMR ships its own modifications to Hadoop's JobClient class, which output this particular line.
As far as computing this number is concerned, I would suspect it is derived from cluster characteristics such as the total number of (active) nodes in the cluster (N) and the number of map slots per node (M), i.e. N*M. However, additional AWS-specific resource (memory) constraints may also be taken into account. You'd have to ask in EMR-related forums for the exact formula.
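To make the suspected formula concrete, here is a minimal sketch. Note that N*M is my assumption about how EMR computes the number, not a documented formula; the sample node/slot counts are made up to match the 84 in the log line.

```java
// Hypothetical reconstruction of the suspected EMR formula:
// default map tasks = active nodes (N) * map slots per node (M).
public class DefaultMapTasks {
    static int defaultMapTasks(int activeNodes, int mapSlotsPerNode) {
        return activeNodes * mapSlotsPerNode; // N * M
    }

    public static void main(String[] args) {
        // e.g. 21 active nodes with 4 map slots each would yield 84,
        // the value shown in the question's log line (illustrative only).
        System.out.println(defaultMapTasks(21, 4));
    }
}
```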
Additionally, the JobClient exposes a set of information about the cluster. Using the method JobClient#getClusterStatus() you can access, via the returned ClusterStatus object, information like:
- Size of the cluster.
- Names of the trackers.
- Number of blacklisted/active trackers.
- Task capacity of the cluster.
- Number of currently running map and reduce tasks.
With these you can try to compute the desired number in your program manually.
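A sketch of that approach using the old "mapred" API (Hadoop 1.x). This needs a live cluster and the Hadoop jars on the classpath, so take it as an outline rather than a drop-in program:

```java
// Sketch: reading cluster characteristics via JobClient#getClusterStatus()
// (old org.apache.hadoop.mapred API; requires a running JobTracker).
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ClusterInfo.class);
        JobClient client = new JobClient(conf);
        ClusterStatus status = client.getClusterStatus(true);

        System.out.println("Active trackers     : " + status.getTaskTrackers());
        System.out.println("Blacklisted trackers: " + status.getBlacklistedTrackers());
        System.out.println("Map slot capacity   : " + status.getMaxMapTasks());
        System.out.println("Running map tasks   : " + status.getMapTasks());
        // getMaxMapTasks() is the cluster-wide map slot capacity, which is
        // the most plausible candidate for the number in the log line.
    }
}
```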
Answer 2:
By default this is set based on the size of your input: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. You are allowed to specify more mappers, but not fewer than the number determined by Hadoop.
You should be able to access this number by reading the configuration option "mapred.map.tasks". If you are using the old API, you can also get it from this method:
conf.getNumMapTasks();
This previous question, How to set the number of map tasks in hadoop 0.20?, has some good answers as well.
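For completeness, a short sketch showing both ways of reading the value with the old "mapred" API. It needs the Hadoop 1.x jars (and your cluster's config files) on the classpath to return meaningful numbers:

```java
// Sketch: two equivalent reads of the configured number of map tasks
// (old org.apache.hadoop.mapred API).
import org.apache.hadoop.mapred.JobConf;

public class MapTaskCount {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Read the raw property (falling back to 1 if unset here)...
        int viaProperty = conf.getInt("mapred.map.tasks", 1);
        // ...or use the typed getter, which reads the same property.
        int viaGetter = conf.getNumMapTasks();
        System.out.println(viaProperty + " " + viaGetter);
    }
}
```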
Answer 3:
It is primarily the InputFormat's duty to determine the number of mappers, and this is done based on the InputSplits created by the logic inside the getSplits(JobContext context) method of your InputFormat class. Specifying the number of mappers through the Job, config files, or the shell is just a hint to the framework and doesn't guarantee that you'll always get exactly the specified number of mappers.
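The split logic above can be sketched without any Hadoop dependency. This is a simplified model of what FileInputFormat's getSplits effectively does for a single file, assuming splits are cut at HDFS block boundaries (real getSplits also honors min/max split size settings and per-file block locations):

```java
// Simplified model: number of splits (and hence mappers) for one file,
// cutting at block boundaries. Not the real FileInputFormat logic.
public class SplitCount {
    static long numSplits(long inputBytes, long blockSizeBytes) {
        if (inputBytes == 0) return 0;
        // Ceiling division: every started block yields one split.
        return (inputBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        // e.g. a 1 GiB file with 128 MiB blocks -> 8 splits -> 8 mappers.
        System.out.println(numSplits(1L << 30, 128L << 20));
    }
}
```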
Source: https://stackoverflow.com/questions/16403576/hadoop-number-of-available-map-slots-based-on-cluster-size