How does Hadoop decide how many nodes will perform the Map and Reduce tasks?

感情迁移 提交于 2019-11-28 13:12:50

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

If you havve 10TB of input data and a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.

How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by ( < no. of nodes > * < no. of maximum containers per node > ).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is desired

Which nodes for Reduce tasks?

You can configure number of mappers and number of reducers per node as per Configuration parameters like mapreduce.tasktracker.reduce.tasks.maximum

if you set this parameter as zero, that node won't be considered for Reduce tasks. Otherwise, all nodes in the cluster are eligible for Reduce tasks.

Source : Map Reduce Tutorial from Apache.

Note: For a given Job, you can set mapreduce.job.maps & mapreduce.job.reduces. But it may not be effective. We should leave the decisions to Map Reduce Framework to decide on number of Map & Reduce tasks

EDIT:

How to decide which Reducer node?

Assume that you have equal reduce slots available on two nodes N1 and N2 and current load on N1 > N2, then , Reduce task will be assigned to N2. If both load and number of slots are same, whoever sends first heartbeat to resource manager will get the task. This is the code block for reduce assignment:http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/JobQueueTaskScheduler.java#207

how does Hadoop decides how many nodes will do map tasks

By default, the number of mappers will be same as the number of split (blocks) of the input to the mapreduce.

Now about the nodes, In the Hadoop 2, each node runs it own NodeManager (NM). The job of the NM is to manage the application container assigned to it by the Resourcemanager (RM). So basically, each of the task will be running in the individual container. To run the mapper tasks, ApplicationMaster negotiate the container from the ResourceManager. Once the containers are allocated, the NodeManager will launch the task and monitor it.

which nodes will do the reduce tasks?

Again the reduce tasks will also runs in the containers. The ApplicationMaster (per-application (job)) will negotiate the containers from the RM and launch the reducer tasks. Mostly they run on the different nodes then the Mapper nodes.

The default number of reducers for any job is 1. The number of reducers can be set in the job configuration.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!