Which runs first, Combiner or Partitioner, in a MapReduce job?

Submitted by 不羁岁月 on 2019-12-11 11:16:08

Question


I am confused because I have found two conflicting answers.

1) Hadoop: The Definitive Guide, 3rd edition, Chapter 6, "The Map Side", says: "Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

2) The Yahoo developers tutorial (Yahoo tutorial) says the combiner runs before the partitioner.

Can anyone please clarify which runs first?


Answer 1:


A MapReduce job may contain one or all of these phases:

  1. Map

  2. Combine

  3. Shuffle and Sort

  4. Reduce

The partitioner fits between the second and third phases.

You can visit this link for more details.

After going through related SE questions & articles,

What runs first: the partitioner or the combiner?

Who will get a chance to execute first, Combiner or Partitioner?

https://sreejithrpillai.wordpress.com/2014/11/24/implementing-partitioners-and-combiners-for-mapreduce/

we can see that opinion is divided.

But logically, I think the sequence is:

  1. The mapper writes its output to a circular buffer in memory.
  2. If there is more than one reducer and a partitioner is in place, the mapper output is partitioned.
  3. Once the buffer is full, its contents are spilled to disk.
  4. As per Hadoop: The Definitive Guide, "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

This implies that the partitioner runs first and the combiner then runs on the output data within each partition.
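To make the ordering concrete, here is a minimal sketch in plain Python that mimics a single spill as described in the quoted passage: records are assigned to partitions first, each partition is sorted by key, and only then is the combiner applied. This is a toy simulation, not Hadoop code; the names `partition`, `combine`, and `spill` are hypothetical, the partition function is a simple stand-in for a hash partitioner, and the combiner is assumed to be a word-count style sum.

```python
from collections import defaultdict


def partition(key, num_reducers):
    # Hypothetical stand-in for a hash partitioner: a deterministic
    # byte sum so the example gives the same result on every run.
    return sum(key.encode("utf-8")) % num_reducers


def combine(key, values):
    # Assumed combiner: sums the values for one key (word-count style).
    return key, sum(values)


def spill(buffer, num_reducers, combiner=None):
    # Step 1: assign every buffered record to a partition.
    partitions = defaultdict(list)
    for key, value in buffer:
        partitions[partition(key, num_reducers)].append((key, value))

    spill_file = {}
    for pid in sorted(partitions):
        # Step 2: sort by key within the partition.
        records = sorted(partitions[pid], key=lambda kv: kv[0])
        # Step 3: run the combiner, if any, on the sorted output.
        if combiner is not None:
            grouped = defaultdict(list)
            for key, value in records:
                grouped[key].append(value)
            records = [combiner(k, vs) for k, vs in grouped.items()]
        spill_file[pid] = records
    return spill_file


buffer = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
result = spill(buffer, num_reducers=2, combiner=combine)
print(result)
```

In this sketch the combiner only ever sees records that already belong to one partition, which is the behaviour the quoted passage describes. Passing `combiner=None` leaves the partitioned, sorted records uncombined.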



Source: https://stackoverflow.com/questions/35195101/which-runs-first-combiner-or-partitioner-in-a-mapreduce-job
