Who will get a chance to execute first, Combiner or Partitioner?

[亡魂溺海] posted on 2019-12-06 01:44:36

1/ The answer is already given in this passage: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

So the partitions are created in memory first; if there is a custom combiner, it is executed in memory on the sorted output, and the result is spilled to disk at the end.

2/ A custom combiner and a custom partitioner are used only when they are specified in the driver class:

job.setCombinerClass(MyCombiner.class);
job.setPartitionerClass(MyPartitioner.class);

If no custom combiner is specified, no combiner is executed. If no custom partitioner is specified, the default partitioner, "HashPartitioner", is used (please see page 221 for that).

3/ Yes, it is possible. Don't forget that the combiner works the same way as the reducer, and the reducer can consume compressed data. If the data being consumed is compressed, that means the input file format is compressed. For that, you can specify the input format in the driver class:
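To illustrate the default case, here is a minimal plain-Java sketch of the hash partitioning rule. It has no Hadoop dependency; the class name `HashPartitionSketch`, the method `partitionFor`, and the reducer count are made up for illustration. The formula itself (clear the sign bit of the key's hash code, then take the remainder by the number of reduce tasks) follows what HashPartitioner does.

```java
// Plain-Java sketch of the default hash partitioning rule (no Hadoop dependency).
class HashPartitionSketch {

    // Clear the sign bit so the result is never negative,
    // then take the remainder by the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // hypothetical number of reducers
        for (String key : new String[] {"apple", "banana", "cherry", "apple"}) {
            // The same key always maps to the same partition.
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key, every record with a given key ends up at the same reducer, whichever mapper produced it.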

Sequence File case: job.setInputFormatClass(SequenceFileInputFormat.class);
Avro File case: job.setInputFormatClass(AvroKeyInputFormat.class); 

The direct answer to your question is => COMBINER

Details: Combiners can be viewed as mini-reducers in the map phase. They perform a local reduce on the mapper results before they are distributed further. Once the Combiner has run, its output is passed on to the Reducer for further work.

whereas

The Partitioner comes into the picture when we are working with more than one Reducer. The partitioner decides which reducer is responsible for a particular key. It basically takes the Mapper result (or the Combiner result, if a Combiner is used) and sends it to the responsible Reducer based on the key.

For a better understanding, you can refer to the following image, which I have taken from the Yahoo Developer Tutorial on Hadoop. Figure 4.6: Combiner step inserted into the MapReduce data flow

Here is the tutorial.

This is the complete MR job flow. Your 1.) and 2.) are answered here.

  1. The mapper reads the data and processes it. Its output goes to an intermediate output file.
  2. Once the mapper has finished all the key-value pairs, the intermediate output is partitioned into 'R' partitions using either the default partitioner, 'HashPartitioner', or a custom partitioner.
  3. Each partitioned file is sorted.
  4. Any optional combiner code is executed on the sorted 'R' partitions. The combiner step is executed only if a combiner is specified.
  5. Reducers reach out to the mappers and pull their appropriate partitioned files.
  6. After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
  7. Then the reducers work on their individual key-value pairs one by one.
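Steps 2 to 4 above (partition, then sort, then combine) can be sketched in plain Java as follows. This is only a simulation under stated assumptions, not Hadoop code: the class `MapSidePipelineSketch`, its methods, and the word-count style "key=count" output are hypothetical, and the map output is reduced to a list of keys that each carry an implicit count of 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simulation of the map-side order: partition -> sort -> combine.
class MapSidePipelineSketch {

    // Same rule as the default hash partitioner.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Returns, for each partition, the combined "key=count" records in key order.
    static List<List<String>> run(List<String> mapOutputKeys, int numPartitions) {
        // Step 2: assign every map output record to one of R partitions.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : mapOutputKeys) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<String> partition : partitions) {
            // Step 3: sort the records of this partition by key.
            Collections.sort(partition);

            // Step 4: combine runs of equal keys (the optional combiner step).
            List<String> combined = new ArrayList<>();
            int i = 0;
            while (i < partition.size()) {
                String key = partition.get(i);
                int count = 0;
                while (i < partition.size() && partition.get(i).equals(key)) {
                    count++;
                    i++;
                }
                combined.add(key + "=" + count);
            }
            result.add(combined);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("b", "a", "b", "c", "a", "b");
        System.out.println(run(keys, 2));
    }
}
```

The combine loop relies on the sort having already run: it only merges adjacent equal keys, which is why the combiner comes after the sort within each partition.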

Answer-3: Yes, a combiner can process compressed data. The combiner function runs on the output of the map phase and is used as a filtering or aggregating step to reduce the number of intermediate records that are passed to the reducer. In most cases the reducer class is also set as the combiner class. The difference lies in the output of these classes: the output of the combiner class is intermediate data that is passed to the reducer, whereas the output of the reducer is written to the output file on disk. The combiner for a job can be set like this:

job.setCombinerClass(CustomCombiner.class);
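To make the aggregating role concrete, here is a minimal plain-Java sketch of what a word-count style combiner does to one mapper's output. It is not Hadoop API: `CombinerSketch` and `combine` are hypothetical names, and a real combiner would be written as a Hadoop Reducer subclass.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a combiner as a local reduce over one mapper's output.
class CombinerSketch {

    // Collapses (word, 1) records into (word, localCount) records.
    static Map<String, Integer> combine(String[] mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] mapOutput = {"a", "b", "a", "a"};
        Map<String, Integer> combined = combine(mapOutput);
        // Four map output records shrink to two records sent to the reducers.
        System.out.println(mapOutput.length + " records in, " + combined.size() + " records out: " + combined);
    }
}
```

Summing works here because addition can be applied to partial results in any grouping; an operation without that property could not safely reuse the reducer as the combiner.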

The partitioner runs before the combiner.
a) The mapper processes the data.
b) A partitioner (either default or custom) then partitions the data as required, based on keys.
c) The data is then sorted by key, which is taken care of by the background threads/processes.
d) If a combiner exists, it runs next, on the output of the sort.
e) Finally comes the reducer, which sorts its input data one more time before the reduce process itself runs.

Vikas Singh

I would like to summarize the entire flow:

  1. The mapper reads the data and processes it. Its output goes to an intermediate output file.
  2. The mapper works through all the key-value pairs.
  3. The output of the mapper is first written to a memory buffer.
  4. When the buffer is about to overflow, its contents are spilled to a local directory, with the partitions created in memory first ("Within each partition, the background thread performs an in-memory sort by key"). The intermediate output is partitioned into 'R' partitions using either the default partitioner, 'HashPartitioner', or a custom partitioner.
  5. The data being spilled is divided according to the partitioner, and within each partition the result is sorted.
  6. If there is a custom combiner, it is executed in memory, and the result is spilled to disk at the end.
  7. Reducers reach out to the mappers and pull their appropriate partitioned files.
  8. After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
  9. Then the reducers work on their individual key-value pairs one by one.
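The "one more sort" in step 8 can be pictured as a merge, since each mapper's output for a given partition already arrives sorted by key. The following plain-Java sketch shows that idea under this assumption; `ReduceSideMergeSketch` and `merge` are hypothetical names, and the code says nothing about how Hadoop implements the step internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: merge already-sorted segments (one per mapper) into one sorted stream.
class ReduceSideMergeSketch {

    static List<String> merge(List<List<String>> sortedSegments) {
        // Each queue entry is {segmentIndex, positionInSegment},
        // ordered by the key found at that position.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> sortedSegments.get(a[0]).get(a[1])
                .compareTo(sortedSegments.get(b[0]).get(b[1])));
        for (int s = 0; s < sortedSegments.size(); s++) {
            if (!sortedSegments.get(s).isEmpty()) {
                heads.add(new int[] {s, 0});
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> segment = sortedSegments.get(head[0]);
            merged.add(segment.get(head[1]));
            // Advance within the segment that supplied the smallest key.
            if (head[1] + 1 < segment.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("a", "c"),
            Arrays.asList("b", "c", "d"));
        System.out.println(merge(segments));
    }
}
```

After the merge, equal keys from different mappers sit next to each other, which is what lets the reducer in step 9 process one key and all of its values at a time.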

Please point out any gaps in my understanding.
