Passing values from Mapper to Reducer

Submitted by 本小妞迷上赌 on 2019-12-24 04:28:08

Question


There is a small amount of meta-data that I obtain by looking up the current file the mapper is working on (and a few other things). I need to send this meta-data over to the reducer. Sure, I could have the mapper emit it in the <Key, Value> pair it generates, as <Key, Value + Meta-Data>, but I want to avoid that.

Also, constraining myself a little bit more, I do not want to use the DistributedCache. So, do I still have some options left? More precisely, my question is twofold:

(1) I tried setting some parameters by calling job.set(prop, value) in my mapper's configure(JobConf) and reading them with job.get() in my reducer's configure(JobConf). Sadly, I found this does not work. As an aside, I would be interested to know why it behaves this way. My main question is:

(2) How can I send a value from the mapper to the reducer in a "clean" way (if possible, within the constraints above)?
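For reference on point (1): properties set via job.set() inside a task's configure() only modify that task's local, already-deserialized copy of the JobConf, which is never shipped back to the framework or on to the reducers. The usual pattern is to set such properties in the driver, before the job is submitted. A minimal sketch using the old mapred API (the class name, property key, and value below are assumptions, not from the original post):

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MetaDriver {
    // property key used to pass the meta-data (name is an assumption)
    static final String META_KEY = "my.meta.data";

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MetaDriver.class);
        // Properties set here, before submission, are serialized into the copy
        // of the JobConf that every map and reduce task receives. A job.set(...)
        // made inside a mapper's configure() only changes that task's local
        // copy, so it can never be seen by the reducers.
        conf.set(META_KEY, "some-value");
        JobClient.runJob(conf);
    }
}
```

The reducer can then read the value in its configure(JobConf) with `String meta = job.get("my.meta.data");`. This works only for values known before submission; it cannot carry values computed per-record inside a mapper.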

EDIT (In view of response by Praveen Sripati)

To make it more concrete, here is what I want. Based on the type of data emitted, we want it stored under different files (say, data d1 ends up in D1 and data d2 in D2).

The values D1 and D2 can be read from a config file, and figuring out what goes where depends on the value of map.input.file. That is, the pair <k1, d1>, after some processing, should go to D1, and <k2, d2> should go to D2. I do not want to emit things like <k1, d1+D1>. Can I somehow figure out the association without emitting D1 or D2, perhaps by cleverly using the config file? The input source (i.e., the input directory) for <k1, d1> and <k2, d2> is the same, which again can be determined only through map.input.file.

Please let me know when you get time.

Regards
-Akash


Answer 1:


Based on the type of data emitted we want it stored under different directories (say data d1 ends up in D1 and data d2 ends up in D2).

Usually the output of an MR job goes to a single output folder, with each mapper/reducer writing to a separate file. I am not sure how to write an MR job's output to different directories without changes to the Hadoop framework.

But the output file can be chosen based on the output key/value from the mapper/reducer. Use a subclass of MultipleOutputFormat: implement the MultipleOutputFormat#generateFileNameForKeyValue method to return a file name based on the key (and/or value).
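A minimal sketch of such a subclass, using MultipleTextOutputFormat from the old mapred API and the D1/D2 naming from the question. How the type tag is encoded in the value (here, as a `"type,payload"` prefix) and the class name are assumptions:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to a subdirectory of the job output directory based on
// the type tag carried in the value ("d1" records go under D1/, others to D2/).
public class TypeRoutingOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    // pure routing rule, kept separate so it can be checked without a cluster
    static String dirFor(String type) {
        return "d1".equals(type) ? "D1" : "D2";
    }

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        String type = value.toString().split(",", 2)[0]; // assumed "type,payload" layout
        return dirFor(type) + "/" + name;                // e.g. "D1/part-00000"
    }
}
```

Register it in the driver with `conf.setOutputFormat(TypeRoutingOutputFormat.class);`. The returned string is treated as a path relative to the job output directory, so the `"D1/"` prefix produces a subdirectory rather than a separate top-level directory.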

See how PartitionByStationUsingMultipleOutputFormat is implemented in the sample code for the book Hadoop: The Definitive Guide.

Once the job has completed, the output can easily be moved to a different directory using hadoop fs commands.



Source: https://stackoverflow.com/questions/8950851/passing-values-from-mapper-to-reducer
