问题
Please clarify
I have set of input files (say 10) with specific names. I run word count job on all files at once (input path is folder). I am expecting 10 output files with same names as input files. I.e. File1 input should be counted and should be stored in a separate output file with "file1" name. And so on to all files.
回答1:
There are 2 approaches you can take to achieve multiple outputs
Use MultipleOutputs class - refer this document for information about multipleclassoutput (https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html) , for more information about how to implement refer this http://appsintheopen.com/posts/44-map-reduce-multiple-outputs
Another option is using LazyOuputFormat, however, this is used in conjunction with multipleoutputs, for more information about its implementation refer this ( https://ssmolen.wordpress.com/2014/07/09/hadoop-mapreduce-write-output-to-multiple-directories-depending-on-the-reduce-key/ ).
I feel using LazyOutputFormat in conjunction with MultipleOuputs class is better approach.
回答2:
Set the number of reduce tasks to be equal to the number of input files. This will create the given number of output files, as well.
Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in file named "file0.txt" you can emit the key "0_cat", or "file0_cat", or anything else that is unique for "file0.txt". Use the context to get each time the filename.
Override the default Partitioner, to make sure that all the map output keys with prefix "0_", or "file0_" will go to the first partition, all the keys with prefix "1_", or "file1_" will go to the second, etc.
In the reducer, remove the "x_" or "filex_" prefix from the output key and use it as the name of the output file (using MultipleOutputs). Otherwise, if you don't want MultipleOutputs, you can easily do the mapping between outputfiles and input files by checking your Partitioner code. (e.g., part-00000 will be the partition 0's output)
来源:https://stackoverflow.com/questions/39559830/mapreduce-one-to-one-processing-of-multiple-input-files