Hadoop mapper reading from 2 different source input files

一曲冷凌霜 submitted on 2020-01-02 07:23:08

Question


I have a tool that chains a lot of Mappers & Reducers, and at some point I need to merge the results of previous map-reduce steps. For example, as input I have two files with data:

/input/a.txt
apple,10
orange,20

/input/b.txt
apple;5
orange;40

The result should be c.txt, where c.value = a.value * b.value:

/output/c.txt
apple,50   // 10 * 5
orange,800 // 20 * 40

How could this be done? I've solved it by introducing a simple Key => MyMapWritable (type=1|2, value) and merging (actually, multiplying) the data in the reducers. It works, but:

  1. I have a feeling it could be done more simply (it smells bad).
  2. Is it possible to know inside the Mapper exactly which file provided the current record (a.txt or b.txt)? For now I've just used different separators: comma & semicolon :(
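For reference, the tag-and-join approach described above can be simulated outside Hadoop in plain Java. The record formats and separators mirror the question, but the class and method names here are purely illustrative sketches, not the actual MyMapWritable implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class TagJoinSketch {

    // Join two record sets by key and multiply their values.
    // Lines from a.txt use ',' as separator, lines from b.txt use ';'
    // (the workaround described in the question).
    static Map<String, Long> multiplyJoin(String[] fileA, String[] fileB) {
        Map<String, Long> result = new HashMap<>();
        // "Map" phase: parse each record into (key, value)
        for (String line : fileA) {
            String[] kv = line.split(",");
            result.merge(kv[0], Long.parseLong(kv[1]), (x, y) -> x * y);
        }
        // "Reduce" phase: a second value for the same key is multiplied in
        for (String line : fileB) {
            String[] kv = line.split(";");
            result.merge(kv[0], Long.parseLong(kv[1]), (x, y) -> x * y);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> c = multiplyJoin(
                new String[]{"apple,10", "orange,20"},
                new String[]{"apple;5", "orange;40"});
        // Emits apple,50 and orange,800 (HashMap iteration order is unspecified)
        c.forEach((k, v) -> System.out.println(k + "," + v));
    }
}
```

In a real job, the grouping done here by the HashMap is what the shuffle phase provides for free: both mappers emit under the same key, and the reducer multiplies the values it receives.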

回答1:


Assuming they have been partitioned and sorted in the same way, you can use CompositeInputFormat to perform a map-side join. There's an article on using it here. I don't think it's been ported to the new mapreduce API, though.

Secondly, you can get the input file in the mapper by calling context.getInputSplit(). This returns the InputSplit which, if you're using TextInputFormat, you can cast to a FileSplit and then call getPath() to get the file name. I don't think you can use this method with CompositeInputFormat, though, as you won't know which source the Writables in the TupleWritable came from.
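The map-side join relies on both inputs arriving sorted by the same key; the streaming merge it performs per partition can be sketched in plain Java with no Hadoop dependency. The Pair record and join method below are an illustrative sketch under the assumption that both lists are already key-sorted, not part of CompositeInputFormat's API:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedMergeJoin {

    // One (key, value) record from an input file.
    record Pair(String key, long value) {}

    // Merge two key-sorted lists, emitting key -> valueA * valueB for matching
    // keys. This two-pointer merge is the essence of a map-side join: no
    // shuffle is needed because both sides are sorted the same way.
    static List<Pair> join(List<Pair> a, List<Pair> b) {
        List<Pair> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).key().compareTo(b.get(j).key());
            if (cmp == 0) {
                out.add(new Pair(a.get(i).key(), a.get(i).value() * b.get(j).value()));
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // key present only in a: no match, skip it
            } else {
                j++; // key present only in b: no match, skip it
            }
        }
        return out;
    }
}
```

If either input is not sorted (or is partitioned differently), this merge silently drops matches, which is exactly why the answer's "partitioned and sorted in the same way" precondition matters.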




Answer 2:


// Determine which input file the current record came from
String fileName = ((FileSplit) context.getInputSplit()).getPath().toString();

if (fileName.contains("file_1")) {
    // TODO: handle records from file 1
} else {
    // TODO: handle records from file 2
}
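Building on that snippet, the per-file branching can be isolated in a small, testable helper. The file names and separators below come from the question; the method itself is only an illustrative sketch, not part of any Hadoop API:

```java
public class RecordParser {

    // Choose the field separator based on which input file the record came
    // from: a.txt uses ',' and b.txt uses ';' (as in the question). In a
    // mapper, fileName would come from
    // ((FileSplit) context.getInputSplit()).getPath().toString().
    static String[] parse(String fileName, String line) {
        if (fileName.endsWith("a.txt")) {
            return line.split(",");
        } else {
            return line.split(";");
        }
    }
}
```

Once the file name is available this way, both inputs could use the same separator and the mapper could tag each value by source instead, removing the comma/semicolon workaround the asker disliked.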


Source: https://stackoverflow.com/questions/11495193/hadoop-mapper-reading-from-2-different-source-input-files
