Hadoop mapper reading from 2 different source input files

一曲冷凌霜 submitted on 2020-01-02 07:23:08

Question


I have a tool that chains a lot of Mappers & Reducers, and at some point I need to merge the results of previous map-reduce steps. For example, as input I have two files with data:

/input/a.txt
apple,10
orange,20

/input/b.txt
apple;5
orange;40

The result should be c.txt, where c.value = a.value * b.value:

/output/c.txt
apple,50   // 10 * 5
orange,800 // 20 * 40

How could this be done? I've solved it by introducing a simple Key => MyMapWritable (type=1|2, value) and merging (actually, multiplying) the data in the reducers. It works, but:

  1. I have a feeling it could be done more simply (it smells bad).
  2. Is it possible to know inside the Mapper exactly which file provided the current record (a.txt or b.txt)? For now I've just used different separators: comma & semicolon :(
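For reference, the tag-and-join approach described above can be simulated outside Hadoop in plain Java. The record formats and separators mirror the question, but the class and method names here are purely illustrative sketches, not the actual MyMapWritable implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class TagJoinSketch {

    // Join two record sets by key and multiply their values.
    // Lines from a.txt use ',' as separator, lines from b.txt use ';'
    // (the workaround described in the question).
    static Map<String, Long> multiplyJoin(String[] fileA, String[] fileB) {
        Map<String, Long> result = new HashMap<>();
        // "Map" phase: parse each record into (key, value)
        for (String line : fileA) {
            String[] kv = line.split(",");
            result.merge(kv[0], Long.parseLong(kv[1]), (x, y) -> x * y);
        }
        // "Reduce" phase: a second value for the same key is multiplied in
        for (String line : fileB) {
            String[] kv = line.split(";");
            result.merge(kv[0], Long.parseLong(kv[1]), (x, y) -> x * y);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> c = multiplyJoin(
                new String[]{"apple,10", "orange,20"},
                new String[]{"apple;5", "orange;40"});
        // Emits apple,50 and orange,800 (HashMap iteration order is unspecified)
        c.forEach((k, v) -> System.out.println(k + "," + v));
    }
}
```

In a real job, the grouping done here by the HashMap is what the shuffle phase provides for free: both mappers emit under the same key, and the reducer multiplies the values it receives.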

回答1:


Assuming they have been partitioned and sorted in the same way, you can use CompositeInputFormat to perform a map-side join. There's an article on using it here. I don't think it's been ported to the new mapreduce API, though.

Secondly, you can get the input file in the mapper by calling context.getInputSplit(). This returns the InputSplit which, if you're using TextInputFormat, you can cast to a FileSplit and then call getPath() to get the file name. I don't think you can use this method with CompositeInputFormat, though, as you won't know which source the Writables in the TupleWritable came from.
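The map-side join relies on both inputs arriving sorted by the same key; the streaming merge it performs per partition can be sketched in plain Java with no Hadoop dependency. The Pair record and join method below are an illustrative sketch under the assumption that both lists are already key-sorted, not part of CompositeInputFormat's API:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedMergeJoin {

    // One (key, value) record from an input file.
    record Pair(String key, long value) {}

    // Merge two key-sorted lists, emitting key -> valueA * valueB for matching
    // keys. This two-pointer merge is the essence of a map-side join: no
    // shuffle is needed because both sides are sorted the same way.
    static List<Pair> join(List<Pair> a, List<Pair> b) {
        List<Pair> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).key().compareTo(b.get(j).key());
            if (cmp == 0) {
                out.add(new Pair(a.get(i).key(), a.get(i).value() * b.get(j).value()));
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // key present only in a: no match, skip it
            } else {
                j++; // key present only in b: no match, skip it
            }
        }
        return out;
    }
}
```

If either input is not sorted (or is partitioned differently), this merge silently drops matches, which is exactly why the answer's "partitioned and sorted in the same way" precondition matters.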




Answer 2:


// Determine which input file the current record came from
String fileName = ((FileSplit) context.getInputSplit()).getPath().toString();

if (fileName.contains("file_1")) {
    // TODO: handle records from file 1
} else {
    // TODO: handle records from file 2
}
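Building on that snippet, the per-file branching can be isolated in a small, testable helper. The file names and separators below come from the question; the method itself is only an illustrative sketch, not part of any Hadoop API:

```java
public class RecordParser {

    // Choose the field separator based on which input file the record came
    // from: a.txt uses ',' and b.txt uses ';' (as in the question). In a
    // mapper, fileName would come from
    // ((FileSplit) context.getInputSplit()).getPath().toString().
    static String[] parse(String fileName, String line) {
        if (fileName.endsWith("a.txt")) {
            return line.split(",");
        } else {
            return line.split(";");
        }
    }
}
```

Once the file name is available this way, both inputs could use the same separator and the mapper could tag each value by source instead, removing the comma/semicolon workaround the asker disliked.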


Source: https://stackoverflow.com/questions/11495193/hadoop-mapper-reading-from-2-different-source-input-files
