Hadoop to reduce from multiple input formats

Submitted by China☆狼群 on 2019-12-09 19:21:46

Question


I have two files with different data formats in HDFS. What would the job setup look like if I needed to reduce across both data files?

e.g. imagine the common word count problem, where in one file you have space as the word delimiter and in the other file the underscore. In my approach I need different mappers for the various file formats, which then feed into a common reducer.

How to do that? Or is there a better solution than mine?


Answer 1:


Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: for each input path you pass in the InputFormat and, optionally, the Mapper class.
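A driver using MultipleInputs might look like the sketch below. The mapper and reducer class names (SpaceTokenMapper, UnderscoreTokenMapper, WordCountReducer) are hypothetical placeholders for your own implementations; the point is that each input path gets its own mapper, while a single reducer sees the merged output.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiFormatWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-format word count");
        job.setJarByClass(MultiFormatWordCount.class);

        // One mapper per input path/format; both emit the same <Text, IntWritable> pairs
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, SpaceTokenMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, UnderscoreTokenMapper.class);

        // A single reducer sums the counts coming from both mappers
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper normalizes its own format into a common key/value shape, so the reducer never needs to know which file a record came from.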

If you are looking for code examples on Google, search for "reduce-side join", which is where this method is typically used.


On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space delimited and another that is underscore delimited, load both with the same mapper and TextInputFormat and tokenize on both possible delimiters. Count the number of tokens in the two result sets. In the word count example, pick the one with more tokens.
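The delimiter-sniffing step can be sketched as a small helper (the class name is made up for illustration): split the line both ways and keep whichever split produced more tokens, since the wrong delimiter leaves the line as one big token.

```java
// Hypothetical helper for the "tokenize on both delimiters" hack described above.
public class DelimiterSniffer {
    // Split on both candidate delimiters and return the split with more tokens.
    public static String[] tokenize(String line) {
        String[] bySpace = line.split(" ");
        String[] byUnderscore = line.split("_");
        return bySpace.length >= byUnderscore.length ? bySpace : byUnderscore;
    }
}
```

A single mapper can call this on every input line and emit the resulting words, regardless of which file the line came from.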

This also works if both files use the same delimiter but have a different number of standard columns. You can tokenize on comma and then see how many tokens there are. If there are, say, 5 tokens, the line is from data set A; if there are 7 tokens, it is from data set B.
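The column-count variant reduces to counting fields per line. A minimal sketch, assuming the 5-column/7-column schemas from the example (class name and the "A"/"B" labels are placeholders):

```java
// Hypothetical classifier for the column-count hack: 5 comma-separated
// fields means data set A, 7 means data set B.
public class RecordClassifier {
    public static String classify(String line) {
        // The -1 limit keeps trailing empty fields, so "a,b,,," still counts 5 columns.
        int columns = line.split(",", -1).length;
        if (columns == 5) return "A";
        if (columns == 7) return "B";
        return "UNKNOWN";
    }
}
```

The mapper can then branch on the returned label to parse each record with the right schema.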



Source: https://stackoverflow.com/questions/10213791/hadoop-to-reduce-from-multiple-input-formats
