Is the input to a Hadoop reduce function complete with regards to its key?

[亡魂溺海] submitted on 2020-01-06 07:17:07

Question


I'm looking at solutions to a problem that involves reading keyed data from more than one file. In a single map step I need all the values for a particular key in the same place at the same time. White's book discusses "the shuffle", and I'm tempted to wonder: once the merge is done and the input to a reducer is sorted by key, is all the data for a given key there? Can I count on that?

The bigger picture is that I want to do a poor man's triple-store federation, and the triples I want to load into an in-memory store don't all come from the same file. It's a vertical (?) partition in which the values for a particular key are in different files. Put another way, the columns of a complete record each come from different files. Does Hadoop reassemble that, at least for a single key at a time?


Answer 1:


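In short: yes. In a Hadoop job, the partitioner chooses which reducer receives which (key, value) pairs. Quote from the Yahoo tutorial section on partitioning: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same". Since every pair with a given key lands in the same partition, and the shuffle merges and sorts that partition by key before reducing, a single reduce call sees all the values for its key, no matter which input file they came from. This guarantee is also necessary for many of the types of algorithms typically solved with MapReduce (such as distributed sorting, which is what you're describing).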
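To make the guarantee concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that simulates the partition-and-group step. The class name `ShuffleSketch`, its methods, and the sample data are hypothetical and only for illustration. The partition formula mirrors what Hadoop's default `HashPartitioner` does. The sketch assumes the default partitioner and default key grouping; a custom partitioner or grouping comparator would change the behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {

    // Same formula as Hadoop's default HashPartitioner: the result depends
    // only on the key, never on which mapper emitted the pair.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Simulated shuffle: route every (key, value) pair from every mapper
    // output to its partition, then group values by key in sorted key order.
    // Each pair is a two-element array: {key, value}.
    static List<TreeMap<String, List<String>>> shuffle(
            List<List<String[]>> mapperOutputs, int numReducers) {
        List<TreeMap<String, List<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (List<String[]> output : mapperOutputs) {
            for (String[] pair : output) {
                int p = partitionFor(pair[0], numReducers);
                partitions.get(p)
                          .computeIfAbsent(pair[0], k -> new ArrayList<>())
                          .add(pair[1]);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical vertically partitioned input: each "file" holds one
        // column of the same records, keyed by subject id.
        List<String[]> namesFile = List.of(
                new String[]{"s1", "name=Alice"}, new String[]{"s2", "name=Bob"});
        List<String[]> agesFile = List.of(
                new String[]{"s1", "age=30"}, new String[]{"s2", "age=41"});
        List<String[]> citiesFile = List.of(
                new String[]{"s2", "city=Oslo"}, new String[]{"s1", "city=Rome"});

        List<TreeMap<String, List<String>>> partitions =
                shuffle(List.of(namesFile, agesFile, citiesFile), 2);

        // Each map entry corresponds to one reduce call: one key, all its values.
        for (int p = 0; p < partitions.size(); p++) {
            for (Map.Entry<String, List<String>> e : partitions.get(p).entrySet()) {
                System.out.println("reducer " + p + ": " + e.getKey() + " -> " + e.getValue());
            }
        }
    }
}
```

In a real job the reducer would receive these grouped values as an `Iterable` and could assemble the full record there. Note that Hadoop does not guarantee the order of values within a key unless you arrange a secondary sort, so each value should carry enough information (here, the column name) to identify which file it came from.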



Source: https://stackoverflow.com/questions/8219350/is-the-input-to-a-hadoop-reduce-function-complete-with-regards-to-its-key
