MapReduce output files: part-r-* and part-*

喜夏-厌秋 submitted on 2019-12-11 18:43:23

Question


I have some questions about MapReduce output part files.

    1> What is the difference between part-r-* files and part-* files in MapReduce output? Is part-r-* the output of the mapper and part-* the output of the reducer?
    2> If the reducer doesn't produce any results, is the mapper output kept or deleted?

Answer 1:


Normally, part-r-* comes from the reducer. MultipleOutputs lets you use a different naming convention. If there is no reduce step, the output will be part-m-*. As I understand it, if a reducer is defined, the mapper outputs are deleted regardless of whether the reducers produce anything. The reducer output files are usually created even if they are empty, unless you use LazyOutputFormat. Where did you find part-* files that did not end with either m-nnnnn or r-nnnnn?
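To make the rule above concrete, here is a minimal sketch of which part files a new-API job would be expected to leave in its output directory. It is not Hadoop code: the class `ExpectedPartFiles` and the method `expected` are hypothetical names used only for illustration, and the sketch assumes the default naming (no MultipleOutputs, no LazyOutputFormat, no compression extension).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: lists the part file names one
// would expect from a new-API job under the default naming convention.
final class ExpectedPartFiles {

    static List<String> expected(int numMapTasks, int numReduceTasks) {
        // With zero reducers the job is map-only, so the mappers write the
        // final output; otherwise only the reducers write part files.
        boolean mapOnly = numReduceTasks == 0;
        char type = mapOnly ? 'm' : 'r';
        int count = mapOnly ? numMapTasks : numReduceTasks;

        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // Task number is zero-padded to five digits, e.g. part-r-00000.
            names.add(String.format("part-%c-%05d", type, i));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(expected(3, 2)); // job with two reducers
        System.out.println(expected(3, 0)); // map-only job
    }
}
```

With two reducers the sketch lists only part-r-* names, and with zero reducers only part-m-* names, which matches the behaviour described in this answer.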




Answer 2:


In old versions (< 0.20), jobs wrote only part-nnnnn files (for example part-00000). Now we see both part-m-nnnnn (nnnnn being the task number, for example part-m-00000) and part-r-nnnnn files. part-r-nnnnn is the output from the reducers. part-m-nnnnn is the output from the mappers of a job that has no reduce phase. Combiners do not write part files of their own, so a job with reducers leaves only part-r-nnnnn files in its output directory.




回答3:


part-00000 is the name given to output files written by mappers or reducers in the old API. In the new API this changed slightly, to part-m-* for mapper output and part-r-* for reducer output. For more details, refer to Hadoop: The Definitive Guide from O'Reilly, page 28.
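When inspecting an output directory, the old and new naming schemes can be told apart by file name alone. The following is a rough sketch of such a check. `PartFileClassifier` and its `Kind` enum are hypothetical names, not Hadoop classes, and the patterns assume default file names without a compression extension such as .gz.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: guesses which API and task type
// produced a part file, based only on its name.
final class PartFileClassifier {

    enum Kind { NEW_API_REDUCE, NEW_API_MAP, OLD_API, OTHER }

    // Task numbers are zero-padded to at least five digits.
    private static final Pattern REDUCE = Pattern.compile("part-r-\\d{5,}");
    private static final Pattern MAP = Pattern.compile("part-m-\\d{5,}");
    private static final Pattern OLD = Pattern.compile("part-\\d{5,}");

    static Kind classify(String fileName) {
        if (REDUCE.matcher(fileName).matches()) {
            return Kind.NEW_API_REDUCE;
        }
        if (MAP.matcher(fileName).matches()) {
            return Kind.NEW_API_MAP;
        }
        if (OLD.matcher(fileName).matches()) {
            // Old-API names do not say whether a mapper or a reducer wrote them.
            return Kind.OLD_API;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"part-r-00000", "part-m-00003", "part-00000", "_SUCCESS"}) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```

Anything that matches none of the three patterns, such as a file with a custom base name from MultipleOutputs, falls into `OTHER` in this sketch.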



Source: https://stackoverflow.com/questions/10924852/map-reduce-output-files-part-r-and-part
