MapReduce output files: part-r-* and part-*

Question

I have some questions about MapReduce output part files.

Answer 1:

Normally, part-r-* files are written by the reducers. MultipleOutputs lets you use a different naming convention. If the job has no reduce step, the output files are named part-m-* instead. As I understand it, if a reducer is defined, the intermediate mapper outputs are deleted regardless of whether the reducers produce anything. The reducer output files are usually created even if they are empty, unless you use LazyOutputFormat. Where did you find part-* files that did not end with either m-nnnnn or r-nnnnn?
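To make the naming convention concrete, here is a minimal Python sketch that builds the name a map or reduce task would give its output file. The function `part_file_name` is hypothetical and is not part of Hadoop; the five-digit zero padding is an assumption taken from names such as part-m-00000 and part-r-00000 seen in job output directories.

```python
def part_file_name(task_type: str, partition: int, base: str = "part") -> str:
    """Build a new-API style output file name such as 'part-r-00000'.

    task_type: 'm' for a map task (map-only job), 'r' for a reduce task.
    partition: zero-based task/partition number.
    base:      file name prefix; 'part' is the default seen in job output.
    """
    if task_type not in ("m", "r"):
        raise ValueError("task_type must be 'm' or 'r'")
    if partition < 0:
        raise ValueError("partition must be non-negative")
    # Assumption: partition numbers are zero-padded to at least five digits.
    return f"{base}-{task_type}-{partition:05d}"
```

For example, `part_file_name("r", 0)` gives the name of the first reducer's file, and `part_file_name("m", 12)` the name for the thirteenth mapper in a map-only job.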

Answer 2:

Old versions (before 0.20) wrote only part-000* files. Now we see both part-m-n* and part-r-n* files, where n* is the task number (for example, part-m-00000). part-r-n* files are the output of the reducers. part-m-n* files are written by mappers, not by combiners: a combiner only pre-aggregates map output before the shuffle and writes no part files of its own. part-m-n* files appear in the output directory when the job runs without a reduce phase.

Answer 3:

part-00000 is the name given to output files (not directories) created by mappers or reducers in the old API. In the new API this changed slightly to part-m-* for mapper output and part-r-* for reducer output. For more details, refer to Hadoop: The Definitive Guide from O'Reilly, page 28.
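If you need to tell the three forms apart when scanning an output directory, a small script can do it. The following is only a sketch: `classify_part_file` is a hypothetical helper, not a Hadoop API, and it assumes plain file names with no compression extension (such as .gz) and at least five digits in the task number.

```python
import re

# New API: part-m-NNNNN (mapper output) or part-r-NNNNN (reducer output).
_NEW_API = re.compile(r"part-([mr])-(\d{5,})")
# Old API: part-NNNNN, with no indication of which task type wrote it.
_OLD_API = re.compile(r"part-(\d{5,})")


def classify_part_file(name):
    """Return (api, task_kind, task_number), or None if name is not a part file."""
    match = _NEW_API.fullmatch(name)
    if match:
        task_kind = "map" if match.group(1) == "m" else "reduce"
        return ("new", task_kind, int(match.group(2)))
    match = _OLD_API.fullmatch(name)
    if match:
        # The old-style name alone does not say whether a mapper or reducer wrote it.
        return ("old", None, int(match.group(1)))
    return None
```

Names that match neither pattern, such as marker or log entries in the output directory, come back as `None`, so the caller can skip them.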

Source: https://stackoverflow.com/questions/10924852/map-reduce-output-files-part-r-and-part

Tags

Hadoop

MapReduce