Question
Let's say we have N input files, each with a different number of lines. We need to generate output files such that each output file has exactly K lines (except the last one, which can have fewer than K records).
- Is it possible to do this using a single MR job?
- We should open the files for writing explicitly in the reducer.
- The records in the output should be shuffled.
thanks,
Paramesh
Answer 1:
Assume the input has 990 records, which must be split into 9 files of 100 records each plus a final file of 90 records, for a total of 10 files.
Use NLineInputFormat and set mapred.line.input.format.linespermap to 100, so that each mapper processes 100 lines of the input data set. Set the number of reducers to 10, which is the number of output files.
In the mapper, emit a key between 1 and 10 (one per output file) and emit the input record as the value. Make sure that the keys emitted by the mappers are spread evenly across 1 to 10 and not skewed.
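One way to keep the keys balanced is to cycle through 1..10 for each record a mapper sees. The sketch below is a plain-Java simulation of that assignment, not Hadoop code: the class name BalancedKeys and both methods are hypothetical, and round-robin is an assumption, since the answer does not say how to balance the keys.

```java
// Plain-Java simulation of the key assignment; no Hadoop classes are used.
// BalancedKeys, keyFor and countsPerKey are hypothetical names for illustration.
class BalancedKeys {
    // Key for the i-th record (0-based) seen by one mapper, cycling 1..numReducers.
    static int keyFor(long recordIndex, int numReducers) {
        return (int) (recordIndex % numReducers) + 1;
    }

    // Counts how many records each key receives when every split of
    // linesPerMap records is numbered independently, as separate mappers would do.
    static int[] countsPerKey(int totalRecords, int linesPerMap, int numReducers) {
        int[] counts = new int[numReducers + 1]; // index 0 is unused
        for (int start = 0; start < totalRecords; start += linesPerMap) {
            int end = Math.min(start + linesPerMap, totalRecords);
            for (int i = 0; i < end - start; i++) {
                counts[keyFor(i, numReducers)]++;
            }
        }
        return counts;
    }
}
```

For the 990-record example, countsPerKey(990, 100, 10) gives 99 records for every key. Round-robin keys therefore spread the load evenly across the reducers, but they do not by themselves make the first nine files hold exactly 100 lines and the last one 90.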
Answer 2:
A different approach is to use a single reducer together with MultipleOutputFormat to generate multiple output files. In that reducer, keep a counter of the records written so far and switch to a new output file name every K records.
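The counter logic can be sketched in plain Java, without any Hadoop classes. Everything here is an assumption for illustration: the class name FixedLineChunker, the part-NNNNN naming scheme, and the in-memory shuffle, which stands in for whatever randomization the real job would use. An in-memory map plays the role of the output files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Simulates a single reducer that rolls over to a new file every k records.
class FixedLineChunker {
    // Hypothetical naming scheme for the output files.
    static String fileName(int index) {
        return String.format("part-%05d", index);
    }

    // Returns file name -> lines, in creation order. Every file holds exactly
    // k lines except possibly the last one, which holds the remainder.
    static Map<String, List<String>> chunk(List<String> records, int k, long seed) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Shuffle a copy so the caller's list is left untouched.
        List<String> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed));

        Map<String, List<String>> files = new LinkedHashMap<>();
        List<String> current = null;
        int count = 0;     // records written so far
        int fileIndex = 0; // next file number
        for (String record : shuffled) {
            if (count % k == 0) {
                // The counter hit a multiple of k: switch to a new output file.
                current = new ArrayList<>();
                files.put(fileName(fileIndex++), current);
            }
            current.add(record);
            count++;
        }
        return files;
    }
}
```

With 990 records and k = 100, chunk returns ten entries: nine lists of 100 lines and a final list of 90. In a real job the same counter would sit in the reducer and drive the file name passed to the multiple-output mechanism. The cost of this design is that a single reducer writes all the output serially.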
Source: https://stackoverflow.com/questions/20575912/how-to-create-output-files-with-fixed-number-of-lines-in-hadoop-map-reduce