How to create output files with a fixed number of lines in Hadoop/MapReduce?

Question

Let's say we have N input files, each with a different number of lines. We need to generate output files such that each output file has exactly K lines (except the last one, which can have fewer than K records).

Is it possible to do this using a single MR job?
We should open the files for writing explicitly in the reducer.
The records in the output should be shuffled.

thanks,
Paramesh

Answer 1:

Assume the input has 990 records, which have to be split into 9 files of 100 records each plus a last file of 90 records, for a total of 10 files.
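The arithmetic behind that example can be sketched in plain Java (no Hadoop classes involved). The class and method names here are made up for illustration. Note that fixing the reducer count up front, as suggested below, assumes the total record count is already known before the job is configured.

```java
// Arithmetic sketch: how many output files a fixed lines-per-file value implies.
// FileCountMath, fileCount and lastFileSize are hypothetical names for illustration.
final class FileCountMath {

    // Number of output files needed: ceiling of totalRecords / linesPerFile.
    static long fileCount(long totalRecords, long linesPerFile) {
        if (totalRecords < 0 || linesPerFile <= 0) {
            throw new IllegalArgumentException("need totalRecords >= 0 and linesPerFile > 0");
        }
        return (totalRecords + linesPerFile - 1) / linesPerFile;
    }

    // Records in the last file; equals linesPerFile when the total divides evenly.
    static long lastFileSize(long totalRecords, long linesPerFile) {
        if (totalRecords < 0 || linesPerFile <= 0) {
            throw new IllegalArgumentException("need totalRecords >= 0 and linesPerFile > 0");
        }
        if (totalRecords == 0) {
            return 0;
        }
        long remainder = totalRecords % linesPerFile;
        return remainder == 0 ? linesPerFile : remainder;
    }

    public static void main(String[] args) {
        System.out.println("files: " + fileCount(990, 100));
        System.out.println("last file: " + lastFileSize(990, 100));
    }
}
```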

Use NLineInputFormat and set mapred.line.input.format.linespermap to 100 (newer Hadoop releases name this property mapreduce.input.lineinputformat.linespermap). This way each mapper will process 100 lines from the input data set. Set the number of reducers to 10, which is the number of output files.

In the mapper, emit a key between 1 and 10 (the number of output files) and emit the input record as the value. Make sure the keys emitted by the mappers are balanced across 1 to 10 and not skewed.
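One possible way to keep the keys balanced is round-robin assignment. The answer does not say how to balance them, so round-robin is an assumption here, and the sketch below is a plain-Java model of the key logic only, not Hadoop mapper code. In a real job each mapper would likely keep its own counter. Worth noting: with 990 records spread evenly over 10 keys, each key receives 99 records, so balanced keys alone give equal-sized files rather than the 100/.../90 split from the example.

```java
import java.util.Arrays;

// Plain-Java model of balanced key assignment; no Hadoop classes are used.
// BalancedKeys and its methods are hypothetical names for illustration.
final class BalancedKeys {

    // Key in 1..numKeys for the record at recordIndex (0-based), assigned round-robin.
    static int roundRobinKey(long recordIndex, int numKeys) {
        if (recordIndex < 0 || numKeys <= 0) {
            throw new IllegalArgumentException("need recordIndex >= 0 and numKeys > 0");
        }
        return (int) (recordIndex % numKeys) + 1;
    }

    // Counts how many records land on each key; array index 0 holds the count for key 1.
    static int[] recordsPerKey(long totalRecords, int numKeys) {
        int[] counts = new int[numKeys];
        for (long i = 0; i < totalRecords; i++) {
            counts[roundRobinKey(i, numKeys) - 1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the per-key record counts for the 990-record example.
        System.out.println(Arrays.toString(recordsPerKey(990, 10)));
    }
}
```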

Answer 2:

A different approach is to use a single reducer and MultipleOutputFormat to generate multiple output files. In that reducer, you can simply keep a counter and change the output file name each time the counter reaches K.
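The counter idea can be modeled in plain Java as below. This is only a sketch of the rollover logic: it collects lines in memory instead of writing through Hadoop, and the class name and the out-00000 style file names are hypothetical. In an actual reducer, the write call would go through whatever multiple-output mechanism the job uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the single-reducer idea; no Hadoop classes are used.
final class RollingOutput {
    private final int linesPerFile;                  // K from the question
    private final Map<String, List<String>> files = new LinkedHashMap<>();
    private int linesInCurrent = 0;                  // the counter from the answer
    private int fileIndex = 0;

    RollingOutput(int linesPerFile) {
        if (linesPerFile <= 0) {
            throw new IllegalArgumentException("linesPerFile must be positive");
        }
        this.linesPerFile = linesPerFile;
    }

    // Hypothetical naming scheme, chosen for illustration only.
    private String currentName() {
        return String.format(Locale.ROOT, "out-%05d", fileIndex);
    }

    // Appends one record, switching to a new file name once the current one holds K lines.
    void write(String record) {
        if (linesInCurrent == linesPerFile) {
            fileIndex++;
            linesInCurrent = 0;
        }
        files.computeIfAbsent(currentName(), name -> new ArrayList<>()).add(record);
        linesInCurrent++;
    }

    // File name to lines, in creation order.
    Map<String, List<String>> files() {
        return files;
    }

    public static void main(String[] args) {
        RollingOutput out = new RollingOutput(100);
        for (int i = 0; i < 990; i++) {
            out.write("record-" + i);
        }
        out.files().forEach((name, lines) -> System.out.println(name + ": " + lines.size()));
    }
}
```

Because the file is switched lazily (only when the next record arrives), no empty trailing file is created when the total is an exact multiple of K.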

Source: https://stackoverflow.com/questions/20575912/how-to-create-output-files-with-fixed-number-of-lines-in-hadoop-map-reduce

Tags

Hadoop

MapReduce
