Question
I have many small input files, and I want to combine them using an input format such as CombineFileInputFormat so that fewer mapper tasks are launched. I know I can do this with the Java API, but I don't know whether the streaming jar supports it when I'm using Hadoop streaming.
Answer 1:
Hadoop streaming uses TextInputFormat by default, but any other input format can be used, including CombineFileInputFormat. You can change the input format from the command line with the -inputformat option. Be sure to use the old API: org.apache.hadoop.mapred.lib.CombineFileInputFormat is an abstract class, so extend it with your own class and pass that class name to -inputformat. The new API (org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat) isn't supported yet.
Generic options such as -D must come before streaming options such as -inputformat:
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar \
-Dmapred.max.split.size=524288000 \
-Dstream.map.input.ignoreKey=true \
-inputformat foo.bar.MyCombineFileInputFormat \
...
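To get a feel for what mapred.max.split.size=524288000 (500 MiB) means for the mapper count, here is a rough sketch in plain Java. It is a simplified greedy model, not Hadoop's actual algorithm: the real CombineFileInputFormat also considers block boundaries and node/rack locality, so real split counts can differ. The class name SplitEstimate and the method countSplits are made up for this illustration, and the model assumes every file is smaller than the maximum split size.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; it does not call any Hadoop API.
class SplitEstimate {

    // Greedily packs files, in order, into splits of at most maxSplitBytes and
    // returns the number of splits, which is the number of map tasks launched.
    // Assumes each individual file is no larger than maxSplitBytes.
    static int countSplits(List<Long> fileSizes, long maxSplitBytes) {
        int splits = 0;
        long current = 0; // bytes already placed in the split being filled
        for (long size : fileSizes) {
            if (current > 0 && current + size > maxSplitBytes) {
                splits++;     // the next file does not fit, so close this split
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            splits++;         // count the last, partially filled split
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10,000 files of 1 MiB each, packed with the split size from the command above.
        List<Long> files = Collections.nCopies(10000, 1048576L);
        System.out.println(countSplits(files, 524288000L));
    }
}
```

With 10,000 files of 1 MiB each this model gives 20 splits. TextInputFormat creates at least one split per file, so the same input would launch 10,000 mappers.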
Example of CombineFileInputFormat
Source: https://stackoverflow.com/questions/19485535/is-there-a-combine-input-format-for-hadoop-streaming