Is there a combine Input format for hadoop streaming?

Posted by 烈酒焚心 on 2019-12-23 03:44:08

Question


I have many small input files, and I want to combine them with an input format such as CombineFileInputFormat so that fewer mapper tasks are launched. I know I can do this with the Java API, but I don't know whether there is a streaming jar library that supports it when I'm using Hadoop streaming.


Answer 1:


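Hadoop streaming uses TextInputFormat by default, but any other input format can be used, including CombineFileInputFormat. You can change the input format from the command line with the -inputformat option. Be sure to use the old API: org.apache.hadoop.mapred.lib.CombineFileInputFormat is abstract, so you have to extend it with your own class and put that class on the job's classpath. The new API isn't supported yet. Setting stream.map.input.ignoreKey=true keeps streaming from prepending each record's key to the line it passes to the mapper, which it otherwise does for any input format other than TextInputFormat.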

# Generic options (-D) must come before streaming options such as -inputformat.
$HADOOP_HOME/bin/hadoop jar \
      $HADOOP_HOME/hadoop-streaming.jar \
      -Dmapred.max.split.size=524288000 \
      -Dstream.map.input.ignoreKey=true \
      -inputformat foo.bar.MyCombineFileInputFormat \
      ...
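
The split size in the command above is given in bytes, and 524288000 corresponds to 500 MiB. As a minimal sketch (the helper name split_size_bytes is hypothetical and not part of Hadoop), the value for a different target size could be derived like this:

```shell
#!/bin/sh
# Hypothetical helper: convert a target split size in MiB to bytes.
split_size_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# 500 MiB, the size used in the streaming command above.
max_split=$(split_size_bytes 500)

# Print the generic option to pass to hadoop-streaming.jar.
echo "-Dmapred.max.split.size=${max_split}"
```

A larger value packs more small files into each combined split and so launches fewer mappers. On Hadoop 2 and later, mapred.max.split.size is a deprecated name for mapreduce.input.fileinputformat.split.maxsize.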

Example of CombineFileInputFormat



Source: https://stackoverflow.com/questions/19485535/is-there-a-combine-input-format-for-hadoop-streaming
