Generating Separate Output Files in Hadoop Streaming

Submitted by 陌路散爱 on 2019-12-18 11:13:40

Question


Using only a mapper (a Python script) and no reducer, how can I output a separate file with the key as the filename, for each line of output, rather than having long files of output?


Answer 1:


You can either write to a text file on the local filesystem using Python's file functions, or, if you want to write to HDFS, use the Thrift API.
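The first option can be sketched as a mapper that routes each line to a per-key file instead of emitting to stdout. This is a minimal illustration, not from the original answer; the tab-separated input format and the output directory name are assumptions, and note that each map task writes to its own node's local disk, not to HDFS:

```python
import os
import sys


def mapper(lines, out_dir="mapper_out"):
    """Append the value of each tab-separated "key\tvalue" line to a
    local file named after the key, instead of writing to stdout."""
    os.makedirs(out_dir, exist_ok=True)
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        # One file per key on the local filesystem of the task's node.
        with open(os.path.join(out_dir, key), "a") as f:
            f.write(value + "\n")


if __name__ == "__main__":
    mapper(sys.stdin)
```

Because the files live on the individual task nodes rather than in HDFS, you would still need some way to collect them afterwards.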




Answer 2:


The input and output format classes can be replaced via the -inputformat and -outputformat command-line parameters.
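A streaming invocation that swaps in a custom output format might look like the following command fragment. The streaming-jar path, HDFS input/output paths, and script names are placeholders (not from the original answer); the class name is the feathers class mentioned below:

```shell
# Hypothetical Hadoop Streaming invocation with a replaced output format.
# The jar containing the custom class must be shipped with -libjars.
hadoop jar "$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming.jar" \
    -libjars feathers.jar \
    -input  /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py \
    -outputformat fm.last.feathers.output.MultipleTextFiles
```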

One example of how to do this can be found in the dumbo project, a Python framework for writing streaming jobs. It has a feature for writing to multiple files, and internally it replaces the output format with a class from its sister project feathers: fm.last.feathers.output.MultipleTextFiles.

The reducer then needs to emit a tuple as key, with the first component of the tuple being the path to the directory where the files with the key/value pairs should be written. There might still be multiple files per directory; that depends on the number of reducers and the application.
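The key shape described above could be sketched like this. This is a hypothetical illustration of the tuple-key convention only; the exact key structure expected by feathers' MultipleTextFiles, and the "out/" directory prefix, are assumptions not confirmed by the original answer:

```python
def reducer(key, values):
    """Dumbo-style reducer sketch: emit a tuple key whose first
    component is the target output directory (consumed by the output
    format) and whose second component is the real key written into
    the file."""
    for value in values:
        # Route every pair for this key into the directory out/<key>/.
        yield ("out/%s" % key, key), value
```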

I recommend looking into dumbo; it has many features that make it easier to write Map/Reduce programs for Hadoop in Python.




Answer 3:


Is it possible to replace the outputFormatClass when using streaming? In a native Java implementation you would extend the MultipleTextOutputFormat class and override the method that names the output file, then register your implementation via JobConf's setOutputFormat method.

You should verify whether this is possible with streaming too. I don't know. :-/



Source: https://stackoverflow.com/questions/1626786/generating-separate-output-files-in-hadoop-streaming
