Hadoop streaming: single file or multiple files per map. Don't split

Varun Shingal

You can find the solution here:

http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F

The easiest way I would suggest is to set mapred.min.split.size to a value larger than your biggest input file, so that no file gets split across mappers.
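For example, something like this (a rough sketch: the jar location varies by Hadoop version, the 10 GB value and the paths/scripts are placeholders, and on Hadoop 2+ the property is named mapreduce.input.fileinputformat.split.minsize; note that -D is a generic option and must come before the streaming options):

# Force a minimum split size of 10 GB so files smaller than that are never split.
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -D mapred.min.split.size=10737418240 \
  -input /path/to/data \
  -output /path/to/output \
  -mapper my_mapper.py \
  -reducer my_reducer.py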

If this does not work, you would need to implement a custom InputFormat, which is not very difficult to do; you can find the steps at http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat, and a sketch of how to plug it into a streaming job follows.
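The usual trick (see the FAQ link above) is to subclass an input format such as org.apache.hadoop.mapred.TextInputFormat, override isSplitable() to return false, compile it into a jar, and hand the class to the streaming job. Roughly like this, where com.example.NonSplittableTextInputFormat and myformats.jar are hypothetical names for your class and jar:

# -libjars ships your jar to the cluster; -inputformat names your class.
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -libjars myformats.jar \
  -inputformat com.example.NonSplittableTextInputFormat \
  -input /path/to/data \
  -output /path/to/output \
  -mapper my_mapper.py \
  -reducer my_reducer.py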

Rather than depending on the minimum split size, an easier way I would suggest is to gzip your files: Hadoop cannot split gzip-compressed files, so each file goes to exactly one mapper.

gzip itself is available here:

http://www.gzip.org/

If you are on Linux, you can compress the data recursively with:

gzip -r /path/to/data

Now pass this compressed data as the input to your Hadoop streaming job, for example:
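Something like the following should work (paths and scripts are placeholders); no split-size tuning or custom InputFormat is needed, because Hadoop decompresses gzip input transparently before your mapper sees it:

# One map task per .gz file; the built-in gzip codec decompresses the input.
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input /path/to/data \
  -output /path/to/output \
  -mapper my_mapper.py \
  -reducer my_reducer.py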
