Error when running python map reduce job using Hadoop streaming in Google Cloud Dataproc environment

强颜欢笑 提交于 2020-07-05 04:55:34

问题


I want to run python map reduce job in Google Cloud Dataproc using hadoop streaming method. My map reduce python script, input file and job result output are located in Google Cloud Storage.

I tried to run this command

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py -reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py -input gs://bucket-name/intro_to_mapreduce/purchases.txt -output gs://bucket-name/intro_to_mapreduce/output_prod_cat

But I got this error output :

File: /home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py does not exist, or is not readable.

Try -help for more information Streaming Command Failed!

Is cloud connector not working in hadoop streaming? Is there any other way to run python map reduce job using hadoop streaming with python script and input file located in Google Cloud Storage ?

Thank You


回答1:


The -file option from hadoop-streaming only works for local files. Note however, that its help text mentions that the -file flag is deprecated in favor of the generic -files option. Using the generic -files option allows us to specify a remote (hdfs / gs) file to stage. Note also that generic options must precede application specific flags.

Your invocation would become:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -mapper mapper_prod_cat.py \
    -reducer reducer_prod_cat.py \
    -input gs://bucket-name/intro_to_mapreduce/purchases.txt \
    -output gs://bucket-name/intro_to_mapreduce/output_prod_cat


来源:https://stackoverflow.com/questions/48003377/error-when-running-python-map-reduce-job-using-hadoop-streaming-in-google-cloud

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!