How do I pass a parameter to a python Hadoop streaming job?

被刻印的时光 ゝ 提交于 2019-11-30 03:04:20

问题


For a python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves different based on the parameter being passed in?

I understand that streaming jobs are called in the format of:

hadoop jar hadoop-streaming.jar -input -output -mapper mapper.py -reducer reducer.py ...

I want to affect reducer.py.


回答1:


The argument to the command line option -reducer can be any command, so you can try:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputDirs \
    -output outputDir \
    -mapper myMapper.py \
    -reducer 'myReducer.py 1 2 3' \
    -file myMapper.py \
    -file myReducer.py

assuming myReducer.py is made executable. Disclaimer: I have not tried it, but I have passed similar complex strings to -mapper and -reducer before.

That said, have you tried the

-cmdenv name=value

option, and just have your Python reducer get its value from the environment? It's just another way to do things.




回答2:


In your Python code,

import os
(...)
os.environ["PARAM_OPT"]

In your Hapdoop command include:

hadoop jar \
(...)
-cmdenv PARAM_OPT=value\
(...)



回答3:


If you are using python you may want to check out dumbo which provides a nice wrapper around hadoop streaming. In dumbo you pass parameters with -param as in :

dumbo start yourpython.py -hadoop <hadoop-path> -input <input> -output <output>  -param <parameter>=<value>

And then read it in the reducer

def reducer:
def __init__(self):
    self.parmeter = int(self.params["<parameter>"])
def __call__(self, key, values):
    do something interesting ...

You can read more in the dumbo tutorial




回答4:


You can -reducer as the below command

hadoop jar hadoop-streaming.jar \
-mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
-reducer 'count_reducer.py arg3' -file count_reducer.py \

you can revise this Link



来源:https://stackoverflow.com/questions/9509063/how-do-i-pass-a-parameter-to-a-python-hadoop-streaming-job

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!