JQ, Hadoop: taking command from a file

Submitted by 房东的猫 on 2021-02-19 08:33:00

Question


I have been enjoying the powerful filters provided by JQ (Doc).

Twitter's public API returns nicely formatted JSON. I have access to a large amount of this data and to a Hadoop cluster. So instead of loading the files in Pig using Elephantbird, I decided to try JQ as a streaming mapper to see if it is any faster.

Here is my final query:

nohup hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar \
    -files $HOME/bin/jq \
    -D mapreduce.map.memory.mb=2048 \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -mapper "./jq --raw-output 'select((.lang == \"en\") and (.entities.hashtags | length > 0)) | .entities.hashtags[] as \$tags | [.id_str, .user.id_str, .created_at, \$tags.text] | @csv'" \
    -reducer NONE \
    -input /path/to/input/*.json.gz \
    -output /path/to/output \
    &

This distributes my local jq executable to every compute node and tells each node to run my command with it on its stdin stream.

The query is long enough that I ran into quoting and formatting issues between bash and JQ.

I wish I could have written something like this:

nohup hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar \
        -files $HOME/bin/jq,$PROJECT_DIR/cmd.jq \
        -D mapreduce.map.memory.mb=2048 \
        -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -mapper "./jq --raw-output --run-cmd-file=cmd.jq" \
        -reducer NONE \
        -input /path/to/input/*.json.gz \
        -output /path/to/output \
        &

That is, I would put my command in a file, ship it to the compute nodes, and point jq at it with an option.


Answer 1:


It looks like you somehow missed the -f FILE option!
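For illustration, here is a minimal sketch of how the filter from the question could be moved into a file and read with -f (long form --from-file). The file name cmd.jq comes from the question; the sample tweet is made up purely for a local smoke test, and the jq call is skipped if jq is not on the PATH.

```shell
# Write the filter to its own file. The quoted heredoc delimiter ('EOF') stops
# bash from interpreting the double quotes and the $tags variable, so no
# backslash escaping is needed.
cat > cmd.jq <<'EOF'
select((.lang == "en") and (.entities.hashtags | length > 0))
| .entities.hashtags[] as $tags
| [.id_str, .user.id_str, .created_at, $tags.text]
| @csv
EOF

# Hypothetical sample record, only for checking the filter locally.
sample='{"lang":"en","id_str":"1","user":{"id_str":"2"},"created_at":"Thu Feb 18 00:00:00 +0000 2016","entities":{"hashtags":[{"text":"jq"}]}}'

# Run the filter from the file, as the mapper would on each node's stdin.
if command -v jq >/dev/null 2>&1; then
    result=$(printf '%s\n' "$sample" | jq --raw-output -f cmd.jq)
    printf '%s\n' "$result"
fi
```

On the cluster, the mapper line would then presumably shrink to -mapper "./jq --raw-output -f cmd.jq", with cmd.jq added to the -files list exactly as in the wished-for command in the question.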



Source: https://stackoverflow.com/questions/35484244/jq-hadoop-taking-command-from-a-file
