Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1


隐瞒了意图╮ 2020-12-06 02:35

Hey, I'm fairly new to the world of Big Data. I came across this tutorial: http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/

It d

4 Answers

萌比男神i (original poster)

2020-12-06 03:11
I got the same error (subprocess failed with code 1) when running this command:
```
[cloudera@quickstart ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input -output /user/cloudera/output_join -mapper /home/cloudera/join1_mapper.py -reducer /home/cloudera/join1_reducer.py
```
1. This error usually occurs because Hadoop cannot access your input files, or because the input directory contains extra files or is missing required ones. Be very careful about which files are in the input directory: place only the input files the assignment requires there and remove the rest.
2. Make sure your mapper and reducer files are executable: `chmod +x mapper.py` and `chmod +x reducer.py`.
3. Run the mapper and reducer Python files locally using cat before submitting the job.
   Mapper only: `cat join2_gen*.txt | ./mapper.py | sort`
   Mapper and reducer: `cat join2_gen*.txt | ./mapper.py | sort | ./reducer.py`
   Running them through cat lets you find and fix errors in your input files or scripts before you run on the Hadoop cluster. The Hadoop error message reports only the exit code, so the underlying Python error is often not visible in the job output.
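Steps 2 and 3 can also be scripted. The following is a minimal sketch, not part of the original answer: the function names `preflight`, `run_stage`, and `simulate` are hypothetical, and the shebang check is an added assumption (a script launched as `./mapper.py` needs a `#!` line to be executed directly). It emulates `mapper | sort | reducer` locally and reports which stage exits with a non-zero code, along with that stage's stderr.

```python
import os
import subprocess


def preflight(script_path):
    """Return a list of problems that would stop the script from launching."""
    if not os.path.isfile(script_path):
        return ["not found: " + script_path]
    problems = []
    # Step 2 of the answer: the script must be executable (chmod +x).
    if not os.access(script_path, os.X_OK):
        problems.append("not executable (run chmod +x): " + script_path)
    # Assumption beyond the answer: a directly executed script needs a shebang.
    with open(script_path, "rb") as handle:
        if not handle.readline().startswith(b"#!"):
            problems.append("missing shebang line: " + script_path)
    return problems


def run_stage(script_path, input_text):
    """Run one script with input_text on stdin; return (exit_code, stdout, stderr)."""
    result = subprocess.run(
        [os.path.abspath(script_path)],  # absolute path, so PATH is not searched
        input=input_text,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


def simulate(mapper_path, reducer_path, input_text):
    """Emulate `mapper | sort | reducer`; stop at the first failing stage.

    Returns (stage, exit_code, text), where text is the stderr of the failing
    stage, or the reducer's stdout when stage is "ok".
    """
    code, mapped, err = run_stage(mapper_path, input_text)
    if code != 0:
        return "mapper", code, err
    # Plain Python sort stands in for the shell `sort` between the stages.
    shuffled = "".join(line + "\n" for line in sorted(mapped.splitlines()))
    code, reduced, err = run_stage(reducer_path, shuffled)
    if code != 0:
        return "reducer", code, err
    return "ok", 0, reduced
```

A typical use would be to call `preflight` on both scripts first, fix anything it reports, and then pass the contents of one small input file to `simulate`. If the first element of the result is `"mapper"` or `"reducer"`, the third element holds the Python traceback that the Hadoop job output did not show.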