Hey, I'm fairly new to the world of Big Data. I came across this tutorial: http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
It d
I got the same error (sub-process failed with code 1) when running this command:
[cloudera@quickstart ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input -output /user/cloudera/output_join -mapper /home/cloudera/join1_mapper.py -reducer /home/cloudera/join1_reducer.py
This usually happens because Hadoop is unable to access your input files, or because the input directory contains more than is required, or is missing something. So be very careful with the input directory and the files you have in it. I would place only the input files the assignment requires in the input directory and remove the rest.
Also make sure your mapper and reducer files are executable:
chmod +x mapper.py
chmod +x reducer.py
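If you want to double-check this before submitting the job, here is a minimal Python sketch. The function name check_streaming_script is made up for illustration. It assumes the scripts are passed to Hadoop streaming as executables (as in the command above), so besides the execute bit it also looks for a #! line at the top, since a script that is run directly needs one.

```python
import os
import stat


def check_streaming_script(path):
    """Return a list of problems that would stop the script from being run directly.

    Hypothetical helper, not part of Hadoop: it only inspects the file locally.
    """
    if not os.path.isfile(path):
        return ["file not found: " + path]

    problems = []

    # chmod +x sets the execute bits; check that at least one of them is set.
    mode = os.stat(path).st_mode
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        problems.append("not executable (run chmod +x)")

    # A script executed directly needs an interpreter line as its first line.
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        problems.append("missing #! line, e.g. #!/usr/bin/env python")

    return problems


if __name__ == "__main__":
    for script in ("mapper.py", "reducer.py"):
        print(script, check_streaming_script(script) or "looks OK")
```

An empty list means neither of these two problems was found. It does not prove the script itself is free of Python errors.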
Test the mapper and reducer Python files locally using cat. Using only the mapper:
cat join2_gen*.txt | ./mapper.py | sort
Using the mapper and the reducer together:
cat join2_gen*.txt | ./mapper.py | sort | ./reducer.py
The reason for running them with cat is that if your input files or scripts have any errors, you can find and fix them before you run on the Hadoop cluster. The map/reduce job output often does not show the underlying Python error!
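The same local check can be scripted so that a failing stage reports its exit code and stderr. This is only a rough sketch, assuming Python 3.7 or newer on the machine where you test. The names run_stage and simulate_streaming are made up for illustration, and the in-memory sort only approximates what sort (or Hadoop's shuffle) does.

```python
import subprocess
import sys


def run_stage(command, text):
    """Feed text to the command's stdin and return its stdout.

    Raises RuntimeError with the exit code and stderr if the command fails.
    """
    result = subprocess.run(command, input=text, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            "%r exited with code %d:\n%s" % (command, result.returncode, result.stderr)
        )
    return result.stdout


def simulate_streaming(mapper_cmd, reducer_cmd, text):
    """Roughly mimic: cat input | mapper | sort | reducer."""
    mapped = run_stage(mapper_cmd, text)
    # Sort the mapper output line by line, as the sort step in the pipeline does.
    lines = sorted(mapped.splitlines())
    shuffled = "".join(line + "\n" for line in lines)
    return run_stage(reducer_cmd, shuffled)


if __name__ == "__main__":
    # Assumes mapper.py and reducer.py are in the current directory and executable;
    # the input is read from stdin, e.g. cat join2_gen*.txt | python this_script.py
    sys.stdout.write(
        simulate_streaming(["./mapper.py"], ["./reducer.py"], sys.stdin.read())
    )
```

If the mapper or reducer crashes, the RuntimeError message contains the Python traceback from that script, which is the part that is easy to miss on the cluster.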