mrjob

How does one specify the input file for a runner from Python?

為{幸葍}努か submitted on 2019-12-04 05:20:31
I am writing an external script to run a MapReduce job via the Python mrjob module on my laptop (not on Amazon Elastic Compute Cloud or any large cluster). The mrjob documentation says I should use MRJob.make_runner() to run a MapReduce job from a separate Python script, like this:

    mr_job = MRYourJob(args=['-r', 'emr'])
    with mr_job.make_runner() as runner:
        ...

However, how do I specify which input file to use? I want to use a file "datalines.txt" in the same directory as my MapReduce script and the other Python script that runs it. Furthermore, how do I specify the output?

Multiple Inputs with MRJob

六眼飞鱼酱① submitted on 2019-11-30 05:21:54
I'm trying to learn to use Yelp's Python framework for MapReduce, MRJob. Their simple word-counter example makes sense, but I'm curious how one would handle an application involving multiple inputs. For instance, rather than simply counting the words in a document, multiplying a vector by a matrix. I came up with this solution, which functions, but feels silly:

    class MatrixVectMultiplyTast(MRJob):
        def multiply(self, key, line):
            line = map(float, line.split(" "))
            v, col = line[-1], line[:-1]
            for i in xrange(len(col)):
                yield i, col[i] * v

        def sum(self, i, occurrences):
            yield i, sum(occurrences)

        def steps(self):
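For reference, the idiomatic way to finish that steps() method in current mrjob is to return a list of MRStep objects (from mrjob.step import MRStep), one per mapper/reducer pair. The data flow the question describes, each input line holding one matrix column plus the matching vector component, can be sketched in plain Python independent of mrjob (all names below are illustrative):

```python
from collections import defaultdict


def multiply_mapper(line):
    # Each input line: the entries of one matrix column, then the
    # matching vector component as the last field.
    values = [float(x) for x in line.split()]
    v, col = values[-1], values[:-1]
    for i, a in enumerate(col):
        yield i, a * v


def sum_reducer(i, partial_products):
    yield i, sum(partial_products)


def run_mapreduce(lines):
    # Shuffle phase: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for line in lines:
        for key, value in multiply_mapper(line):
            groups[key].append(value)
    return dict(kv for key in sorted(groups)
                for kv in sum_reducer(key, groups[key]))


# A = [[1, 2], [3, 4]] stored column-wise, v = [5, 6]:
# line 0 = column 0 plus v[0], line 1 = column 1 plus v[1].
lines = ["1 3 5", "2 4 6"]
print(run_mapreduce(lines))  # A·v = {0: 1*5 + 2*6, 1: 3*5 + 4*6} = {0: 17.0, 1: 39.0}
```

In mrjob itself, the same mapper and reducer would be wired up with `return [MRStep(mapper=self.multiply, reducer=self.sum)]`; a second MRStep in that list is how a genuinely multi-stage pipeline is expressed.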

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

冷暖自知 submitted on 2019-11-27 23:05:40
Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Well, I'm trying to run this on my own Hadoop cluster. I ran the job using the following command:

    python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic

And this is what I get:

    HADOOP: Running job: job_1369345811890_0245
    HADOOP: Job job_1369345811890_0245 running in uber mode : false
    HADOOP:  map 0% reduce 0%
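"PipeMapRed.waitOutputThreads(): subprocess failed with code 1" means Hadoop streaming launched the script and the script itself exited non-zero, typically an uncaught Python exception or a module missing on the task nodes; rerunning the same job with `-r local` (or with `-r inline`) usually surfaces the real traceback. A minimal simulation of what Hadoop sees when a mapper crashes (the crashing script below is purely illustrative):

```python
import subprocess
import sys

# A "mapper" that raises on its first input record, the way a buggy
# streaming script would on the cluster.
BAD_MAPPER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    raise ValueError('cannot parse: ' + line.strip())\n"
)

proc = subprocess.run(
    [sys.executable, "-c", BAD_MAPPER],
    input=b"some record\n",
    stderr=subprocess.DEVNULL,  # hide the traceback; Hadoop only sees the exit code
)
print("mapper exit code:", proc.returncode)  # prints: mapper exit code: 1
```

Hadoop reports exactly that non-zero exit code, so the traceback has to be recovered from the task attempt logs or by reproducing the crash off-cluster as above.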