mrjob

How does one specify the input file for a runner from Python?

為{幸葍}努か submitted on 2019-12-04 05:20:31
I am writing an external script to run a MapReduce job via the Python mrjob module on my laptop (not on Amazon Elastic Compute Cloud or any large cluster). The mrjob documentation says I should use MRJob.make_runner() to run a MapReduce job from a separate Python script, like this:

    mr_job = MRYourJob(args=['-r', 'emr'])
    with mr_job.make_runner() as runner:
        ...

However, how do I specify which input file to use? I want to use a file "datalines.txt" in the same directory as my MapReduce script and the other Python script that runs it. Furthermore, how do I specify the output?

Multiple Inputs with MRJob

六眼飞鱼酱① submitted on 2019-11-30 05:21:54
I'm trying to learn to use Yelp's Python framework for MapReduce, MRJob. Their simple word-counter example makes sense, but I'm curious how one would handle an application involving multiple inputs. For instance, rather than simply counting the words in a document, multiplying a vector by a matrix. I came up with this solution, which functions, but feels silly:

    class MatrixVectMultiplyTast(MRJob):
        def multiply(self, key, line):
            line = map(float, line.split(" "))
            v, col = line[-1], line[:-1]
            for i in xrange(len(col)):
                yield i, col[i] * v

        def sum(self, i, occurrences):
            yield i, sum(occurrences)

        def steps(self):
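For reference, the idiomatic way to finish that steps() method in current mrjob is to return a list of MRStep objects (from mrjob.step import MRStep), one per mapper/reducer pair. The data flow the question describes, each input line holding one matrix column plus the matching vector component, can be sketched in plain Python independent of mrjob (all names below are illustrative):

```python
from collections import defaultdict


def multiply_mapper(line):
    # Each input line: the entries of one matrix column, then the
    # matching vector component as the last field.
    values = [float(x) for x in line.split()]
    v, col = values[-1], values[:-1]
    for i, a in enumerate(col):
        yield i, a * v


def sum_reducer(i, partial_products):
    yield i, sum(partial_products)


def run_mapreduce(lines):
    # Shuffle phase: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for line in lines:
        for key, value in multiply_mapper(line):
            groups[key].append(value)
    return dict(kv for key in sorted(groups)
                for kv in sum_reducer(key, groups[key]))


# A = [[1, 2], [3, 4]] stored column-wise, v = [5, 6]:
# line 0 = column 0 plus v[0], line 1 = column 1 plus v[1].
lines = ["1 3 5", "2 4 6"]
print(run_mapreduce(lines))  # A·v = {0: 1*5 + 2*6, 1: 3*5 + 4*6} = {0: 17.0, 1: 39.0}
```

In mrjob itself, the same mapper and reducer would be wired up with `return [MRStep(mapper=self.multiply, reducer=self.sum)]`; a second MRStep in that list is how a genuinely multi-stage pipeline is expressed.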

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

冷暖自知 submitted on 2019-11-27 23:05:40
Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Well, I'm trying to run this on my own Hadoop cluster. I ran the job using the following command:

    python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic

And this is what I get:

    HADOOP: Running job: job_1369345811890_0245
    HADOOP: Job job_1369345811890_0245 running in uber mode : false
    HADOOP:  map 0% reduce 0%
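"PipeMapRed.waitOutputThreads(): subprocess failed with code 1" means Hadoop streaming launched the script and the script itself exited non-zero, typically an uncaught Python exception or a module missing on the task nodes; rerunning the same job with `-r local` (or with `-r inline`) usually surfaces the real traceback. A minimal simulation of what Hadoop sees when a mapper crashes (the crashing script below is purely illustrative):

```python
import subprocess
import sys

# A "mapper" that raises on its first input record, the way a buggy
# streaming script would on the cluster.
BAD_MAPPER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    raise ValueError('cannot parse: ' + line.strip())\n"
)

proc = subprocess.run(
    [sys.executable, "-c", BAD_MAPPER],
    input=b"some record\n",
    stderr=subprocess.DEVNULL,  # hide the traceback; Hadoop only sees the exit code
)
print("mapper exit code:", proc.returncode)  # prints: mapper exit code: 1
```

Hadoop reports exactly that non-zero exit code, so the traceback has to be recovered from the task attempt logs or by reproducing the crash off-cluster as above.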