mrjob

mrjob: how does the example automatically know how to find lines in a text file?

跟風遠走 submitted on 2019-12-10 10:06:22
Question: I'm trying to understand the mrjob example better:

    from mrjob.job import MRJob

    class MRWordFrequencyCount(MRJob):

        def mapper(self, _, line):
            yield "chars", len(line)
            yield "words", len(line.split())
            yield "lines", 1

        def reducer(self, key, values):
            yield key, sum(values)

    if __name__ == '__main__':
        MRWordFrequencyCount.run()

I run it with

    $ python word_count.py my_file.txt

and it works as expected, but I don't get how it automatically knows that it's going to read a text file and split it by each line.
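For context, here is a minimal sketch (not mrjob source; the helper names and the hard-coded path are illustrative) of what the library does by default: its default input protocol, RawValueProtocol, reads the input paths given on the command line (or stdin) line by line and calls mapper(None, line) once per line, which is why the key parameter is conventionally written as _.

    # Hedged sketch assuming mrjob's documented default input handling;
    # function names below are illustrative, not part of mrjob.

    def word_count_mapper(_, line):
        # Same logic as MRWordFrequencyCount.mapper in the question.
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def emulate_default_input(path):
        """Feed a text file to the mapper the way mrjob's RawValueProtocol does."""
        totals = {}
        with open(path) as f:
            for raw in f:
                line = raw.rstrip("\r\n")              # the value handed to mapper()
                for key, value in word_count_mapper(None, line):
                    totals[key] = totals.get(key, 0) + value   # plays the reducer's role
        return totals

    print(emulate_default_input("my_file.txt"))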

How does one specify the input file for a runner from Python?

て烟熏妆下的殇ゞ submitted on 2019-12-09 18:11:34
Question: I am writing an external script to run a MapReduce job via the Python mrjob module on my laptop (not on Amazon Elastic Compute Cloud or any large cluster). I read in the mrjob documentation that I should use MRJob.make_runner() to run a MapReduce job from a separate Python script, as follows:

    mr_job = MRYourJob(args=['-r', 'emr'])
    with mr_job.make_runner() as runner:
        ...

However, how do I specify which input file to use? I want to use a file "datalines.txt" in the same directory as my script.
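A minimal sketch of one way to do this, assuming the pattern from the mrjob docs where input paths are passed as positional items in args (the module name below is hypothetical, and the output-reading calls cat_output()/parse_output() exist in newer mrjob releases; older releases used runner.stream_output() instead):

    # Hedged sketch: pass the input file as a positional arg; '-r inline'
    # runs everything locally in-process, so no Hadoop or EMR is needed.
    from my_job_module import MRYourJob   # hypothetical module holding the job class

    mr_job = MRYourJob(args=['-r', 'inline', 'datalines.txt'])
    with mr_job.make_runner() as runner:
        runner.run()
        for key, value in mr_job.parse_output(runner.cat_output()):
            print(key, value)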

MRjob: Can a reducer perform 2 operations?

血红的双手。 submitted on 2019-12-08 02:53:53
Question: I am trying to yield the probability of each key/value pair generated by the mapper. So, let's say the mapper yields:

    a, (r, 5)
    a, (e, 6)
    a, (w, 7)

I need to add 5 + 6 + 7 = 18 and then find the probabilities 5/18, 6/18, 7/18, so the final output from the reducer would look like:

    a, [[r, 5, 0.278], [e, 6, 0.33], [w, 7, 0.389]]

So far, I can only get the reducer to sum all the integers from the values. How can I make it go back and divide each instance by the total sum? Thanks!

Answer 1: Pai's solution is

Hadoop removes MapReduce history when it is restarted

混江龙づ霸主 submitted on 2019-12-08 02:49:35
Question: I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am basically testing with different numbers of datanodes in order to assess the linearity of the processing capacity and datanode scalability. During this process, I have obviously had to restart the entire Hadoop environment several times. Every time I restart Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison reasons, it is very important

Map-Reduce/Hadoop sort by integer value (using MRJob)

空扰寡人 submitted on 2019-12-06 12:18:25
Question: This is an MRJob implementation of simple Map-Reduce sorting functionality. In beta.py:

    from mrjob.job import MRJob

    class Beta(MRJob):

        def mapper(self, _, line):
            l = line.split(' ')
            yield l[1], l[0]

        def reducer(self, key, val):
            yield key, [v for v in val][0]

    if __name__ == '__main__':
        Beta.run()

I run it on the text:

    1 1
    2 4
    3 8
    4 2
    4 7
    5 5
    6 10
    7 11

One can run this using:

    cat <filename> | python beta.py

Now the issue is that the output is sorted as if the key were of type string (which is probably the case here).

MRjob: Can a reducer perform 2 operations?

久未见 submitted on 2019-12-06 05:03:26
I am trying to yield the probability of each key/value pair generated by the mapper. So, let's say the mapper yields:

    a, (r, 5)
    a, (e, 6)
    a, (w, 7)

I need to add 5 + 6 + 7 = 18 and then find the probabilities 5/18, 6/18, 7/18, so the final output from the reducer would look like:

    a, [[r, 5, 0.278], [e, 6, 0.33], [w, 7, 0.389]]

So far, I can only get the reducer to sum all the integers from the values. How can I make it go back and divide each instance by the total sum? Thanks!

ask417: Pai's solution is technically correct, but in practice this will give you a lot of strife, as setting the partitioning can be a
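For what it's worth, a common low-friction alternative (a sketch only, not Pai's answer or ask417's) is to buffer the values inside the reducer itself: the reducer already receives every (letter, count) pair for a given key in one call, so it can sum them first and then emit the probabilities, at the cost of holding that key's values in memory.

    # Minimal sketch of a buffering reducer; the class name and the input
    # line format ("key letter count") are illustrative assumptions.
    from mrjob.job import MRJob

    class MRKeyProbabilities(MRJob):

        def mapper(self, _, line):
            key, letter, count = line.split()
            yield key, (letter, int(count))

        def reducer(self, key, values):
            pairs = list(values)                      # buffer this key's values
            total = sum(count for _, count in pairs)  # e.g. 5 + 6 + 7 = 18
            yield key, [[letter, count, round(float(count) / total, 3)]
                        for letter, count in pairs]

    if __name__ == '__main__':
        MRKeyProbabilities.run()

Buffering is fine as long as no single key has more values than fit in memory; otherwise a multi-step job that computes the per-key totals first is the safer route.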

How to specifically determine input for each map step in MRJob?

守給你的承諾、 submitted on 2019-12-06 04:31:11
Question: I am working on a map-reduce job consisting of multiple steps. Using mrjob, every step receives the previous step's output. The problem is that I don't want it to. What I want is to extract some information and use it in the second step against all input, and so on. Is it possible to do this using mrjob? Note: Since I don't want to use EMR, this question is not much help to me. UPDATE: If it is not possible to do this in a single job, I need to do it in two separate jobs. In this case, is there any way to wrap these two jobs and manage intermediate outputs, etc.?

mrjob: how does the example automatically know how to find lines in a text file?

妖精的绣舞 submitted on 2019-12-05 19:20:55
I'm trying to understand the mrjob example better:

    from mrjob.job import MRJob

    class MRWordFrequencyCount(MRJob):

        def mapper(self, _, line):
            yield "chars", len(line)
            yield "words", len(line.split())
            yield "lines", 1

        def reducer(self, key, values):
            yield key, sum(values)

    if __name__ == '__main__':
        MRWordFrequencyCount.run()

I run it with

    $ python word_count.py my_file.txt

and it works as expected, but I don't get how it automatically knows that it's going to read a text file and split it by each line, and I'm not sure what the _ does either. From what I understand, the mapper() generates the

Map-Reduce/Hadoop sort by integer value (using MRJob)

浪尽此生 submitted on 2019-12-04 19:33:52
This is an MRJob implementation of simple Map-Reduce sorting functionality. In beta.py:

    from mrjob.job import MRJob

    class Beta(MRJob):

        def mapper(self, _, line):
            l = line.split(' ')
            yield l[1], l[0]

        def reducer(self, key, val):
            yield key, [v for v in val][0]

    if __name__ == '__main__':
        Beta.run()

I run it on the text:

    1 1
    2 4
    3 8
    4 2
    4 7
    5 5
    6 10
    7 11

One can run this using:

    cat <filename> | python beta.py

Now the issue is that the output is sorted as if the key were of type string (which is probably the case here). The output is:

    "1"     "1"
    "10"    "6"
    "11"    "7"
    "2"     "4"
    "4"     "2"
    "5"     "5"
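One simple workaround (a sketch under the assumption that the keys are non-negative integers of bounded width; it is not the only fix) is to zero-pad the numeric key in the mapper so that string order matches numeric order, then strip the padding in the reducer:

    # Hedged sketch: zero-pad keys so the shuffle's string sort behaves
    # like a numeric sort.  The width (10) is an assumption about the
    # largest key; the class name is illustrative.
    from mrjob.job import MRJob

    class BetaNumericSort(MRJob):

        def mapper(self, _, line):
            value, key = line.split(' ')
            yield '%010d' % int(key), value          # '7' -> '0000000007'

        def reducer(self, key, values):
            yield int(key), next(iter(values))       # strip padding, keep first value

    if __name__ == '__main__':
        BetaNumericSort.run()

When the job actually runs on Hadoop (rather than with the inline/local runner), setting jobconf options so Hadoop's comparator sorts numerically is another route, but the zero-padding trick works with every runner.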

How to specifically determine input for each map step in MRJob?

你说的曾经没有我的故事 submitted on 2019-12-04 11:17:15
I am working on a map-reduce job consisting of multiple steps. Using mrjob, every step receives the previous step's output. The problem is that I don't want it to. What I want is to extract some information and use it in the second step against all input, and so on. Is it possible to do this using mrjob? Note: Since I don't want to use EMR, this question is not much help to me. UPDATE: If it is not possible to do this in a single job, I need to do it in two separate jobs. In this case, is there any way to wrap these two jobs and manage intermediate outputs, etc.?

You can use Runners. You will have to
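A hedged sketch of the runner-based wrapping mentioned above (the job class names, option name, and file names are hypothetical; the output-reading API is cat_output()/parse_output() in newer mrjob versions, stream_output() in older ones): run the first job, persist whatever it extracted, then hand that to the second job as a file alongside the original input.

    # Minimal driver sketch: chain two mrjob jobs "by hand" so the second
    # one sees both the original input and what the first one extracted.
    import json

    from extract_job import MRExtractInfo      # hypothetical first job
    from second_job import MRSecondPass        # hypothetical second job

    # Step 1: run the extraction job over the raw input.
    first = MRExtractInfo(args=['-r', 'inline', 'input.txt'])
    with first.make_runner() as runner:
        runner.run()
        extracted = dict(first.parse_output(runner.cat_output()))

    # Persist the intermediate result so the second job can load it.
    with open('extracted.json', 'w') as f:
        json.dump(extracted, f)

    # Step 2: run the second job over the ORIGINAL input, passing the
    # intermediate file via a (hypothetical) passthrough option.
    second = MRSecondPass(args=['-r', 'inline', 'input.txt',
                                '--extracted-file', 'extracted.json'])
    with second.make_runner() as runner:
        runner.run()
        for key, value in second.parse_output(runner.cat_output()):
            print(key, value)

MRSecondPass would need to declare --extracted-file itself (add_passthru_arg in current mrjob, add_passthrough_option in old versions) and load the JSON in, say, mapper_init.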