Map-Reduce/Hadoop sort by integer value (using MRJob)

Submitted by 空扰寡人 on 2019-12-06 12:18:25

Question


This is an MRJob implementation of a simple Map-Reduce sort. In beta.py:

from mrjob.job import MRJob

class Beta(MRJob):
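    def mapper(self, _, line):
        """Emit the second column as the key and the first as the value."""
        l = line.split(' ')
        yield l[1], l[0]

    def reducer(self, key, val):
        # Keep only the first value seen for each key.
        yield key, [v for v in val][0]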


if __name__ == '__main__':
    Beta.run()

I run it on the following input:

1 1
2 4
3 8
4 2
4 7
5 5
6 10
7 11

One can run this using:

cat <filename> | python beta.py

The issue is that the output is sorted as if the keys were strings (which is probably the case here), so "10" and "11" come before "2". The output is:

"1"     "1"
"10"    "6"
"11"    "7"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"

The output that I want is:

"1"     "1"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"
"10"    "6"
"11"    "7"

I am not sure whether this can be fixed by changing the protocols in MRJob, since protocols are set per job and not per step.

EDIT (Solution): I have found an answer to this one. The idea is to prepend '0' characters to every number (zero-padding) so that each number has the same number of bytes as the largest number. At least that's what I remember from my classes. I cannot post this as an answer right now because the site won't let me, but it is the only solution I have. If anyone has something more transparent and easier, please share.
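To illustrate the padding idea outside of Hadoop, here is a minimal plain-Python sketch. The helper name pad_keys is made up for this example and is not part of MRJob; it assumes non-negative integers and that the largest number is known in advance.

```python
def pad_keys(numbers):
    """Zero-pad non-negative integers to the width of the largest one."""
    width = len(str(max(numbers)))
    return [str(n).zfill(width) for n in numbers]


# The keys from the sample input (second column).
values = [1, 4, 8, 2, 7, 5, 10, 11]
padded = pad_keys(values)

# Sorting the padded strings gives the same order as sorting the integers.
assert [int(s) for s in sorted(padded)] == sorted(values)
print(sorted(padded))
```

In a real job the largest key is usually not known up front, so a fixed width that is comfortably large is the more practical choice.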


Answer 1:


A simple solution is to zero-pad the keys in the mapper. (A more robust one might be to tune how Hadoop sorts the mapper output.)

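from mrjob.job import MRJob

class Beta(MRJob):

    def mapper(self, _, line):
        l = line.split()
        # Zero-pad the key to 10 digits so string order matches numeric order.
        yield '%010d' % int(l[1]), l[0]

    def reducer(self, key, values):
        # Convert the padded key back to an integer for output.
        yield int(key), int(list(values)[0])


if __name__ == '__main__':
    Beta.run()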
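To check the logic without installing MRJob or Hadoop, the map, sort and reduce steps can be imitated in plain Python. This is only a rough local stand-in, not how MRJob runs the job: the names map_line and run_locally are invented for this sketch, and the '%010d' format assumes non-negative integers with at most 10 digits.

```python
from itertools import groupby

SAMPLE = """1 1
2 4
3 8
4 2
4 7
5 5
6 10
7 11
"""


def map_line(line):
    """Mimic the mapper: zero-padded second column as key, first as value."""
    first, second = line.split()
    return '%010d' % int(second), first


def run_locally(text):
    """Map every line, sort by the string key, then reduce each key group."""
    pairs = [map_line(line) for line in text.splitlines() if line.strip()]
    pairs.sort(key=lambda kv: kv[0])  # string sort, like the shuffle phase
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [value for _, value in group]
        # Mimic the reducer: keep the first value, convert back to integers.
        results.append((int(key), int(values[0])))
    return results


for key, value in run_locally(SAMPLE):
    print(key, value)
```

On the sample input this prints the keys in numeric order, ending with 10 and 11. Negative numbers would need a different encoding, since the minus sign breaks the string ordering.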


Source: https://stackoverflow.com/questions/20156817/map-reduce-hadoop-sort-by-integer-value-using-mrjob
