Hadoop streaming job fails when using a Python combiner

Submitted by 雨燕双飞 on 2019-12-23 19:37:33

Question


I am trying to use Hadoop streaming with Python to compute the average value for each input key. Here is the code for the mapper, combiner, and reducer:

#mapper:

import sys

def map(argv):
    line = sys.stdin.readline()
    try:
        while line:
            word, num = line.split()
            num = int(num)
            print word+'\t'+str(num)
            line = sys.stdin.readline()
    except Exception, ex:
        print 'mapper ex:'+str(ex)
        return None

if __name__ == "__main__":
    map(sys.argv)

#combiner
import sys

def combine(argv):
    line = sys.stdin.readline()
    cur_word = ''
    cur_num = 0
    cur_times = 0
    try:
        while line:
            word, num = line.split('\t')

            if word != cur_word:
                if cur_word != '':
                    print cur_word+'\t'+str(cur_num)+'\t'+str(cur_times)
                cur_word = word
                cur_num = 0
                cur_times = 0
            cur_num += int(num)
            cur_times += 1
            line = sys.stdin.readline()
        print cur_word+'\t'+str(cur_num)+'\t'+str(cur_times)

    except Exception, ex:
        print 'except:{0}'.format(ex)
        return None

if __name__ == "__main__":
    combine(sys.argv)


#reducer

import sys

def reduce(argv):
    line = sys.stdin.readline()
    cur_word = ''
    cur_num = 0
    cur_times = 0
    try:
        while line:
            word, num, times = line.split('\t')
            if word != cur_word:
                if cur_word != '':
                    if cur_times != 0:
                        avr = cur_num / cur_times
                        print cur_word+'\t'+str(cur_num)+'\t'+str(cur_times)+'\t'+str(avr)
                    else:
                        print cur_word+'\t'+str(cur_num)+'\t'+str(cur_times)+'\t'+'0'
                cur_word = word
                cur_num = 0
                cur_times = 0
            cur_num += int(num)
            cur_times += int(times)

            line = sys.stdin.readline()

        if cur_times != 0:
            avr = cur_num / cur_times
            print cur_word+'\t'+str(cur_num)+'\t'+str(cur_times)+'\t'+str(avr)
        else:
            print cur_word+'\t'+str(cur_num)+'\t'+str(cur_times)+'\t'+'0'

    except Exception, ex:
        print 'except:{0}'.format(ex)
        return None

if __name__ == "__main__":
    reduce(sys.argv)

This looks like a simple map-combine-reduce process, but the reduce step fails every time. However, if I run the job without a combiner and use combiner.py as the reducer, it works.

I would appreciate any help. Thanks a lot.
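One possible cause, offered as an assumption rather than a confirmed diagnosis: Hadoop treats a combiner as an optional optimization and may apply it zero, one, or several times to a given map output. The combiner above reads two fields and writes three, while the reducer always expects three, so any record that skips the combiner (or passes through it twice) would break the `split('\t')` unpacking. The sketch below shows one way to make both stages tolerant of either format. It is written in Python 3 syntax, unlike the Python 2 scripts above, and the names `parse_record`, `aggregate`, `run_combiner`, and `run_reducer` are hypothetical helpers, not part of Hadoop.

```python
import sys
from itertools import groupby


def parse_record(line):
    """Return (word, total, count) from a 2-field or 3-field tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 2:
        # Raw mapper output: a single observation, so the count is 1.
        word, num = fields
        return word, int(num), 1
    # Already-combined output: a partial sum together with its count.
    word, total, count = fields
    return word, int(total), int(count)


def aggregate(lines):
    """Yield (word, total, count) for each run of consecutive equal keys."""
    records = (parse_record(line) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda rec: rec[0]):
        total = 0
        count = 0
        for _, partial_total, partial_count in group:
            total += partial_total
            count += partial_count
        yield word, total, count


def run_combiner(lines, out=sys.stdout):
    """Emit word, partial sum, partial count; safe to apply repeatedly."""
    for word, total, count in aggregate(lines):
        out.write("%s\t%d\t%d\n" % (word, total, count))


def run_reducer(lines, out=sys.stdout):
    """Emit word, sum, count, and the average (true division, not floor)."""
    for word, total, count in aggregate(lines):
        average = total / count if count else 0
        out.write("%s\t%d\t%d\t%s\n" % (word, total, count, average))
```

In a streaming script, the combiner file would call `run_combiner(sys.stdin)` and the reducer file would call `run_reducer(sys.stdin)`. Because both accept raw mapper lines as well as combined lines, the result should not depend on how many times the combiner is applied. Errors are left to propagate instead of being printed to stdout, so that a malformed line does not end up in the job's data stream.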

Source: https://stackoverflow.com/questions/18375861/fails-when-using-hadoop-streaming-with-python-combiner
