How to get the Reducer to emit only duplicates

Submitted by 余生长醉 on 2019-12-12 05:27:56

Question


I have a Mapper that is going through lots of data and emitting ID numbers as keys with the value of 1. What I hope to accomplish with the MapReduce job is to get a list of all IDs that have been found more than one time across all data, which is a list of duplicate IDs. For example:

Mapper emits:
abc 1
efg 1
cba 1
abc 1
dhh 1

In this case, you can see that the ID 'abc' has been emitted more than one time by the Mapper.

How do I edit this Reducer so that it will only emit the duplicates? i.e. keys that have a value greater than 1:

import sys
import codecs

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
inData = codecs.getreader('utf-8')(sys.stdin)

(last_key, tot_cnt) = (None, 0)
for line in inData:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        sys.stdout.write("%s\t%s\n" % (last_key,tot_cnt))
        (last_key, tot_cnt) = (key, int(val))
    else:
        (last_key, tot_cnt) = (key, tot_cnt + int(val))

if last_key:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))

Answer 1:


You have made a mistake in two places.

  1. This code:

    if last_key and last_key != key:
        sys.stdout.write("%s\t%s\n" % (last_key,tot_cnt))
    

    should be changed to:

    if last_key != key:
        if(tot_cnt > 1):
            sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
    

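    You were not checking for tot_cnt > 1. Dropping the "last_key and" guard is safe here: tot_cnt starts at 0, so nothing is written when the first line is read.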

  2. The last two lines:

    if last_key:
        sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
    

    should be changed to:

    if last_key and tot_cnt > 1:
        sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
    

    Here again, you were not checking for tot_cnt > 1.

The following is the modified code, which works for me:

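import sys
import codecs

# Read and write UTF-8 on the standard streams.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
inData = codecs.getreader('utf-8')(sys.stdin)

# Input arrives sorted by key, so equal keys are adjacent.
(last_key, tot_cnt) = (None, 0)
for line in inData:
    (key, val) = line.strip().split("\t")
    if last_key != key:
        # Key changed: emit the previous key only if it was seen more than once.
        if tot_cnt > 1:
            sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
        (last_key, tot_cnt) = (key, int(val))
    else:
        # Same key as before: keep accumulating its count.
        (last_key, tot_cnt) = (key, tot_cnt + int(val))

# Flush the final key, again only if it is a duplicate.
if last_key and tot_cnt > 1:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))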

I get the following output for your data:

abc     2
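
As an alternative, the same filter can be sketched with itertools.groupby. This is only an illustration, not part of the original answer: the function name emit_duplicates and the sample list are made up for the example, and it assumes the reducer input is already sorted by key (equal keys adjacent), which is what the running-total logic above relies on as well. It is written for Python 3 and so leaves out the codecs wrappers.

```python
from itertools import groupby


def emit_duplicates(lines):
    """Yield (key, total) for every key whose summed count exceeds 1.

    Assumes `lines` is sorted by key, one "key<TAB>count" record per line.
    """
    # Split each non-empty line into its [key, count] fields.
    records = (line.strip().split("\t") for line in lines if line.strip())
    # groupby only merges adjacent records, hence the sorted-input assumption.
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(rec[1]) for rec in group)
        if total > 1:
            yield key, total


# Hypothetical sample: the mapper output from the question, sorted by key
# here to stand in for the sort that happens before the reducer runs.
sample = ["abc\t1", "efg\t1", "cba\t1", "abc\t1", "dhh\t1"]
for key, total in emit_duplicates(sorted(sample)):
    print("%s\t%d" % (key, total))
```

In an actual reducer script you would pass sys.stdin to emit_duplicates instead of the sorted sample list. Because the grouping and the threshold check sit in one place, there is no separate "flush the last key" step to forget.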


Source: https://stackoverflow.com/questions/34233451/how-to-get-the-reducer-to-emit-only-duplicates
