How to remove duplicate values using MapReduce

Question

I have a data set like the one below:

Key Value

k1 a1,b1,c1,d1

k2 a2,b1,c2,d2

k3 a3,b1,c3,d3

k4 a4,b1,c4,d4

k5 a5,b1,c5,d5

In the data set above, the keys are distinct, but one of the comma-separated values (b1) is common to every value set. My requirement is that when this value is the same across records, only one of those records should be selected as output. In short, I want to remove duplicate values even though the keys are distinct.

Can anybody tell me how to approach this?

I have the following approach in mind:

a. On the reducer side, I can add the values to a set, which removes duplicates automatically.
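As a rough illustration of that set-based idea, here is a minimal plain-Python sketch (not Hadoop code). It assumes all records reach a single reducer and that the possibly repeated field is the second comma-separated item; the function name `dedupe_first_seen` and the `field_index` parameter are hypothetical.

```python
def dedupe_first_seen(records, field_index=1):
    """Keep the first record seen for each distinct value of the chosen field.

    Assumption: `records` is an iterable of (key, value) pairs and the
    field to compare sits at `field_index` in the comma-separated value.
    """
    seen = set()   # values of the compared field already emitted
    kept = []
    for key, value in records:
        shared = value.split(",")[field_index]
        if shared not in seen:
            seen.add(shared)
            kept.append((key, value))
    return kept


records = [
    ("k1", "a1,b1,c1,d1"),
    ("k2", "a2,b1,c2,d2"),
    ("k3", "a3,b1,c3,d3"),
    ("k4", "a4,b1,c4,d4"),
    ("k5", "a5,b1,c5,d5"),
]
print(dedupe_first_seen(records))
```

Note that a set of this kind keeps whichever record arrives first, so on its own it does not guarantee that the record with the latest key survives.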

But I want to know whether the MapReduce framework itself offers a way to identify duplicate values and remove them.

Desired output:

k5 a5,b1,c5,d5

The output should keep the latest key, that is, the key of the record in which the duplicate value last occurred.
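One possible way to get this from the framework side is to let the mapper emit the shared field as the new key, so that the shuffle groups all records with the same value together, and then have the reducer keep only the record with the greatest original key. The sketch below simulates the three phases in plain Python rather than using the Hadoop API; the function names, the `FIELD_INDEX` constant, and the reading of "latest" as the greatest key under string comparison are all assumptions.

```python
from collections import defaultdict

# Assumption: the possibly repeated field (b1) is the second item in the value.
FIELD_INDEX = 1


def map_record(key, value):
    """Map phase: re-key each record by the field that may repeat."""
    shared = value.split(",")[FIELD_INDEX]
    return shared, (key, value)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group map output by its key."""
    groups = defaultdict(list)
    for shared, record in pairs:
        groups[shared].append(record)
    return groups


def reduce_group(records):
    """Reduce phase: keep the record with the greatest original key.

    Plain string comparison is used, so "k10" would sort before "k5";
    a real job would need an ordering that matches its key format.
    """
    return max(records, key=lambda record: record[0])


def run(records):
    groups = shuffle(map_record(key, value) for key, value in records)
    return sorted(reduce_group(group) for group in groups.values())


records = [
    ("k1", "a1,b1,c1,d1"),
    ("k2", "a2,b1,c2,d2"),
    ("k3", "a3,b1,c3,d3"),
    ("k4", "a4,b1,c4,d4"),
    ("k5", "a5,b1,c5,d5"),
]
print(run(records))  # [('k5', 'a5,b1,c5,d5')]
```

Because the grouping is done by the shuffle, this variant does not need every record to pass through one reducer; each reducer only sees the records that share a given field value.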

Thanks in advance.

Source: https://stackoverflow.com/questions/38065737/how-to-remove-duplicate-values-using-mapreduce

Tags

Hadoop

MapReduce
