Question
I have a data set like the one below:
Key Value
k1 a1,b1,c1,d1
k2 a2,b1,c2,d2
k3 a3,b1,c3,d3
k4 a4,b1,c4,d4
k5 a5,b1,c5,d5
In the data set above, the keys are distinct, but one of the comma-separated fields in the values, b1, is common to every record. My requirement is that when that field is the same, only one of those records should be selected as the output. In short, I want to remove duplicate values even though the keys are distinct.
Can anybody tell me how to approach this?
I have the following approach in mind:
a. On the reducer side, I can add the values to a set, which removes duplicates automatically.
But I want to know whether the MapReduce framework itself offers a way to identify duplicate values and remove them.
Desired output:
k5 a5,b1,c5,d5
The output should keep the latest key, i.e. the key of the record in which the duplicate value last occurred.
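To make the requirement concrete, here is a minimal sketch of the behaviour I am after. It is written in plain Java and only simulates the map, shuffle and reduce steps with collections; it uses no Hadoop classes. The names `DedupSketch`, `mapAndShuffle`, `reduce` and `dedupeByField` are made up for illustration. It assumes the shared field sits at a known position (index 1 here) and that "latest" means the greatest key in string order, since the order in which values reach a reducer cannot be relied on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of map -> shuffle -> reduce; not Hadoop API code.
class DedupSketch {

    // Extracts the original key (the text before the first whitespace run).
    static String keyOf(String line) {
        return line.split("\\s+", 2)[0];
    }

    // "Map" + "shuffle": re-key every record by the field that may be duplicated
    // and group the original lines under that field value.
    static Map<String, List<String>> mapAndShuffle(List<String> lines, int fieldIndex) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            String[] keyAndValue = line.split("\\s+", 2);
            if (keyAndValue.length < 2) {
                continue; // skip malformed records without a value part
            }
            String[] fields = keyAndValue[1].split(",");
            if (fieldIndex >= fields.length) {
                continue; // skip records that lack the chosen field
            }
            groups.computeIfAbsent(fields[fieldIndex], k -> new ArrayList<>()).add(line);
        }
        return groups;
    }

    // "Reduce": for each group keep only the record with the greatest key.
    // Assumption: keys compare as plain strings, so "k10" sorts before "k5".
    static List<String> reduce(Map<String, List<String>> groups) {
        List<String> out = new ArrayList<>();
        for (List<String> records : groups.values()) {
            String latest = records.get(0);
            for (String record : records) {
                if (keyOf(record).compareTo(keyOf(latest)) > 0) {
                    latest = record;
                }
            }
            out.add(latest);
        }
        return out;
    }

    static List<String> dedupeByField(List<String> lines, int fieldIndex) {
        return reduce(mapAndShuffle(lines, fieldIndex));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "k1 a1,b1,c1,d1",
                "k2 a2,b1,c2,d2",
                "k3 a3,b1,c3,d3",
                "k4 a4,b1,c4,d4",
                "k5 a5,b1,c5,d5");
        for (String line : dedupeByField(input, 1)) {
            System.out.println(line);
        }
    }
}
```

In a real job the same idea would presumably mean having the mapper emit the shared field as the output key and the original record as the value, so that all records sharing that field reach the same reduce call.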
Thanks in advance.
Source: https://stackoverflow.com/questions/38065737/how-to-remove-duplicate-values-using-mapreduce