Question
I am processing a large amount of data with Pig and I need to group records by one field OR another. Note that this is not the classic GROUP BY X AND Y: two records must end up in the same group if they have the same value for attribute X OR for attribute Y.
For example, given this dataset:
1, a, 'r1'
2, b, 'r2'
3, c, 'r3'
4, a, 'r4'
3, d, 'r5'
5, c, 'r6'
5, e, 'r7'
The result of grouping by first OR second field should be:
{(1, a, 'r1'), (4, a, 'r4')}
{(2, b, 'r2')}
{(3, c, 'r3'), (3, d, 'r5'), (5, c, 'r6'), (5, e, 'r7')}
(1) 'r1' and 'r4' have the same value for the second attribute.
(2) The record 'r2' does not match any other record on the first OR the second field.
(3) Finally, 'r3' shares the value of its first attribute with 'r5', and the value of its second field with 'r6'. And 'r6' shares the value of its first attribute with 'r7'. Note that 'r3' and 'r7' do not have any field in common, but by chaining records they end up in the same group.
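To pin down the semantics I am after: this chaining amounts to finding connected components, where two records are linked whenever they share the first or the second field. Below is a minimal sketch in plain Python (not Pig, and not the Java code I actually wrote), using a union-find structure. The function name `group_by_x_or_y` and the tuple layout are just assumptions for illustration.

```python
from collections import defaultdict


def group_by_x_or_y(records):
    """Group records that share the first OR second field, transitively."""
    # parent[i] is the union-find parent of record i (initially itself).
    parent = list(range(len(records)))

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so groups stay in input order.
            parent[max(ri, rj)] = min(ri, rj)

    # Remember the first record seen for each (field, value) pair and
    # link every later record with the same pair to it.
    first_seen = {}
    for idx, (x, y, _name) in enumerate(records):
        for key in (("x", x), ("y", y)):
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[find(idx)].append(rec)
    return list(groups.values())


records = [
    (1, "a", "r1"),
    (2, "b", "r2"),
    (3, "c", "r3"),
    (4, "a", "r4"),
    (3, "d", "r5"),
    (5, "c", "r6"),
    (5, "e", "r7"),
]
groups = group_by_x_or_y(records)
for group in groups:
    print(group)
```

On the sample dataset this should reproduce the three groups listed above. It only works when everything fits in memory on one machine, which is exactly what I cannot assume with Pig.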
I have solved this problem in Java (outside of Pig), and I know how to do it with MapReduce. But (in order to learn) I would like to know how to do it in Pig Latin, or with any library that could help me with this.
Source: https://stackoverflow.com/questions/17149663/group-by-x-or-y-in-pig