Group by X OR Y in Pig

假如想象 submitted on 2020-01-06 23:43:47

Question


I am processing a large amount of data with Pig and I need to group records by one field OR another. Note that this is not the classic GROUP BY X AND Y: two records must end up in the same group if they have the same value for attribute X OR for attribute Y.

For example, given this dataset:

1, a, 'r1'
2, b, 'r2'
3, c, 'r3'
4, a, 'r4'
3, d, 'r5'
5, c, 'r6'
5, e, 'r7'

The result of grouping by first OR second field should be:

{(1, a, 'r1'), (4, a, 'r4')}
{(2, b, 'r2')}
{(3, c, 'r3'), (3, d, 'r5'), (5, c, 'r6'), (5, e, 'r7')}

(1) 'r1' and 'r4' are grouped because they have the same value for the second attribute.

(2) The record 'r2' shares neither its first nor its second field with any other record.

(3) Finally, 'r3' shares the value of its first attribute with 'r5' and the value of its second attribute with 'r6', and 'r6' shares the value of its first attribute with 'r7'. Note that 'r3' and 'r7' have no field in common, but by chaining records they end up in the same group.
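To make the expected semantics concrete: because of the chaining in (3), the grouping amounts to finding the connected components of a graph in which two records are linked when they share the first or the second field. The following is only a minimal Python sketch of that idea, run outside Pig. It is not a Pig Latin solution and not the Java code mentioned in this question; the function name `group_by_x_or_y` and the use of union-find are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_x_or_y(records):
    """Group records sharing field 0 OR field 1, transitively (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so groups keep input order.
            parent[max(ri, rj)] = min(ri, rj)

    # Map each (field, value) pair to the first record index that carried it.
    # Tagging with "x"/"y" keeps the two fields in separate namespaces.
    first_seen = {}
    for idx, (x, y, _name) in enumerate(records):
        for key in (("x", x), ("y", y)):
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[find(idx)].append(rec)
    return [groups[root] for root in sorted(groups)]


records = [
    (1, "a", "r1"), (2, "b", "r2"), (3, "c", "r3"), (4, "a", "r4"),
    (3, "d", "r5"), (5, "c", "r6"), (5, "e", "r7"),
]
for group in group_by_x_or_y(records):
    print(group)
```

On the example dataset this prints three groups, in the same order as the expected result: 'r1' with 'r4', then 'r2' alone, then 'r3', 'r5', 'r6' and 'r7' together.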

I have solved this problem in Java (outside Pig), and I know how to do it with MapReduce. However, in order to learn, I would like to know how to do it in Pig Latin, or with any library that could help me with this.

Source: https://stackoverflow.com/questions/17149663/group-by-x-or-y-in-pig
