Pig: how to filter distinct couples (pairs)

Submitted by 醉酒当歌 on 2019-12-11 03:15:02

Question


I am new to Pig. I have a Pig script that generates tab-separated pairs of two elements, one pair per line. For example:

John   Paul
Tom    Nik
Mark   Bill
Tom    Nik
Paul   John

I need to filter out duplicate combinations. If I use DISTINCT, only the repeated "Tom Nik" entry is removed. The result is:

John   Paul
Tom    Nik
Mark   Bill
Paul   John

The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is there a way to remove such permuted combinations?


Answer 1:


I'm not sure how string comparisons are implemented in Pig, but it may be worthwhile to try something like:

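-- A is your input: one pair per line, with fields $0 and $1
-- Put the 'smaller' name first so that both orderings of a pair look the same
B = FOREACH A GENERATE FLATTEN(($0 < $1 ? ($0, $1) : ($1, $0)));
-- Remove the rows that are now identical
C = DISTINCT B;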

By ordering the names within each pair so that the 'smaller' one always appears first, both John Paul and Paul John end up in the same order, so DISTINCT eliminates one of them.

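However, this approach depends entirely on how the string comparison is implemented. For example, if it compared only by length, John and Paul (both four characters long) would tie, neither would count as 'smaller', and the John Paul case would not be filtered correctly.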
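To make the reasoning above concrete, here is a minimal sketch in Python rather than Pig, so it says nothing about how Pig itself compares strings. The function name `dedupe` and the `less` parameter are hypothetical, introduced only for illustration. It applies the same "smaller name first, then deduplicate" idea to the sample data from the question, once with Python's lexicographic string comparison and once with a length-only comparison.

```python
# Illustration in Python, not Pig: order each pair, then deduplicate.
# `dedupe` and `less` are hypothetical names used only for this sketch.
pairs = [
    ("John", "Paul"),
    ("Tom", "Nik"),
    ("Mark", "Bill"),
    ("Tom", "Nik"),
    ("Paul", "John"),
]

def dedupe(pairs, less):
    """Order each pair with `less`, then keep the first occurrence of each result."""
    seen = set()
    result = []
    for a, b in pairs:
        # Mirrors the Pig expression ($0 < $1 ? ($0, $1) : ($1, $0)).
        key = (a, b) if less(a, b) else (b, a)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

# Lexicographic comparison (Python's default for str): permutations collapse.
by_text = dedupe(pairs, lambda a, b: a < b)

# Length-only comparison: "John" and "Paul" tie, so both orders survive.
by_length = dedupe(pairs, lambda a, b: len(a) < len(b))

print(by_text)
print(by_length)
```

With the lexicographic comparison, three pairs remain. With the length-only comparison, four remain, because the two orderings of John and Paul are merely swapped rather than merged. Note that ordering the pair can change how a row is printed: "Tom Nik" comes out as "Nik Tom".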



Source: https://stackoverflow.com/questions/22812857/pig-how-to-filter-distinct-couples-pairs
