hadoop pig joining on any matching tuple values

Posted by 前提是你 on 2019-12-25 02:38:11

Question


I'm new to Pig and am trying to use it to process a dataset. I have a set of records that looks like this:

id    elements
--------------
1     ["a","b","c"]
2     ["a","f","g"]
3     ["f","g","h"]

I want to produce pairs of ids whose elements arrays share at least one value. If elements were a single item instead of an array, I could do a simple join like:

A = LOAD 'mydata' ...
B = FOREACH A GENERATE id as id_2, elements as elements_2;
C = JOIN A BY elements, B BY elements_2;

But since elements is an array, this join only matches when the arrays are identical, so it misses records that overlap partially. Any thoughts on how to do this in Pig?

The intended output would give the tuples that have overlap:

(1,2)
(2,3)

Answer 1:


I don't think it's possible to use JOIN for this. One (not so elegant) solution is to CROSS the two relations and then FILTER the result. The FILTER condition could be either a UDF or a REGEX_EXTRACT_ALL call followed by a comparison of the extracted fields. If the array always has exactly 3 elements, I would probably go for the REGEX_EXTRACT_ALL solution.
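
To make the CROSS-then-FILTER idea concrete, here is a minimal sketch of the logic in plain Python rather than Pig. It is only an illustration of what the filter would have to check, not a Pig script: the names has_overlap and overlapping_pairs are hypothetical, and the id_1 < id_2 condition is an assumption, added to drop self-pairs and mirrored duplicates so the result matches the output listed in the question.

```python
from itertools import product

# Sample records copied from the question: (id, elements)
records = [
    (1, ["a", "b", "c"]),
    (2, ["a", "f", "g"]),
    (3, ["f", "g", "h"]),
]


def has_overlap(left, right):
    """Hypothetical filter predicate: True if the two lists share any element."""
    return not set(left).isdisjoint(right)


def overlapping_pairs(rows):
    """Pair every record with every record, then keep the pairs that overlap."""
    crossed = product(rows, rows)  # plays the role of CROSS on the two relations
    return [
        (id_1, id_2)
        for (id_1, elems_1), (id_2, elems_2) in crossed
        # Assumption: id_1 < id_2 removes (1,1)-style self-pairs and
        # mirrored duplicates such as (2,1).
        if id_1 < id_2 and has_overlap(elems_1, elems_2)
    ]


print(overlapping_pairs(records))  # [(1, 2), (2, 3)]
```

In Pig, the has_overlap check is the part that would live in the UDF used by FILTER. Note that the cross product grows quadratically with the number of records, which is why this approach is not very elegant for large datasets.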



Source: https://stackoverflow.com/questions/22498259/hadoop-pig-joining-on-any-matching-tuple-values
