Hadoop Pig UDF invocation issue

和自甴很熟 提交于 2019-12-10 11:58:28

问题


The following code works quite well, but when I already have two existing bags (with their alias, suppose S1 and S2 for representing two existing bags for two sets), wondering how to call UDF setDifference to generate set differences? I think if I manually construct an additional bag, using my already existing input bags (S1 and S2), it will be additional overhead?

register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();

-- ({(3),(4),(1),(2),(7),(5),(6)} \t {(1),(3),(5),(12)})
A = load 'input.txt' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});

F1 = foreach A generate B1;
F2 = foreach A generate B2;

differenced = FOREACH A {
  -- input bags must be sorted
  sorted_b1 = ORDER B1 by val;
  sorted_b2 = ORDER B2 by val;
  GENERATE setDifference(sorted_b1,sorted_b2);
}

-- produces: ({(2),(4),(6),(7)})
DUMP differenced;

Update:

Question is, suppose I have two bags already, how to call UDF setDifference to get set differences? Do I need to build another super bag which contains the two separate bags? Thanks.

thanks in advance, Lin


回答1:


I don't see any overhead issue with the UDF invocation.

Ref : http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html, we have a example for using SetDifference method.

As per API (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/sets/SetDifference.html) SetDifference method takes bags as input and emits the difference between them.

N.B. Do note that the input bags have to be sorted.

In the example snippet shared, I don't see the need of below code snippet

F1 = foreach A generate B1;
F2 = foreach A generate B2;


来源:https://stackoverflow.com/questions/32429735/hadoop-pig-udf-invocation-issue

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!