Pig programming: using SPLIT on the equivalent of GROUP BY ... HAVING COUNT(*)

Submitted by 我与影子孤独终老i on 2019-12-08 13:46:40

Question


Input file is:

2, cornflakes, Regular,General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

filter3 = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);

SPLIT filter3 INTO filter4 IF (FOREACH (filter3 GROUP BY item) GENERATE group, COUNT(item < 3)), filter6_pass OTHERWISE;

It is like a SQL query with GROUP BY item HAVING COUNT(*) < 3.

The desired output is:

4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

Answer 1:


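Group by item, compute the count for each group, and filter on that count. Then join the surviving items back to the original relation to recover the full rows. With the sample data, Total > 3 keeps only chocolate syrup, which has four rows; cornflakes has two.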

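-- The sample rows are comma-separated, so load with ','; keep '\t' only if the real file is tab-delimited.
A = LOAD 'location_of_file' USING PigStorage(',') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);
B = GROUP A BY item;
-- One row per item, with the number of rows in that group.
C = FOREACH B GENERATE group, COUNT(A.item) AS Total;
D = FILTER C BY Total > 3;
-- Join back to A to recover the full rows, then keep only A's five columns.
E = JOIN A BY item, D BY $0;
F = FOREACH E GENERATE $0..$4;
DUMP F;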
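
The same counting idea can also feed SPLIT directly, which is what the question asked for. The following is a minimal sketch, not a tested script. It assumes the input is comma-delimited, as the sample rows suggest, and that the Pig version in use supports the OTHERWISE branch of SPLIT. The relation names filter4 and filter6_pass and the threshold of 3 come from the question; total and result are hypothetical names introduced here.

```pig
-- Assumption: comma-delimited input; switch to '\t' if the real file uses tabs.
A = LOAD 'location_of_file' USING PigStorage(',') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);
B = GROUP A BY item;
-- FLATTEN(A) re-expands each group's bag, so every original row carries its group's row count.
C = FOREACH B GENERATE FLATTEN(A), COUNT(A) AS total;
-- Rows whose item occurs fewer than 3 times go to filter4; all others go to filter6_pass.
SPLIT C INTO filter4 IF total < 3, filter6_pass OTHERWISE;
-- Drop the helper count column, keeping the five original columns.
result = FOREACH filter6_pass GENERATE $0..$4;
DUMP result;
```

Flattening the grouped bag avoids the extra JOIN, because the count is attached to each row in the same FOREACH. With the sample data, the cornflakes rows (two per item) should land in filter4 and the chocolate syrup rows (four per item) in filter6_pass.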



Source: https://stackoverflow.com/questions/43355010/pig-programming-to-use-split-on-group-by-having-count
