Removing duplicates using PigLatin

一向 2020-12-29 13:01

I'm using Pig Latin to filter some records.

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC 

The script should remove the

2 Answers
  •  离开以前
    2020-12-29 13:18

    For your particular example, DISTINCT will not work well, because your output contains all of the input columns ($0, $1, $2) and DISTINCT only removes rows that are identical in every field. You can apply DISTINCT only to a projection that has columns ($0, $2) or ($0), which loses $1.
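
    To make the projection idea concrete, here is a minimal sketch. The path 'input.txt', the tab delimiter, and the field names user, score, and city are hypothetical stand-ins for $0, $1, and $2; none of them come from the question.

    ```pig
    -- Hypothetical load: path, delimiter, and field names are assumptions.
    inpt = LOAD 'input.txt' USING PigStorage('\t')
           AS (user:chararray, score:int, city:chararray);
    -- Project only the columns that should define uniqueness ($0 and $2).
    proj = FOREACH inpt GENERATE user, city;
    -- DISTINCT drops rows that are identical across all projected fields.
    uniq = DISTINCT proj;
    DUMP uniq;
    ```

    On the sample data this would leave four rows (User1/NYC, User1/LA, User2/NYC, User2/DC), in no guaranteed order, with the numeric column gone.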

    To select one record per user (any record), you could use GROUP BY and a nested FOREACH with LIMIT. For example:

    inpt = load '......' ......;
    -- Group all records by the user field ($0).
    user_grp = GROUP inpt BY $0;
    filtered = FOREACH user_grp {
          -- Keep one arbitrary record from each user's bag.
          top_rec = LIMIT inpt 1;
          GENERATE FLATTEN(top_rec);
    };
    

    This approach gives you records that are unique on a subset of fields, and it also lets you control the number of output records per user by changing the LIMIT value.
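
    As a hedged sketch of that control, the variant below keeps up to two records per user and chooses them by sorting inside the nested block. Ordering by the numeric column is an added assumption about which records you want; the path 'input.txt', the tab delimiter, and the field names user, score, and city are hypothetical.

    ```pig
    -- Hypothetical load: path, delimiter, and field names are assumptions.
    inpt = LOAD 'input.txt' USING PigStorage('\t')
           AS (user:chararray, score:int, city:chararray);
    user_grp = GROUP inpt BY user;
    top_two = FOREACH user_grp {
          -- Sort each user's bag so LIMIT keeps the highest scores (assumed goal).
          sorted = ORDER inpt BY score DESC;
          -- Change 2 to the number of records wanted per user.
          top_recs = LIMIT sorted 2;
          GENERATE FLATTEN(top_recs);
    };
    DUMP top_two;
    ```

    With the sample data, this should keep the 9 and 8 records for User1 and both records for User2.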
