Pig Latin: bag to tuple after GROUP BY

Question

I have the following data with schema (t0: chararray, t1: int, t2: int):

(B,4,2)
(A,2,3)
(A,3,2)
(B,2,2)
(A,1,2)
(B,1,2)

I'd like to generate the following results (grouped by t0 and ordered by t1):

(A, ((1,2),(2,3),(3,2)))
(B, ((1,2),(2,2),(4,2)))

Please note I want only tuples in the second component, not bags. Please help.

Answer 1:

You should be able to do it like this.

-- A: (t0: chararray,t1: int,t2: int)

B = GROUP A BY t0 ;
C = FOREACH B {
    -- Drop t0 (the group key) from A, keeping only t1 and t2.
    projected = FOREACH A GENERATE t1, t2 ;
    -- Now you can order the projection.
    ordered = ORDER projected BY t1 ;
    GENERATE group AS t0, ordered AS vals ;
}

You can read more about nested FOREACH in the Pig documentation.

NOTE/UPDATE: When I originally answered this question, I missed that the asker wanted the output in tuple form. Tuples should only be used when you know the exact number and position of the fields in the tuple. Otherwise the schema will not be defined and it will be very difficult to access the fields, because the entire tuple will be treated as a bytearray and you will have to find and cast everything manually.

If you must do it this way, you cannot do it in pure Pig. You'll have to use some sort of UDF. I would recommend Python.
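A minimal sketch of what such a Python UDF might look like, assuming it is run through Pig's Jython support, where a bag typically arrives as a list of tuples. The function name bag_to_sorted_tuple is made up for illustration. No output schema is declared here, so Pig would treat the result as a bytearray, which is the caveat described above.

```python
# Hypothetical Jython UDF: turn a bag of (t1, t2) rows into one tuple of
# tuples, ordered by t1. Assumes Pig passes the bag as a list of tuples.
def bag_to_sorted_tuple(bag):
    if bag is None:
        return None
    # Copy each row into a plain tuple, then order the rows by their first field.
    rows = [tuple(row) for row in bag]
    rows.sort(key=lambda row: row[0])
    return tuple(rows)


if __name__ == "__main__":
    # The (t1, t2) rows of the question's two groups, in their original order.
    print(bag_to_sorted_tuple([(2, 3), (3, 2), (1, 2)]))
    print(bag_to_sorted_tuple([(4, 2), (2, 2), (1, 2)]))
```

If this were saved as bag_udfs.py (an assumed file name), it could be registered with REGISTER 'bag_udfs.py' USING jython AS bag_udfs; and called on the grouped relation as FOREACH B GENERATE group AS t0, bag_udfs.bag_to_sorted_tuple(A.(t1, t2)) AS vals;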

Answer 2:

Use FOREACH. See the "Nested Projection" section on the PigLatin page: http://wiki.apache.org/pig/PigLatin

Answer 3:

You may try this:

grunt> a_input = Load '/home/training/pig/Join/order_temp.csv' Using PigStorage(',') as (t0:chararray,t1:int,t2:int);

grunt> b = Group (Order a_input by t1) By t0;

Source: https://stackoverflow.com/questions/19948614/latin-pig-bag-to-tuple-after-group-by

Tags

apache-pig