Generating binary variables in Pig/R

谁都会走 submitted on 2020-01-14 06:55:12


I am working on a design for generating dummy (binary) variables in a Pig script or an R script.

Problem: the input to the Pig script is an arbitrary relation, for example the table below:

    A   B   C
    a1  b1  c1
    a2  b2  c2  
    a1  b1  c3

Suppose we have to generate binary columns based on B and C. The output should be:

    A   B   C   B.b1    B.b2    C.c1    C.c2    C.c3
    a1  b1  c1  1       0       1       0       0
    a2  b2  c2  0       1       0       1       0
    a1  b1  c3  1       0       0       0       1

I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF, because the column names are supplied by the user and we don't know in advance how many distinct columns need to be generated for the relation. Could somebody please guide me toward a high-level design to achieve this? Is it feasible to do in R, and is there an existing solution for this statistical problem?
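A rough way to picture the schema problem, sketched in R rather than Pig, is to work in two passes. The first pass collects the distinct values of each chosen column, which fixes the output column names; the second pass emits one 0/1 column per value. The helper names `collect_levels` and `add_indicators` are made up for this illustration, and the `B.b1` naming simply follows the desired output above.

```r
# Input relation from the question.
DF <- data.frame(A = c("a1", "a2", "a1"),
                 B = c("b1", "b2", "b1"),
                 C = c("c1", "c2", "c3"),
                 stringsAsFactors = FALSE)

# Pass 1 (hypothetical helper): distinct values per user-supplied column.
# The result determines the names of the generated columns.
collect_levels <- function(data, cols) {
  lapply(setNames(cols, cols), function(col) sort(unique(data[[col]])))
}

# Pass 2 (hypothetical helper): one 0/1 column per (column, value) pair,
# named like "B.b1".
add_indicators <- function(data, levels_by_col) {
  for (col in names(levels_by_col)) {
    for (value in levels_by_col[[col]]) {
      data[[paste(col, value, sep = ".")]] <- as.integer(data[[col]] == value)
    }
  }
  data
}

schema <- collect_levels(DF, c("B", "C"))
result <- add_indicators(DF, schema)
```

In Pig the same split would presumably mean computing the distinct values in a first step and passing them to the step that generates the columns, so that the column names are known before the indicator columns are produced.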


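    # Read the example relation from the question.
    DF <- read.table(text = "A   B   C
        a1  b1  c1
        a2  b2  c2
        a1  b1  c3", header = TRUE)

    # Build one indicator matrix per column (no intercept) and bind them to DF.
    do.call(cbind, list(DF,
                        model.matrix(~ 0 + B, data = DF),
                        model.matrix(~ 0 + C, data = DF)))
    #    A  B  C Bb1 Bb2 Cc1 Cc2 Cc3
    # 1 a1 b1 c1   1   0   1   0   0
    # 2 a2 b2 c2   0   1   0   1   0
    # 3 a1 b1 c3   1   0   0   0   1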
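Since the question has the user supplying the column names, the `model.matrix` approach might be wrapped in a small function. This is only a sketch: `add_dummies` is a hypothetical helper name, and it assumes every listed column is a character or factor column of the data frame with no missing values (rows with `NA` would be dropped by `model.matrix` and the final `cbind` would then fail).

```r
DF <- read.table(text = "A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3", header = TRUE)

# Hypothetical helper: append indicator columns for each name in `cols`.
add_dummies <- function(data, cols) {
  dummies <- lapply(cols, function(col) {
    # reformulate(col, intercept = FALSE) builds the formula ~ col - 1,
    # which yields one indicator column per distinct value.
    model.matrix(reformulate(col, intercept = FALSE), data = data)
  })
  do.call(cbind, c(list(data), dummies))
}

out <- add_dummies(DF, c("B", "C"))
```

The generated columns keep `model.matrix`'s default naming (`Bb1`, `Cc1`, and so on); renaming them to the `B.b1` style from the question would be a separate step.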


You could try cSplit_e from the splitstackshape package in R:

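    library(splitstackshape)

    # df is the data frame from the question; split B first, then C.
    cSplit_e(cSplit_e(df, 'B', type = 'character', fill = 0, mode = 'binary'),
             'C', type = 'character', fill = 0, mode = 'binary')
    #    A  B  C B_b1 B_b2 C_c1 C_c2 C_c3
    # 1 a1 b1 c1    1    0    1    0    0
    # 2 a2 b2 c2    0    1    0    1    0
    # 3 a1 b1 c3    1    0    0    0    1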

