Generating binary variables in Pig or R

Question

I am working on a design for generating dummy (binary) variables in a Pig script or an R script.

Problem: the input to the Pig script is an arbitrary relation, for example the table below.

    A   B   C
    a1  b1  c1
    a2  b2  c2
    a1  b1  c3

Suppose we have to generate binary columns based on B and C. The output should be:

    A   B   C   B.b1  B.b2  C.c1  C.c2  C.c3
    a1  b1  c1  1     0     1     0     0
    a2  b2  c2  0     1     0     1     0
    a1  b1  c3  1     0     0     0     1

I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF, because the column names are supplied by the user and we do not know in advance how many distinct values, and therefore how many binary columns, each of those columns will produce. Could somebody please suggest a high-level design to achieve this? Is it feasible to do in R, and is there an existing solution for this statistical problem?

Answer 1:

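# Read the example table from the question
DF <- read.table(text="A   B   C
    a1  b1  c1
    a2  b2  c2
    a1  b1  c3", header = TRUE)

# ~ 0 + X drops the intercept, so every level of X gets its own 0/1 column
do.call(cbind, list(DF,
                    model.matrix(~ 0 + B, data = DF),
                    model.matrix(~ 0 + C, data = DF)))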
#   A  B  C Bb1 Bb2 Cc1 Cc2 Cc3
#1 a1 b1 c1   1   0   1   0   0
#2 a2 b2 c2   0   1   0   1   0
#3 a1 b1 c3   1   0   0   0   1
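
The question notes that the columns to expand are supplied by the user and that the number of distinct values is not known in advance. In R this is not an obstacle, because the levels can be discovered at run time. The following is a minimal sketch of that idea in base R, not part of the original answer: `add_dummies` and its `cols` argument are hypothetical names introduced here for illustration, and the `.` separator is chosen only to match the column names in the question.

```r
# Hypothetical helper (an assumption, not from the original answers):
# append one 0/1 column per distinct value of each column named in `cols`.
add_dummies <- function(data, cols) {
  blocks <- lapply(cols, function(col) {
    values <- as.character(data[[col]])
    lv <- sort(unique(values))            # distinct values found at run time
    m <- outer(values, lv, "==") + 0L     # rows x levels matrix of 0/1 indicators
    colnames(m) <- paste(col, lv, sep = ".")
    m
  })
  # Bind the original data frame and all indicator blocks side by side
  do.call(cbind, c(list(data), blocks))
}

# Example data copied from the question
DF <- read.table(text = "A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3", header = TRUE)

result <- add_dummies(DF, c("B", "C"))
print(result)
```

With the example data, the result keeps A, B and C and adds the columns B.b1, B.b2, C.c1, C.c2 and C.c3. Using `outer` rather than `model.matrix` is only a design choice for this sketch: it makes the column naming explicit and avoids building a formula from user-supplied names.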

Answer 2:

You could try cSplit_e from the splitstackshape package in R, where df is the input data frame from the question:

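 library(splitstackshape)
 # Expand B first, then C; mode is passed by name so 'binary' is not taken as the separator
 cSplit_e(cSplit_e(df, 'B', type='character', fill=0, mode='binary'),
          'C', type='character', fill=0, mode='binary')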
 #   A  B  C B_b1 B_b2 C_c1 C_c2 C_c3
 #1 a1 b1 c1    1    0    1    0    0
 #2 a2 b2 c2    0    1    0    1    0
 #3 a1 b1 c3    1    0    0    0    1

Source: https://stackoverflow.com/questions/27693921/generating-binary-variables-in-pig-r

Tags

apache-pig
