Question
I am working out a design for generating dummy (binary) variables in a Pig script or an R script.
Problem: the input to the Pig script is an arbitrary relation, for example the table below:
A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3
Suppose we have to generate binary columns based on B and C. The output should be:
A B C B.b1 B.b2 C.c1 C.c2 C.c3
a1 b1 c1 1 0 1 0 0
a2 b2 c2 0 1 0 1 0
a1 b1 c3 1 0 0 0 1
I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF, because the column names are supplied by the user and we do not know in advance how many distinct columns need to be generated from the relation. Could somebody please suggest a high-level design to achieve this? Is it feasible to do this in R, and is there an existing solution for this statistical problem?
Answer 1:
DF <- read.table(text="A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3", header = TRUE)
do.call(cbind, list(DF,
model.matrix(~ 0 + B, data = DF),
model.matrix(~ 0 + C, data = DF)))
# A B C Bb1 Bb2 Cc1 Cc2 Cc3
#1 a1 b1 c1 1 0 1 0 0
#2 a2 b2 c2 0 1 0 1 0
#3 a1 b1 c3 1 0 0 0 1
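Since the question says the column names are supplied by the user, the same idea can be wrapped in a small helper. The following is only a sketch under assumptions: `add_dummies` is a hypothetical name, and it builds the indicator columns with `outer()` comparisons rather than `model.matrix`, so that it can name the columns `B.b1`, `B.b2`, ... as in the question without knowing the distinct values in advance.

```r
# Hypothetical helper (not part of any package): add one 0/1 indicator
# column per distinct value of each user-supplied column.
add_dummies <- function(data, cols) {
  blocks <- lapply(cols, function(col) {
    values <- as.character(data[[col]])
    lev <- sort(unique(values))          # distinct values found in the data
    m <- outer(values, lev, "==") + 0    # logical matrix converted to 0/1
    colnames(m) <- paste(col, lev, sep = ".")
    m
  })
  # Bind the original columns and all indicator blocks side by side.
  do.call(cbind, c(list(data), blocks))
}

DF <- read.table(text = "A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3", header = TRUE)

add_dummies(DF, c("B", "C"))
```

For the example data this should give the columns A, B, C, B.b1, B.b2, C.c1, C.c2, C.c3. The number of output columns is only known after scanning the data, which is the same difficulty the question raises for the UDF output schema.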
Answer 2:
You could try cSplit_e from library(splitstackshape) in R:
# df is the example data frame from the question
cSplit_e(cSplit_e(df, 'B', type='character', fill=0, mode='binary'),
         'C', type='character', fill=0, mode='binary')
# A B C B_b1 B_b2 C_c1 C_c2 C_c3
#1 a1 b1 c1 1 0 1 0 0
#2 a2 b2 c2 0 1 0 1 0
#3 a1 b1 c3 1 0 0 0 1
Source: https://stackoverflow.com/questions/27693921/generating-binary-variables-in-pig-r