How to one-hot-encode factor variables with data.table?

我寻月下人不归 2020-12-05 16:28

For those unfamiliar, one-hot encoding simply refers to converting a column of categories (i.e. a factor) into multiple columns of binary indicator variables where each new

5 Answers
  •  时光取名叫无心
    2020-12-05 16:38

    Here you go:

    # Melt to long form, then count each (variable, value) pair per ID
    dcast(melt(dt, id.vars = 'ID'), ID ~ variable + value, fun.aggregate = length)
    #   ID Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
    #1:  1          0           1         0            0            1              0
    #2:  2          0           0         1            0            0              1
    #3:  3          0           0         1            0            1              0
    #4:  4          1           0         0            0            0              1
    #5:  5          0           1         0            1            0              0
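
For a reproducible run, here is a minimal sketch of the same call. The question's sample table is not shown here, so the `dt` below is an assumption reconstructed from the printed output (including the `cirlce` spelling and an unused `purple` level); the names `long` and `wide` are only illustrative.

```r
library(data.table)

# Assumed sample data, reconstructed from the output shown in this answer
dt <- data.table(
  ID    = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

# Long form: one row per (ID, variable, value)
long <- melt(dt, id.vars = "ID")

# Wide form: one indicator column per observed variable/value pair
wide <- dcast(long, ID ~ variable + value, fun.aggregate = length)
print(wide)
```

Because the indicators are counts produced by `length`, each cell is 0 or 1 only as long as every ID appears once per source column.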
    

    To also get columns for factor levels that are defined but never occur in the data (Color_purple below), you can do the following:

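    # value.factor = TRUE keeps the levels; drop = FALSE keeps the unused ones
    res = dcast(melt(dt, id.vars = 'ID', value.factor = TRUE), ID ~ value,
                drop = FALSE, fun.aggregate = length)
    # Rebuild the column names as <column>_<level>, in level order
    setnames(res, c("ID", unlist(lapply(2:ncol(dt),
                                 function(i) paste(names(dt)[i], levels(dt[[i]]), sep = "_")))))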
    res
    #   ID Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
    #1:  1          0           1         0            0            0            1              0
    #2:  2          0           0         1            0            0            0              1
    #3:  3          0           0         1            0            0            1              0
    #4:  4          1           0         0            0            0            0              1
    #5:  5          0           1         0            0            1            0              0
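
If the renaming step feels fragile, a different approach (not the one used in this answer) is to loop over the factor levels and add one indicator column per level by reference. The sketch below is only an illustration: `one_hot` is a hypothetical helper name, and `dt` is again an assumed reconstruction of the sample data from the output above.

```r
library(data.table)

# Assumed sample data, reconstructed from the output shown in this answer
dt <- data.table(
  ID    = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

# Hypothetical helper: one 0/1 column per level of each factor in `cols`
one_hot <- function(dt, cols) {
  out <- copy(dt)  # work on a copy so the caller's table is not modified
  for (col in cols) {
    x <- out[[col]]
    for (lvl in levels(x)) {
      # NA entries become 0 rather than NA
      set(out, j = paste(col, lvl, sep = "_"),
          value = as.integer(!is.na(x) & x == lvl))
    }
    set(out, j = col, value = NULL)  # drop the original factor column
  }
  out[]
}

res2 <- one_hot(dt, c("Color", "Shape"))
print(res2)
```

Since it iterates over `levels()`, unused levels such as `purple` get an all-zero column without any `drop = FALSE` or `setnames` step.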
    
