One-hot encode each column in an integer matrix in R

Submitted by 雨燕双飞 on 2020-04-30 06:46:37

Question


I am having trouble converting a matrix to a one-hot encoding in R. I implemented this in Matlab, but I have difficulty handling the object in R. Here I have an object of type 'matrix'.

I would like to apply one-hot encoding to this matrix. I have a problem with the column names.

Here is an example:

> set.seed(4)
> t <- matrix(floor(runif(10, 1,9)),5,5)

      [,1] [,2] [,3] [,4] [,5]
[1,]    5    3    5    3    5
[2,]    1    6    1    6    1
[3,]    3    8    3    8    3
[4,]    3    8    3    8    3
[5,]    7    1    7    1    7
> class(t)
[1] "matrix"

Expecting:

      1_1 1_3 1_5 1_7  2_1 2_3 2_6 2_8 ...
[1,]   0   0   1   0    0   1   0   0  ...
[2,]   1   0   0   0    0   0   1   0  ...
[3,]   0   1   0   0    0   0   0   1  ...
[4,]   0   1   0   0    0   0   0   1  ...   
[5,]   0   0   0   1    1   0   0   0  ...

I tried the following, but the matrix remains the same.

library(data.table)
library(mltools)
test_table <- one_hot(as.data.table(t))

Any suggestions would be very much appreciated.


Answer 1:


There are probably more concise ways to do this, but this should work (and is at least easy to read and understand ;)

Suggested solution using base R and a double loop:

set.seed(4)  
t <- matrix(floor(runif(10, 1,9)),5,5)

# initialize result object
#
t_hot <- NULL

# for each column in original matrix
#
for (col in seq_along(t[1,])) {
  # for each unique value in this column (sorted so the resulting
  # columns appear in order)
  #
  for (val in sort(unique(t[, col]))) {
    t_hot <- cbind(t_hot, ifelse(t[, col] == val, 1, 0))
    # make name for this column
    #
    colnames(t_hot)[ncol(t_hot)] <- paste0(col, "_", val)
  }
}

This returns:

     1_1 1_3 1_5 1_7 2_1 2_3 2_6 2_8 3_1 3_3 3_5 3_7 4_1 4_3 4_6 4_8 5_1 5_3 5_5 5_7
[1,]   0   0   1   0   0   1   0   0   0   0   1   0   0   1   0   0   0   0   1   0
[2,]   1   0   0   0   0   0   1   0   1   0   0   0   0   0   1   0   1   0   0   0
[3,]   0   1   0   0   0   0   0   1   0   1   0   0   0   0   0   1   0   1   0   0
[4,]   0   1   0   0   0   0   0   1   0   1   0   0   0   0   0   1   0   1   0   0
[5,]   0   0   0   1   1   0   0   0   0   0   0   1   1   0   0   0   0   0   0   1
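
A more compact variant of the same idea, offered as a sketch rather than a definitive implementation. The helper name one_hot_matrix and the variable m are made up for illustration (m is used instead of t so that base R's t() function is not masked). It builds one 0/1 block per column with outer() and binds the blocks together:

```r
set.seed(4)
m <- matrix(floor(runif(10, 1, 9)), 5, 5)

# Hypothetical helper: one-hot encode every column of a numeric matrix
one_hot_matrix <- function(m) {
  blocks <- lapply(seq_len(ncol(m)), function(j) {
    vals <- sort(unique(m[, j]))
    # outer() compares each row value with each unique value;
    # adding 0 turns the logical matrix into a numeric 0/1 matrix
    block <- outer(m[, j], vals, "==") + 0
    colnames(block) <- paste0(j, "_", vals)
    block
  })
  do.call(cbind, blocks)
}

m_hot <- one_hot_matrix(m)
```

Column names follow the same <column>_<value> pattern as the loop version, and every row contains exactly one 1 per original column.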



Answer 2:


one_hot only encodes columns of class "factor", so your data table must contain at least one such column (variable); otherwise it is returned unchanged. Try this:

> t <- data.table(t)
> t[,V1:=factor(V1)]
> one_hot(t)
   V1_1 V1_3 V1_5 V1_7 V2 V3 V4 V5
1:    0    0    1    0  3  5  3  5
2:    1    0    0    0  6  1  6  1
3:    0    1    0    0  8  3  8  3
4:    0    1    0    0  8  3  8  3
5:    0    0    0    1  1  7  1  7

But I have read that the dummyVars function from the caret package is quicker if your matrix is large.
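
For reference, a minimal sketch of the dummyVars route, assuming the caret package is installed. The variable names m, df, dv and m_hot are illustrative only, and no speed claim is tested here:

```r
library(caret)

set.seed(4)
m <- matrix(floor(runif(10, 1, 9)), 5, 5)

# dummyVars works on data frames; turn every column into a factor first
df <- as.data.frame(m)
df[] <- lapply(df, factor)

# Build the encoder from all columns, then apply it to the data
dv <- dummyVars(~ ., data = df)
m_hot <- predict(dv, newdata = df)
```

The result is a numeric matrix with one column per (variable, level) pair.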

Edit: Forgot to set the seed. :P

And a quick way to factor all variables in a data table:

t.f <- t[, lapply(.SD, as.factor)]
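
Putting the two steps together gives something close to the output the question asks for. This is a sketch assuming data.table and mltools are installed; the names m, dt, dt_f and dt_hot are illustrative, and the resulting columns are named like V1_1 rather than 1_1:

```r
library(data.table)
library(mltools)

set.seed(4)
m <- matrix(floor(runif(10, 1, 9)), 5, 5)

dt <- as.data.table(m)                 # columns are named V1 ... V5
dt_f <- dt[, lapply(.SD, as.factor)]   # every column becomes a factor
dt_hot <- one_hot(dt_f)                # one 0/1 column per (column, level) pair
```

If a plain matrix is needed afterwards, as.matrix(dt_hot) converts the result back.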


Source: https://stackoverflow.com/questions/60263515/one-hot-encode-each-column-in-a-int-matrix-in-r
