How can I one hot encode multiple variables with big data in R?

Posted by *爱你&永不变心* on 2019-12-01 06:09:32

Question


I currently have a data frame with 260,000 rows and 50 columns, where 3 columns are numeric and the rest are categorical. I want to one-hot encode the categorical columns in order to perform PCA and use regression to predict the class. How can I go about accomplishing the below example in R?

Example:
V1 V2 V3 V4 V5 .... VN-1 VN

to

V1_a V1_b V2_a V2_b V2_c V3_a V3_b and so on

Answer 1:


You can use model.matrix (base R) or sparse.model.matrix (from the Matrix package). Something like this:

sparse.model.matrix(~. -1, data = your_data)

The ~. tells R that your entire table (the .) is the right-hand side of some hypothetical model, and the -1 says to leave out the intercept. Without the -1, your first column will be a vector of 1s.
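One detail worth checking on a small example: with R's default treatment contrasts, the formula above keeps every level only for the first factor, and drops the first level of each later factor. The sketch below uses a hypothetical toy data frame (the names toy, factor_cols and full_contrasts are made up for illustration) to show the default result and one common way to keep an indicator column for every level, by passing identity contrasts through contrasts.arg. It uses base model.matrix for the second step; the same idea should carry over to sparse.model.matrix, but that is an assumption to verify on your own data.

```r
library(Matrix)  # provides sparse.model.matrix()

# Hypothetical toy data: two categorical columns and one numeric column
toy <- data.frame(
  V1 = factor(c("a", "b", "a", "b")),
  V2 = factor(c("a", "b", "c", "a")),
  V3 = c(1.5, 2.0, 3.5, 4.0)
)

# Default contrasts: V1 keeps both levels, V2 loses its first level
X_default <- sparse.model.matrix(~ . - 1, data = toy)
colnames(X_default)

# Identity contrasts for every factor: one indicator column per level
factor_cols <- names(toy)[sapply(toy, is.factor)]
full_contrasts <- lapply(toy[factor_cols], contrasts, contrasts = FALSE)
X_full <- model.matrix(~ . - 1, data = toy, contrasts.arg = full_contrasts)
colnames(X_full)
```

For PCA or regression the dropped level is often what you want, since the full set of indicator columns is linearly dependent; the second form is useful when a true one-hot layout is required.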




Answer 2:


I don't really know what you mean by "hot encode".

Here's an example of using dplyr and tidyr to spread out the categorical variable iris$Species into three separate columns:

library(dplyr)
library(tidyr)

df <- iris %>%
  mutate(id = rownames(.)) %>%  # unique identifier to prevent duplicate rows when spreading
  mutate(val = 1) %>%           # give the categorical variable a value of 1
  spread(Species, val)          # spread out each level of iris$Species as columns

 df[76:80,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width  id setosa versicolor virginica
76          5.8         2.7          4.1         1.0  68     NA          1        NA
77          5.8         2.7          5.1         1.9 102     NA         NA         1
78          5.8         2.7          5.1         1.9 143     NA         NA         1
79          5.8         2.8          5.1         2.4 115     NA         NA         1
80          5.8         4.0          1.2         0.2  15      1         NA        NA
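The output above has NA where a one-hot encoding would normally have 0. A minimal sketch of the same idea with zeros filled in, assuming a tidyr version that provides pivot_wider with a scalar values_fill (the name one_hot_species is made up for illustration):

```r
library(dplyr)
library(tidyr)

one_hot_species <- iris %>%
  mutate(id = row_number(), val = 1) %>%  # id keeps rows distinct, val marks the observed level
  pivot_wider(names_from = Species,       # one new column per level of Species
              values_from = val,
              values_fill = 0)            # levels not observed in a row become 0 instead of NA

head(one_hot_species)
```

With the older spread() function, passing fill = 0 should have the same effect.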



Answer 3:


Basically a one-liner with data.table and mltools:

# data.table with 125 variables:
dt_1h <- one_hot(dt)

# MD5 for checking reproducibility:
> digest::digest(dt_1h, algo = "md5")
[1] "f1eb1c1e2d5d94b709101557c9ed8d0d"

Data

library(data.table)
library(mltools)
set.seed(1701)
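df <- data.frame(matrix(sample(c(LETTERS[1:26]),
                               260000*3, replace = TRUE), ncol = 3),
                 matrix(rnorm(260000*47), ncol = 47),
                 stringsAsFactors = TRUE)  # one_hot() encodes factor columns; R >= 4.0 no longer creates them by default
dt <- as.data.table(df)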
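To see what one_hot produces without building the full 260,000-row table, here is a small sketch on a hypothetical toy data.table (toy_dt and toy_1h are made-up names). It assumes the categorical columns are stored as unordered factors, which is what one_hot looks for by default; character columns would need converting with factor() first.

```r
library(data.table)
library(mltools)

# Hypothetical toy data: two factor columns and one numeric column
toy_dt <- data.table(
  V1 = factor(c("a", "b", "a")),
  V2 = factor(c("a", "b", "c")),
  V3 = c(0.1, 0.2, 0.3)
)

# Each factor column is replaced by one indicator column per level,
# named <column>_<level>; the numeric column is left untouched
toy_1h <- one_hot(toy_dt)
names(toy_1h)
```

The resulting names follow the V1_a, V1_b, V2_a pattern asked for in the question.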


Source: https://stackoverflow.com/questions/43578647/how-can-i-one-hot-encode-multiple-variables-with-big-data-in-r
