Question
I'm trying to create a one-hot representation of my data. This is my approach:
data(iris)
iris = as.data.frame(apply(iris, 2, function(x) as.factor(x)))
head(iris)
iris_ohe <- data.frame(model.matrix(~.-1, iris))
head(iris_ohe)
dim(iris_ohe)
The thing is, the data I'm working on has over one million rows, and the encoding produces a matrix with over 100 columns. This is too much for R, and I run out of memory:
Error: cannot allocate vector of size 10204.5 Gb
Is there a better approach I could try?
Answer 1:
Try using mltools::one_hot
require(mltools)
require(data.table)
n <- 1e6
df1 <- data.table(ID = seq_len(n), replicate(99, sample(0:1, n, TRUE)))
# one_hot() only encodes unordered factor columns, so convert them first
cols <- setdiff(names(df1), "ID")
df1[, (cols) := lapply(.SD, as.factor), .SDcols = cols]
one_hot(df1)
No memory issues for me, and it runs almost instantly.
Answer 2:
sparse.model.matrix from the Matrix package is a sparse equivalent of model.matrix and avoids the "cannot allocate vector" problem.
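As a minimal sketch on the iris data from the question (one-hot output is mostly zeros, so a sparse matrix stores only the non-zero entries and stays small even at millions of rows):

```r
library(Matrix)

data(iris)
# Species is already a factor; ~ . - 1 drops the intercept so the factor
# gets one column per level (full one-hot coding). The numeric columns
# pass through unchanged.
X <- sparse.model.matrix(~ . - 1, data = iris)
dim(X)    # 150 rows, 4 numeric columns + 3 Species dummy columns
class(X)  # a sparse "dgCMatrix" from the Matrix package
```

The resulting dgCMatrix can be fed directly to packages such as glmnet and xgboost, so the dense matrix never needs to be materialized.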
Source: https://stackoverflow.com/questions/45764372/efficient-way-to-do-one-hot-encoding-in-r-on-large-data