Compute dissimilarity matrix for large data

Submitted by 流过昼夜 on 2019-12-01 06:20:32

Question


I'm trying to compute a dissimilarity matrix from a large data frame with both numerical and categorical features. When I run the daisy function from the cluster package, I get the error message:

Error: cannot allocate vector of size X.

In my case X is about 800 GB. Any idea how I can deal with this problem? It would also be great if someone could show me how to run the function in parallel across multiple cores. Below is a minimal example that computes the dissimilarity matrix on the iris dataset:

library(cluster)
# Gower dissimilarity is used automatically because iris mixes numeric and factor columns
d <- daisy(iris)
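
For a sense of scale, the allocation failure is expected: daisy() materialises the lower triangle of the full n × n dissimilarity matrix in memory. A quick back-of-the-envelope calculation (illustrative numbers, not taken from the question) shows how a request of roughly 800 GB can arise:

n <- 450000                           # illustrative row count, not from the question
approx_gb <- 8 * n * (n - 1) / 2 / 1e9  # lower triangle of doubles, 8 bytes each
approx_gb                             # roughly 810 GB, on the order of the reported 800 GB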

Answer 1:


I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.

I ended up using the k-means algorithm in the h2o package, which runs in parallel and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, standard deviation 1) before passing it to h2o.kmeans, so the clustering algorithm doesn't give extra weight to columns with large nominal differences (it is minimizing a distance calculation). I used the scale() function, as sketched below.
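
For reference, a minimal sketch of that preprocessing step, assuming df is your mixed-type data frame (only the numeric columns are scaled; the factor columns are left for h2o to encode):

# Center and scale the numeric columns only (mean 0, sd 1)
num_cols <- sapply(df, is.numeric)
df[num_cols] <- scale(df[num_cols])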

After installing h2o:

library(h2o)

h2o.init(nthreads = 16, min_mem_size = '150G')   # start a local H2O cluster
h2o.df <- as.h2o(df)                             # df is your (scaled) data frame
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars,  # vars = names of columns to cluster on
                         k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)
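
If you then want the cluster assignments back in R, something along these lines should work (a hedged sketch building on the model fitted above; h2o.predict returns one predicted cluster index per row):

# Pull per-row cluster assignments out of the fitted model
assignments <- as.data.frame(h2o.predict(h2o_kmeans, h2o.df))
head(assignments)   # the "predict" column holds the cluster index for each row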


Source: https://stackoverflow.com/questions/47590287/compute-dissimilarity-matrix-for-large-data
