Question
I am trying to do some k-means clustering on a very large matrix.
The matrix is approximately 500,000 rows x 4,000 columns, yet very sparse (only a couple of "1" values per row).
The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.
Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.
Many thanks
Answer 1:
The bigmemory package (now a family of packages; see their website) uses k-means as a running example of extended analytics on large data. See in particular the companion package biganalytics, which contains the bigkmeans() function.
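For concreteness, a minimal sketch of that route; the file names are placeholders, and note that a big.matrix is stored densely in a file-backed form rather than as a sparse structure:

library(bigmemory)
library(biganalytics)

# Read the CSV into a file-backed big.matrix so the 500000 x 4000
# matrix lives on disk rather than in RAM ("data.csv" etc. are hypothetical)
x <- read.big.matrix("data.csv", sep = ",", type = "double",
                     backingfile = "data.bin",
                     descriptorfile = "data.desc")

# bigkmeans() is the k-means implementation in biganalytics
fit <- bigkmeans(x, centers = 10, iter.max = 50, nstart = 3)
fit$cluster  # cluster assignment for each row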
Answer 2:
Please check:
library(foreign)
?read.arff
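For example (the file name is a placeholder; read.arff() returns an ordinary data frame):

library(foreign)
dat <- read.arff("mydata.arff")  # hypothetical file; loads an ARFF file as a data.frame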
Cheers.
Answer 3:
sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should be good for R-suitable (i.e., fitting into memory) matrices.
http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
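A hedged sketch of the sparcl route, on a small made-up binary matrix (sparcl works on ordinary in-memory matrices, so the real data would first have to fit in RAM):

library(sparcl)

set.seed(1)
# Toy binary matrix standing in for an in-memory slice of the data
x <- matrix(rbinom(200 * 50, size = 1, prob = 0.05), nrow = 200)

# Choose the L1 bound by permutation, then fit sparse k-means
perm <- KMeansSparseCluster.permute(x, K = 4, nperms = 5)
fit  <- KMeansSparseCluster(x, K = 4, wbounds = perm$bestw)
fit[[1]]$Cs  # cluster assignments for the chosen wbound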
==
For really big matrices, I would try a solution built on Apache Spark's sparse matrices and MLlib; I don't know how experimental it is at this point:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$
https://spark.apache.org/docs/latest/mllib-clustering.html
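If you go that way, here is a rough sketch via the SparkR interface rather than the Scala API linked above; the file path and parameters are assumptions:

library(SparkR)
sparkR.session()

# Spark reads the CSV itself, so the full matrix never enters R's memory
# ("data.csv" is a placeholder)
df <- read.df("data.csv", source = "csv", inferSchema = "true")

# spark.kmeans() wraps MLlib's k-means; cluster on all columns
model <- spark.kmeans(df, ~ ., k = 10, maxIter = 20)
summary(model)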
Answer 4:
There's a dedicated SparseM package for R that can hold such a matrix efficiently. If that doesn't work, I would try moving to a higher-performance language, like C.
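For the loading step, here is a sketch using the closely related Matrix package (a widely used alternative to SparseM) and its triplet constructor; it assumes the nonzero coordinates have been exported as "row,col" pairs of 1-based indices, which is an assumption about the data layout:

library(Matrix)

# Hypothetical triplet file: one "row,col" pair per nonzero entry
trip <- read.csv("nonzeros.csv", header = FALSE,
                 col.names = c("row", "col"))

# Build the 500000 x 4000 matrix, storing only the nonzeros
x <- sparseMatrix(i = trip$row, j = trip$col, x = 1,
                  dims = c(500000, 4000))
format(object.size(x), units = "MB")  # tiny compared with the dense equivalent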
Source: https://stackoverflow.com/questions/3039646/k-means-clustering-in-r-on-very-large-sparse-matrix