Question
I am trying to do some k-means clustering on a very large matrix.
The matrix is approximately 500,000 rows x 4,000 columns, yet very sparse (only a couple of "1" values per row).
The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.
Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.
Many thanks
Answer 1:
The bigmemory package (now a family of packages; see their website) uses k-means as a running example of extended analytics on large data. See in particular the companion package biganalytics, which contains the bigkmeans() function.
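For concreteness, a minimal sketch of that route; the file names are placeholders, and note that a big.matrix is stored densely in a file-backed form rather than as a sparse structure:

library(bigmemory)
library(biganalytics)

# Read the CSV into a file-backed big.matrix so the 500000 x 4000
# matrix lives on disk rather than in RAM ("data.csv" etc. are hypothetical)
x <- read.big.matrix("data.csv", sep = ",", type = "double",
                     backingfile = "data.bin",
                     descriptorfile = "data.desc")

# bigkmeans() is the k-means implementation in biganalytics
fit <- bigkmeans(x, centers = 10, iter.max = 50, nstart = 3)
fit$cluster  # cluster assignment for each row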
Answer 2:
Please check:
library(foreign)
?read.arff
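For example (the file name is a placeholder; read.arff() returns an ordinary data frame):

library(foreign)
dat <- read.arff("mydata.arff")  # hypothetical file; loads an ARFF file as a data.frame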
Cheers.
Answer 3:
sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should be good for R-suitable (i.e., fitting into memory) matrices.
http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
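A hedged sketch of the sparcl route, on a small made-up binary matrix (sparcl works on ordinary in-memory matrices, so the real data would first have to fit in RAM):

library(sparcl)

set.seed(1)
# Toy binary matrix standing in for an in-memory slice of the data
x <- matrix(rbinom(200 * 50, size = 1, prob = 0.05), nrow = 200)

# Choose the L1 bound by permutation, then fit sparse k-means
perm <- KMeansSparseCluster.permute(x, K = 4, nperms = 5)
fit  <- KMeansSparseCluster(x, K = 4, wbounds = perm$bestw)
fit[[1]]$Cs  # cluster assignments for the chosen wbound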
==
For really big matrices, I would try a solution built on Apache Spark's sparse matrices and MLlib; I don't know how experimental it is at this point:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$
https://spark.apache.org/docs/latest/mllib-clustering.html
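If you go that way, here is a rough sketch via the SparkR interface rather than the Scala API linked above; the file path and parameters are assumptions:

library(SparkR)
sparkR.session()

# Spark reads the CSV itself, so the full matrix never enters R's memory
# ("data.csv" is a placeholder)
df <- read.df("data.csv", source = "csv", inferSchema = "true")

# spark.kmeans() wraps MLlib's k-means; cluster on all columns
model <- spark.kmeans(df, ~ ., k = 10, maxIter = 20)
summary(model)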
Answer 4:
There's a dedicated SparseM package for R that can hold such a matrix efficiently. If that doesn't work, I would try moving to a higher-performance language, like C.
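For the loading step, here is a sketch using the closely related Matrix package (a widely used alternative to SparseM) and its triplet constructor; it assumes the nonzero coordinates have been exported as "row,col" pairs of 1-based indices, which is an assumption about the data layout:

library(Matrix)

# Hypothetical triplet file: one "row,col" pair per nonzero entry
trip <- read.csv("nonzeros.csv", header = FALSE,
                 col.names = c("row", "col"))

# Build the 500000 x 4000 matrix, storing only the nonzeros
x <- sparseMatrix(i = trip$row, j = trip$col, x = 1,
                  dims = c(500000, 4000))
format(object.size(x), units = "MB")  # tiny compared with the dense equivalent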
Source: https://stackoverflow.com/questions/3039646/k-means-clustering-in-r-on-very-large-sparse-matrix