K-means with really large matrix

Submitted by  ̄綄美尐妖づ on 2019-11-30 13:09:25

Does it have to be K-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. The implementation linked to should easily scale up to networks with millions of nodes - your example would have 300K nodes, assuming that you have 100K attributes. With this approach, the data will be naturally pruned in the data transformation step - and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications that I have dealt with I used the Pearson or Spearman correlation, and MCL is shipped with software to efficiently perform this computation on large scale data (it can utilise multiple CPUs and multiple machines).
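The transformation described above (compute pairwise correlations, then prune weak similarities to get a sparse network) can be sketched as follows. This is an illustration on a small random matrix, not the MCL tooling itself; the sizes, the 0.5 cutoff, and the use of NumPy's `corrcoef` are assumptions for the example.

```python
# Sketch: turn row vectors into a similarity network by all-pairs Pearson
# correlation, then prune weak edges -- the preprocessing step described above.
# Sizes and cutoff are illustrative; a 300K-node problem needs the dedicated
# (multi-CPU / multi-machine) tooling mentioned in the answer.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 20))        # 50 items, 20 attributes

# All-pairs Pearson correlation between rows -> 50x50 similarity matrix.
corr = np.corrcoef(data)

# Prune: keep only edges at or above a cutoff; everything else becomes 0,
# so the resulting network is sparse.
cutoff = 0.5
adjacency = np.where(corr >= cutoff, corr, 0.0)
np.fill_diagonal(adjacency, 0.0)        # drop self-loops

edges = np.count_nonzero(adjacency) // 2
print(f"kept {edges} edges out of {50 * 49 // 2} possible pairs")
```

The pruned adjacency matrix is what a graph-clustering algorithm such as MCL would take as input.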

There is still an issue with the data size, as most clustering algorithms will require you to perform all pairwise comparisons at least once. Is your data really stored as a giant matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding the smaller elements? Do you have access to more than one machine, so that you can distribute these computations?
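If the answer to the "many zeros" question is yes, a sparse representation avoids materialising the full dense matrix. A minimal sketch, assuming SciPy is available; the sizes and the 1.5 threshold for discarding small elements are made up for the example.

```python
# Sketch: store a mostly-zero matrix in CSR form instead of densely.
# The threshold below stands in for "discarding smaller elements".
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(1)
dense = rng.normal(size=(1000, 500))
dense[np.abs(dense) < 1.5] = 0.0        # discard small entries -> ~13% nonzero

sparse = csr_matrix(dense)

# CSR stores only the nonzero values plus their column indices and row offsets.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

At this sparsity the CSR form is several times smaller, and many clustering tools accept sparse input directly.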

I'll keep the link (it may be useful to some users), but I agree with Gavin's comment! To perform k-means clustering on big data you can use the rxKmeans function implemented in Revolution R Enterprise, a proprietary implementation of R (I know this can be a problem); this function seems to be capable of managing that kind of data.
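For those who want to avoid the proprietary route, a mini-batch variant of k-means is one open-source option: it updates centroids from small random batches, so the whole matrix never has to be processed in one pass. A sketch using scikit-learn (my choice of library and parameters, not something from the answer):

```python
# Sketch: mini-batch k-means as an open-source alternative for large matrices.
# The data here is a small random stand-in for the asker's matrix.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 50))       # stand-in for the large matrix

# Centroids are updated from 1024-row batches rather than the full data set.
km = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
labels = km.fit_predict(X)
print(labels.shape, km.cluster_centers_.shape)
```

For data that does not fit in memory at all, the same estimator's `partial_fit` can be fed one chunk at a time.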

denis

Since we know nothing at all about the data, nor the questioner's goals for it, just a couple of general links:
I. Guyon's video lectures (she has many papers and books too).
feature selection on stats.stackexchange

Check out Mahout; it will do k-means on a large data set:

http://mahout.apache.org/
