clustering very large dataset in R


You can use kmeans, which is normally suitable for this amount of data, to compute a large number of centers (1000, 2000, ...) and then perform hierarchical clustering on the coordinates of those centers. That way the distance matrix becomes much smaller.

## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering directly on the data: doesn't necessarily work
# (the full distance matrix gets too large)
library(FactoMineR)
cah.test <- HCPC(x, graph=FALSE, nb.clust=-1)

# Hierarchical clustering on the kmeans centers: works quickly
cl <- kmeans(x, 1000, iter.max=20)
cah <- HCPC(cl$centers, graph=FALSE, nb.clust=-1)
plot(cah, choice="tree")
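If you also need cluster labels for the original observations, you can chain the two assignments: kmeans maps each point to a center, and HCPC stores each center's cluster in the clust column of its data.clust result. A minimal sketch, reusing cl and cah from above:

# Propagate the hierarchical clusters back to the original points:
# cl$cluster maps points to centers, data.clust$clust maps centers to clusters
center.clust <- cah$data.clust$clust
final.clust <- center.clust[cl$cluster]
plot(x, col = as.integer(final.clust), pch = 20, cex = 0.3)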

70000 is not large. It's not small, but it's also not particularly large... The problem is the limited scalability of matrix-oriented approaches.

But there are plenty of clustering algorithms that do not use matrices and do not need O(n^2) (or, even worse, O(n^3)) runtime.

You may want to try ELKI, which has great index support (try the R*-tree with SortTileRecursive bulk loading). The index support makes it a lot faster.

If you insist on using R, at least give kmeans and the fastcluster package a try. K-means has runtime complexity O(n*k*i) (where k is the number of clusters and i the number of iterations); fastcluster has an O(n) memory and O(n^2) runtime implementation of single-linkage clustering, comparable to the SLINK algorithm in ELKI. (R's "agnes" hierarchical clustering needs O(n^3) runtime and O(n^2) memory.)
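To illustrate the memory-saving routine, here is a minimal sketch using fastcluster's hclust.vector, which computes single linkage directly from the coordinates without materializing the n x n distance matrix (reusing x from the example above):

library(fastcluster)
# single-linkage directly on the data points, O(n) memory
hc <- hclust.vector(x, method = "single")
cl2 <- cutree(hc, k = 2)  # cut the dendrogram into 2 clusters
table(cl2)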

Implementation matters. Often, implementations in R aren't the best IMHO, except for core R, which usually at least has competitive numerical precision. But R was built by statisticians, not by data miners. Its focus is on statistical expressiveness, not on scalability. So the authors aren't to blame; R is just the wrong tool for large data.

Oh, and if your data is 1-dimensional, don't use clustering at all. Use kernel density estimation. 1-dimensional data is special: it's ordered. Any good algorithm for breaking 1-dimensional data into intervals should exploit the fact that you can sort the data.
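As a minimal 1-dimensional sketch (the data and variable names here are just illustrative): estimate the density with R's built-in density() and cut the data at the local minima between the modes.

x1 <- c(rnorm(5000), rnorm(5000, mean = 4))  # two 1-d modes
d <- density(x1)                             # kernel density estimate
# local minima: points where the KDE turns from falling to rising
mins <- which(diff(sign(diff(d$y))) == 2) + 1
groups <- cut(x1, breaks = c(-Inf, d$x[mins], Inf))
table(groups)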

Another non-matrix-oriented approach, at least for visualizing clusters in big data, is the largeVis algorithm by Tang et al. (2016). The largeVis R package has unfortunately been orphaned on CRAN due to lack of package maintenance, but a (maintained?) version can still be compiled from its GitHub repository (after installing Rtools) via, e.g.,

library(devtools)     
install_github(repo = "elbamos/largeVis")

A Python version of the package exists as well. The underlying algorithm uses random projection trees and neighbourhood refinement to find the K most similar instances for each observation, and then projects the resulting neighbourhood network into a lower-dimensional space (the dim parameter). It has been implemented in C++ and uses OpenMP (if supported while compiling) for multi-processing; it has thus been sufficiently fast for clustering any of the larger data sets I have tested so far.
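As a rough usage sketch (assuming the GitHub version installed above, and noting that largeVis expects one observation per column, hence the transpose; the K and dim values are just illustrative):

library(largeVis)
# approximate K-nearest-neighbour graph plus a 2-d embedding
vis <- largeVis(t(x), dim = 2, K = 50)
coords <- t(vis$coords)  # one row per observation
plot(coords, pch = 20, cex = 0.3)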
