Clustering a very large dataset in R

予麋鹿 2020-12-09 05:40

I have a dataset consisting of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers; however, if I'm trying the classica

3 answers
  •  遥遥无期
    2020-12-09 06:01

    You can use kmeans, which normally copes well with this amount of data, to compute a large number of centers (1000, 2000, ...) and then run a hierarchical clustering on the coordinates of these centers. That way the distance matrix is much smaller.
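To put rough numbers on "smaller": a full distance matrix over n observations holds n(n-1)/2 pairwise distances. The sketch below assumes 8 bytes per stored distance (R doubles); the helper names `n_pairs` and `dist_gib` are made up for this illustration.

```r
# Number of pairwise distances a full distance matrix holds for n observations
n_pairs <- function(n) n * (n - 1) / 2

# Approximate storage in GiB, assuming 8 bytes per distance
dist_gib <- function(n) n_pairs(n) * 8 / 2^30

n_pairs(70000)   # all 70,000 points: about 2.4 billion distances
dist_gib(70000)  # roughly 18 GiB
n_pairs(1000)    # 1000 kmeans centers: 499,500 distances, a few MiB
```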

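    ## Example
    # Data: 70,000 points in 2 dimensions, drawn from two Gaussian clouds
    set.seed(1)  # added for reproducibility; any seed works
    x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
               matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    
    # Hierarchical clustering (CAH) without kmeans: may fail or be very slow,
    # since 70,000 rows mean about 2.4 billion pairwise distances
    library(FactoMineR)
    cah.test <- HCPC(as.data.frame(x), graph = FALSE, nb.clust = -1)
    
    # Hierarchical clustering on the kmeans centers: runs quickly
    # (as.data.frame() because HCPC is documented to take a data frame)
    cl <- kmeans(x, 1000, iter.max = 20)
    cah <- HCPC(as.data.frame(cl$centers), graph = FALSE, nb.clust = -1)
    plot.HCPC(cah, choice = "tree")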
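
The example above clusters the centers only. To label the original points, each point can inherit the cluster of its kmeans center. The sketch below shows this on a one-dimensional vector like the one in the question. It is not the HCPC workflow: it swaps in base R `hclust` and `cutree` to avoid extra packages, the data `d` is a simulated stand-in for the real distances, and the choices of 1000 centers, Ward linkage and 5 final clusters are arbitrary assumptions.

```r
set.seed(42)

# Simulated stand-in for the 70,000 distances between 0 and 50 (assumption)
d <- runif(70000, min = 0, max = 50)

# Stage 1: compress the data into 1000 kmeans centers
km <- kmeans(d, centers = 1000, iter.max = 50)

# Stage 2: hierarchical clustering on the centers only
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the tree into 5 groups (arbitrary choice); one label per center
center_cluster <- cutree(hc, k = 5)

# Each original point inherits the label of the center it was assigned to
point_cluster <- unname(center_cluster[km$cluster])

table(point_cluster)
```

kmeans may warn that it did not converge within `iter.max` iterations. For this compression step that is usually acceptable, since the centers only need to summarise the data roughly.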
    
