How to draw the plot of within-cluster sum-of-squares for a cluster?

死守一世寂寞 2020-12-28 11:13

I have a cluster plot made in R, and I want to apply the "elbow criterion" to the clustering using a WSS (within-cluster sum of squares) plot, but I do not know how to draw a WSS plot for a given clustering.

1 answer
  •  無奈伤痛
    2020-12-28 11:58

    If I understand the question correctly, we first need a function to compute the within-cluster sum of squares (WSS):

    wss <- function(d) {
      sum(scale(d, scale = FALSE)^2)
    }
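
    As a quick sanity check, wss() can be compared with a by-hand calculation on a tiny data set. This is only a sketch: the toy data frame and the name manual are made up for illustration and are not part of the original answer.

    ```r
    # Same definition as in the answer: centre each column, then sum the squares.
    wss <- function(d) {
      sum(scale(d, scale = FALSE)^2)
    }

    # Toy data (invented for illustration): the column means are 2 and 4.
    toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

    # By-hand version: squared deviations from each column mean, summed.
    manual <- sum((toy$x - mean(toy$x))^2) + sum((toy$y - mean(toy$y))^2)

    wss(toy)  # 10
    manual    # 10
    ```

    Both values agree because scale(d, scale = FALSE) only subtracts the column means and does not rescale.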
    

    We also need a wrapper for this wss() function:

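    wrap <- function(i, hc, x) {
      cl <- cutree(hc, i)             # cluster membership for i groups
      spl <- split(x, cl)             # split the rows of x (a data frame) by cluster
      total <- sum(sapply(spl, wss))  # WSS of each cluster, summed
      total
    }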
    

    This wrapper takes the following arguments:

    • i the number of clusters to cut the data into
    • hc the hierarchical cluster analysis object
    • x the original data

    wrap then cuts the dendrogram into i clusters, splits the original data into the cluster membership given by cl and computes the WSS for each cluster. These WSS values are summed to give the WSS for that clustering.
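
    The behaviour described above can be checked at the two extremes. The sketch below repeats the two functions so that it runs on its own; the names hc, one_group and singletons are chosen here for illustration. With a single cluster the result should equal the WSS of the whole data set, and with one cluster per row it should be zero.

    ```r
    wss <- function(d) sum(scale(d, scale = FALSE)^2)

    wrap <- function(i, hc, x) {
      cl <- cutree(hc, i)             # cluster membership for i groups
      sum(sapply(split(x, cl), wss))  # per-cluster WSS, summed
    }

    iris2 <- iris[, 1:4]                          # numeric columns only
    hc <- hclust(dist(iris2), method = "ward.D")  # same method as the answer

    one_group  <- wrap(1, hc, iris2)            # everything in one cluster
    singletons <- wrap(nrow(iris2), hc, iris2)  # every row is its own cluster

    # Stops with an error if either check fails.
    stopifnot(
      isTRUE(all.equal(one_group, wss(iris2))),
      isTRUE(all.equal(singletons, 0))
    )
    ```

    Note that split(x, cl) splits by rows only when x is a data frame; for a matrix input, one option would be to convert it with as.data.frame() first.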

    We run all of this using sapply over the number of clusters 1, 2, ..., nrow(data)

    res <- sapply(seq.int(1, nrow(data)), wrap, hc = cl, x = data)
    

    A screeplot can be drawn using

    plot(seq_along(res), res, type = "b", pch = 19)
    

    Here is an example using the well-known Edgar Anderson Iris data set:

    iris2 <- iris[, 1:4]  # drop Species column
    cl <- hclust(dist(iris2), method = "ward.D")
    
    ## Takes a little while as we evaluate all implied clusterings up to 150 groups
    res <- sapply(seq.int(1, nrow(iris2)), wrap, hc = cl, x = iris2)
    plot(seq_along(res), res, type = "b", pch = 19)
    

    This gives:

    [Figure: WSS plotted against the number of clusters, 1 to 150]

    We can zoom in by showing just the first 50 cluster solutions:

    plot(seq_along(res[1:50]), res[1:50], type = "o", pch = 19)
    

    which gives

    [Figure: the same WSS plot restricted to the first 50 cluster counts]
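
    To read an elbow off such a curve numerically, one option (an addition here, not part of the original answer) is to look at how much the WSS falls each time one more cluster is allowed. The sketch below is self-contained; gain is a name invented for illustration, and the cut-off at 10 clusters is arbitrary.

    ```r
    wss <- function(d) sum(scale(d, scale = FALSE)^2)

    wrap <- function(i, hc, x) {
      cl <- cutree(hc, i)
      sum(sapply(split(x, cl), wss))
    }

    iris2 <- iris[, 1:4]
    hc <- hclust(dist(iris2), method = "ward.D")

    # WSS for 1 to 10 clusters (10 is an arbitrary choice for this sketch).
    res <- sapply(1:10, wrap, hc = hc, x = iris2)

    # Reduction in WSS gained by going from k to k + 1 clusters.
    gain <- -diff(res)
    names(gain) <- paste0(1:9, "->", 2:10)
    round(gain, 1)
    ```

    A large gain followed by consistently small ones suggests the elbow; where exactly to draw that line remains a judgment call.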

    You can speed up the main computation step either by running the sapply() call via an appropriate parallelised alternative, or by doing the computation for fewer than nrow(data) clusters, e.g.

    res <- sapply(seq.int(1, 50), wrap, hc = cl, x = iris2) ## 1st 50 groups
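
    For the parallelised route, one possible sketch uses mclapply() from the parallel package that ships with R. The answer does not say which alternative it has in mind, so this choice is an assumption. The core count of 2 is also an assumption, and because forking is not available on Windows the code falls back to a single core there.

    ```r
    library(parallel)

    wss <- function(d) sum(scale(d, scale = FALSE)^2)

    wrap <- function(i, hc, x) {
      cl <- cutree(hc, i)
      sum(sapply(split(x, cl), wss))
    }

    iris2 <- iris[, 1:4]
    hc <- hclust(dist(iris2), method = "ward.D")
    ks <- seq_len(50)  # first 50 cluster counts

    # mclapply() forks worker processes; on Windows only mc.cores = 1 is allowed.
    n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

    res_par <- unlist(mclapply(ks, wrap, hc = hc, x = iris2, mc.cores = n_cores))

    # Sequential version for comparison; the two should agree.
    res_seq <- sapply(ks, wrap, hc = hc, x = iris2)
    all.equal(res_par, res_seq)
    ```

    For a data set as small as iris the forking overhead may outweigh any gain, so this is more likely to pay off with larger data.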
    
