R cut dendrogram into groups with minimum size

Posted by 无人久伴 on 2020-01-01 03:22:30


Is there an easy way to calculate the lowest value of h in cut that produces groupings of a given minimum size?

In this example, if I wanted clusters with at least ten members each, I should go with h = 3.80:

# using iris data simply for reproducible example
d <- data.frame(scale(iris[,1:4]))
hc <- hclust(dist(d))

cut(as.dendrogram(hc), h=3.79) # produces 5 groups; group 4 has 7 members

cut(as.dendrogram(hc), h=3.80) # produces 4 groups; no group has <10 members

Since the heights of the splits are given in hc$height, I could create a set of candidate values using hc$height + 0.00001 and then loop through cuts at each of them. However, I don't see how to extract the cluster sizes from the dendrogram class. For example, cut(as.dendrogram(hc), h=3.80)$lower[[1]]$members returns NULL, not 66 as desired.

Please note that this is a simpler question than Cutting dendrogram into n trees with minimum cluster size in R which uses the package dynamicTreeCut; here I am not specifying number of trees, just minimum cluster size. TYVM.


Thanks to @Vlo and @lukeA I'm able to implement a loop. However, I am just posting this as a starting point and am certainly open to a more elegant solution.

unnest <- function(x) { # from Vlo's answer
  if(is.null(names(x))) x
  else c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
}

# candidate cut heights: just above each merge height
cuts <- hc$height + 1e-9

min_size <- 10
smallest <- 0
i <- 0

# raise the cut until the smallest cluster reaches min_size
while(smallest < min_size && i <= length(cuts)){
  h_i <- cuts[i <- i+1]
  if(i > length(cuts)){
    warning("Couldn't find a cluster big enough.")
  } else {
    smallest <- 
      Reduce(min, 
             lapply(X = unnest(cut(as.dendrogram(hc), h=h_i)$lower), 
                    FUN = attr, which = "members") ) # from lukeA's comment
  }
}
h_i # returns desired output: [1] 3.79211
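
As one possibly shorter route (a sketch, not taken from any of the answers here): cutree() works on the hclust object directly, so cluster sizes can be read off with table() instead of unnesting the dendrogram. The helper name min_cut_height is hypothetical, and the sketch assumes hc$height is non-decreasing, which holds for the default complete linkage.

```r
# Same reproducible setup as in the question
d  <- data.frame(scale(iris[, 1:4]))
hc <- hclust(dist(d))

# Hypothetical helper: lowest merge height at which every cluster
# has at least `min_size` members. Assumes hc$height is non-decreasing.
min_cut_height <- function(hc, min_size) {
  for (h in sort(unique(hc$height))) {
    sizes <- table(cutree(hc, h = h))  # members per cluster at this cut
    if (min(sizes) >= min_size) return(h)
  }
  NA_real_  # only reached if min_size exceeds the number of observations
}

h_min <- min_cut_height(hc, min_size = 10)
h_min
table(cutree(hc, h = h_min))
```

The smallest cluster size can only grow as h increases, because a merge never shrinks a cluster, so stopping at the first qualifying height is enough. Note that the value returned is a merge height itself; for cut() on a dendrogram, the loop above adds a small offset to it.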


This feature is available in the dendextend package through the heights_per_k.dendrogram function (which also has a faster C++ implementation when the dendextendRcpp package is loaded).

## Not run: 
library(dendextend)
hc <- hclust(dist(USArrests[1:4,]), "ave")
dend <- as.dendrogram(hc)
heights_per_k.dendrogram(dend)
##       1        2        3        4
##86.47086 68.84745 45.98871 28.36531

As a sidenote, the dendextend package has a cutree.dendrogram S3 method for dendrograms (which works very similarly to cutree for hclust objects).


This doesn't answer the question, but might be useful for members extraction if you decide to loop through the h.

Stealing and modifying some code from here

# Unnest the list/dendrogram structure
unnest <- function(x) {
  if(is.null(names(x))) {
    x
  }
  else {
    c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
  }
}

# Extract the `members` attribute from each dendrogram
lapply(X = unnest(cut(as.dendrogram(hc), h=3.8)), FUN = attr, which = "members")


# Please don't ask me why there are 2 dendrograms stored
# in the `$upper` list while `print` displays one

[1] 2

[1] 2

[1] 66

[1] 11

[1] 24

[1] 49
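
If only the groups below the cut matter, the unnesting step may not be needed. A minimal sketch, assuming the same hc object as in the question, that reads the members attribute straight from the $lower list:

```r
# Same reproducible setup as in the question
d  <- data.frame(scale(iris[, 1:4]))
hc <- hclust(dist(d))

# cut() returns a list with $upper (the tree above h)
# and $lower (a list of the subtrees below h)
lower <- cut(as.dendrogram(hc), h = 3.8)$lower

# "members" is stored as an attribute of each subtree,
# which is why `$members` returns NULL
sizes <- sapply(lower, attr, which = "members")
sizes
min(sizes)  # size of the smallest group at this cut
```

The sizes should add up to the number of rows in the data, which is a quick sanity check that no group was missed.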

