Question
I'm trying to extract a classification from a dendrogram in R that I've cut at a certain height. This is easy to do with cutree on an hclust object, but I can't figure out how to do it on a dendrogram object.
Further, I can't just use my clusters from the original hclust, because (frustratingly) the numbering of the classes from cutree is different from the numbering of the classes from cut.
hc <- hclust(dist(USArrests), "ave")
classification<-cutree(hc,h=70)
dend1 <- as.dendrogram(hc)
dend2 <- cut(dend1, h = 70)
str(dend2$lower[[1]]) #group 1 here is not the same as
classification[classification==1] #group 1 here
Is there a way to either get the classifications to map to each other, or alternatively to extract lower branch memberships from the dendrogram object (perhaps with some clever use of dendrapply?) in a format more like what cutree gives?
Answer 1:
I suggest using the cutree function from the dendextend package. It includes a dendrogram method (i.e., dendextend:::cutree.dendrogram).
You can learn more about the package from its introductory vignette.
I should add that while your function (classify) is good, there are several advantages to using cutree from dendextend:
- It also allows you to use a specific k (number of clusters), and not just h (a specific height).
- It is consistent with the result you would get from cutree on hclust (classify will not be).
- It will often be faster.
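To illustrate the first point with base R only: cut() on a dendrogram accepts just a height, so asking for k clusters means first finding a height that yields k branches. Below is a minimal sketch, assuming 2 <= k < n and that the merge heights are non-decreasing with no tie around the cut (this holds for average linkage on this data, but not for every tree). The names k and h_k are illustrative, and this is not necessarily how dendextend does it internally.

```r
# Base R only (stats and datasets are attached by default)
hc <- hclust(dist(USArrests), "ave")

k <- 4                        # desired number of clusters (assumed 2 <= k < n)
n <- length(hc$order)         # number of observations
heights <- sort(hc$height)    # the n - 1 merge heights, smallest first

# After the (n - k) lowest merges exactly k clusters remain, so any height
# strictly between merge n - k and merge n - k + 1 gives k branches.
h_k <- mean(heights[c(n - k, n - k + 1)])

# Both calls should now agree on the number of groups
length(cut(as.dendrogram(hc), h = h_k)$lower)
length(unique(cutree(hc, h = h_k)))
```

With dendextend loaded, cutree(dend1, k = 4) avoids this detour.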
Here are examples for using the code:
# Toy data:
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
# Get the package:
install.packages("dendextend")
library(dendextend)
# Cut the tree:
cutree(dend1,h=70) # it now works on a dendrogram
# It is like using:
dendextend:::cutree.dendrogram(dend1,h=70)
By the way, on the basis of this function, dendextend allows the user to do more cool things, like color branches/labels based on cutting the dendrogram:
dend1 <- color_branches(dend1, k = 4)
dend1 <- color_labels(dend1, k = 5)
plot(dend1)
Lastly, here is some more code for demonstrating my other points:
# This would also work with k:
cutree(dend1,k=4)
# and would give the same result as cutree on hclust:
identical(cutree(hc,h=70) , cutree(dend1,h=70) )
# TRUE
# But this is not the case for classify:
identical(classify(dend1,70) , cutree(dend1,h=70) )
# FALSE
install.packages("microbenchmark")
require(microbenchmark)
microbenchmark(classify = classify(dend1,70),
cutree = cutree(dend1,h=70) )
# Unit: milliseconds
#      expr      min       lq   median       uq       max neval
#  classify  9.70135  9.94604 10.25400 10.87552  80.82032   100
#    cutree 37.24264 37.97642 39.23095 43.21233 141.13880   100
# 4 times faster for this tree (it will be more for larger trees)
# To be exact: if I force cutree.dendrogram not to go through hclust (which can happen for "weird" trees), the two run at similar speed:
microbenchmark(classify = classify(dend1,70),
cutree = cutree(dend1,h=70, try_cutree_hclust = FALSE) )
# Unit: milliseconds
#      expr       min        lq    median       uq      max neval
#  classify  9.683433  9.819776  9.972077 10.48497 29.73285   100
#    cutree 10.275839 10.419181 10.540126 10.66863 16.54034   100
If you are thinking of ways to improve this function, please patch it through here:
https://github.com/talgalili/dendextend/blob/master/R/cutree.dendrogram.R
I hope you, or others, will find this answer helpful.
Answer 2:
I ended up creating a function to do it using dendrapply. It's not elegant, but it works:
classify <- function(dendrogram, height) {
  # Mini-function to use with dendrapply to return tip labels
  members <- function(n) {
    labels <- c()
    if (is.leaf(n)) {
      a <- attributes(n)
      labels <- c(labels, a$label)
    }
    labels
  }
  dend2 <- cut(dendrogram, height)  # the cut dendrogram object
  branchesvector <- c()
  membersvector <- c()
  for (i in seq_along(dend2$lower)) {  # for each lower tree resulting from the cut
    memlist <- unlist(dendrapply(dend2$lower[[i]], members))  # get the tip labels
    branchesvector <- c(branchesvector, rep(i, length(memlist)))  # add the lower tree identifier to a vector
    membersvector <- c(membersvector, memlist)  # add the tip labels to a vector
  }
  out <- as.integer(branchesvector)  # make the output a named integer vector, to match cutree() output
  names(out) <- membersvector
  out
}
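Using the function makes it clear that the problem is the numbering order: cut numbers the lower branches left to right as they appear in the dendrogram, while cutree numbers the clusters in the order in which they first appear in the original data (alphabetical by state for USArrests).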
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
classify(dend1,70) #Florida 1, North Carolina 1, etc.
cutree(hc,h=70) #Alabama 1, Arizona 1, Arkansas 1, etc.
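If the goal is to make the two numberings map to each other without extra packages, the left-to-right branch numbers can be relabelled in the order in which they first appear in the original data. The following is a rough sketch under that assumption about cutree's numbering; cut_membership and its original_order argument are hypothetical names, not part of base R or dendextend.

```r
# Hypothetical helper: cut a dendrogram at height h and return a named integer
# vector in the original observation order, with clusters renumbered by first
# appearance (the numbering cutree() on an hclust object is assumed to use).
cut_membership <- function(dend, h, original_order) {
  lower <- cut(dend, h = h)$lower
  branch_labels <- lapply(lower, labels)   # leaf labels of each lower branch
  # Branch index for every leaf, in left-to-right dendrogram order
  membership <- rep(seq_along(branch_labels), lengths(branch_labels))
  names(membership) <- unlist(branch_labels)
  # Put the leaves back into the order of the original data
  membership <- membership[original_order]
  # Relabel: the first cluster encountered becomes 1, the next new one 2, ...
  out <- match(membership, unique(membership))
  names(out) <- names(membership)
  out
}

hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
renumbered <- cut_membership(dend1, h = 70, original_order = rownames(USArrests))

# Compare with cutree() on the original hclust object
all(renumbered == cutree(hc, h = 70))
```

The relabelling step (match against unique) is the only part that changes the numbering; everything before it just collects the same branch memberships that classify returns.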
Source: https://stackoverflow.com/questions/25452472/extract-labels-membership-classification-from-a-cut-dendrogram-in-r-i-e-a-c