Question
I want to cluster my data into, say, 5 clusters, and then select the 50 individuals with the most dissimilar relationships from all the data, drawn from the clusters in proportion to their size. That means if cluster one contains 100 individuals, two contains 200, three contains 400, four contains 200, and five contains 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done so far is cluster the data and rank the individuals within each cluster, then export the result to Excel and continue from there. That has become a problem since my data has become really big.
I would appreciate any help or suggestions on how to do the above in R.
Answer 1:
I'm not sure if it is exactly what you are looking for, but maybe it helps:
mydata<-matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100) # some id for identification
# calculate dissimilarity matrix, then cluster objects into 5 groups
sim <- dist(mydata, diag = TRUE, upper = TRUE)
cl <- cutree(hclust(sim), 5)
# combine results, take sum to aggregate dissimilarity
res <- data.frame(id=rownames(mydata),
cluster=cl, dis_sim=rowSums(as.matrix(sim)))
# order, lowest overall dissimilarity will be first
res <- res[order(res$dis_sim), ]
# split object
reslist <- split(res, f=res$cluster)
## take the three items with the highest overall dissimilarity in each cluster
lapply(reslist, tail, n=3)
## return the ids with the highest overall dissimilarity, top 20% of each cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x)*p)), p=0.2)
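The snippet above takes a fixed share of every cluster, whereas the question asks for a fixed total (50) split across clusters in proportion to their size. A minimal sketch of that variant follows. It assumes that the summed distance `dis_sim` is an acceptable measure of how dissimilar an individual is, and that plain rounding of the per-cluster quotas is good enough (the rounded quotas may not add up exactly to the target). The names `n_total`, `sizes`, `quota` and `picked` are illustrative only, and the data is simulated with 1000 rows to match the sizes mentioned in the question.

```r
set.seed(1)                                    # reproducible example data
mydata <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
rownames(mydata) <- paste0("id", seq_len(nrow(mydata)))

sim <- dist(mydata)                            # pairwise Euclidean distances
cl  <- cutree(hclust(sim), k = 5)              # 5 clusters
res <- data.frame(id = rownames(mydata), cluster = cl,
                  dis_sim = rowSums(as.matrix(sim)))

n_total <- 50                                  # assumed target sample size
sizes   <- table(res$cluster)                  # number of rows per cluster
# proportional share per cluster; rounding means sum(quota) can differ from n_total
quota   <- setNames(as.integer(round(n_total * sizes / sum(sizes))), names(sizes))

# within each cluster, keep the quota[k] rows with the largest dis_sim
picked <- do.call(rbind, lapply(split(res, res$cluster), function(x) {
  k <- as.character(x$cluster[1])
  head(x[order(x$dis_sim, decreasing = TRUE), ], quota[[k]])
}))
```

`picked$id` then holds the selected individuals. If the total has to be exactly 50, the quotas would need a small adjustment after rounding, for example by adding or removing one individual in the largest cluster.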
Answer 2:
Regarding your comment, find the code below.
Please note that the code can be improved in terms of elegance and efficiency. I also posted it as a second answer, because otherwise it would be too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper=TRUE))
cent <- tapply(mydata,
list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])
# add up cluster number and data and split by cluster
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5))
newdfl <- split(newdf, f=newdf$cluster)
# add centroids and drop cluster info
totaldf <- lapply(1:5,
  # the cluster column is the last one, so drop it by position
  function(i, li, cen) rbind(cen[i, ], li[[i]][ , -ncol(li[[i]])]),
  li=newdfl, cen=cent)
# calculate distance to centroids and sort them
dist_to_cent <- lapply(totaldf, function(x)
sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ]))
dist_to_cent
For the calculation of centroids from an hclust result, see the R-help mailing list post linked in the code comment above.
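Since the question mentions that the data has grown large, the distance to the centroid can also be computed without building a separate distance matrix for every cluster. The following is only a sketch under the assumption that Euclidean distance (the `dist()` default) is wanted; `grp`, `d_cent` and `dist_by_cluster` are illustrative names, not part of the code above.

```r
set.seed(1)                                    # reproducible example data
mydata <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", seq_len(nrow(mydata)))

grp <- cutree(hclust(dist(mydata)), k = 5)     # cluster label per row

# centroid of each cluster: per-group column sums divided by group size
cent <- rowsum(mydata, grp) / as.vector(table(grp))

# Euclidean distance from each row to the centroid of its own cluster
d_cent <- sqrt(rowSums((mydata - cent[as.character(grp), , drop = FALSE])^2))

# per cluster, distances sorted from nearest to farthest from the centroid
dist_by_cluster <- lapply(split(d_cent, grp), sort)
```

Each element of `dist_by_cluster` is a named vector, so `tail(dist_by_cluster[["1"]], 3)` would list the three individuals of cluster 1 that lie farthest from its centroid.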
Source: https://stackoverflow.com/questions/19225102/select-the-most-dissimilar-individual-using-cluster-analysis