Question
Say I have a data set x and run the following k-means clustering:
fit <- kmeans(x,2)
My question concerns the output of fit$cluster: I know that it gives me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Is there a way to have the clusters labeled 1, 2, etc. in order of decreasing numerical value of their centers?
For example: if x=c(1.5,1.4,1.45,.2,.3,.3), then fit$cluster should be (1,1,1,2,2,2), not (2,2,2,1,1,1).
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3), then fit$cluster should be (1,2,1,1,2,2), not (2,1,2,2,1,1).
Right now, fit$cluster seems to assign the cluster numbers arbitrarily. I've looked through the documentation but haven't been able to find anything. Please let me know if you can help!
Answer 1:
I had a similar problem. I had a vector of ages that I wanted to split into 5 ordered groups (from youngest to oldest). I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
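library(dplyr)  # provides tibble(), arrange() and %>%
kmeans_index <- as.numeric(rownames(k5$centers))  # original cluster labels
k_means_centres <- as.numeric(k5$centers)         # centre of each cluster
# data_frame() is deprecated; tibble() is its replacement
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres)
k_means_df <- k_means_df %>% 
    arrange(centres)                              # ascending by centre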
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
  index  centres    factors
1     2 23.33770 very_young
2     5 39.15239      young
3     1 55.31727 middle_age
4     4 67.49422        old
5     3 79.38353   very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA_character_)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in seq_len(nrow(k_means_df))) {
    index_val <- k_means_df$index[i]
    factor_val <- as.character(k_means_df$factors[i])
    cluster_vals <- cluster_vals %>% 
      mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voilà: I now have a vector of labels, assigned to the arbitrarily numbered clusters according to the order of their centres.
# A tibble: 3,163 x 2
   cluster factor    
     <int> <chr>     
 1       4 old       
 2       2 very_young
 3       2 very_young
 4       2 very_young
 5       3 very_old  
 6       3 very_old  
 7       4 old       
 8       4 old       
 9       2 very_young
10       5 young     
# ... with 3,153 more rows
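As a possible shortcut, the same mapping can probably be done without the loop by handing the centre ordering to factor(). The sketch below is self-contained, so it simulates an age vector; the simulated data and the names age, group_labels and age_group are assumptions for illustration only, not part of the original data.

```r
set.seed(42)
# Hypothetical ages drawn around five well-separated means
age <- c(rnorm(50, 23, 2), rnorm(50, 39, 2), rnorm(50, 55, 2),
         rnorm(50, 67, 2), rnorm(50, 79, 2))

k5 <- kmeans(age, centers = 5, nstart = 25)

group_labels <- c("very_young", "young", "middle_age", "old", "very_old")

# order(k5$centers) lists the original cluster numbers from the smallest
# centre to the largest, so the i-th level receives the i-th label
age_group <- factor(k5$cluster,
                    levels = order(k5$centers),
                    labels = group_labels,
                    ordered = TRUE)

# Mean age per group, which should increase from very_young to very_old
tapply(age, age_group, mean)
```

The result is an ordered factor, so comparisons such as age_group < "old" work directly.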
Hope this helps.
Answer 2:
K-means is a randomized algorithm, so it is expected that the labels are neither consistent across runs nor sorted in "ascending" order. But you can of course remap the labels however you like.
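One way to do such a remapping, as a minimal sketch using the example data from the question (the variable names old_labels_by_rank and ordered_cluster are made up for illustration):

```r
x <- c(1.5, 1.4, 1.45, 0.2, 0.3, 0.3)

set.seed(1)
fit <- kmeans(x, centers = 2, nstart = 10)

# Original cluster numbers, listed from the largest centre to the smallest
old_labels_by_rank <- order(fit$centers, decreasing = TRUE)

# The position of each point's old label in that list is its new label,
# so the cluster with the largest centre always becomes cluster 1
ordered_cluster <- match(fit$cluster, old_labels_by_rank)

ordered_cluster
```

For multivariate data the same idea should apply once you pick a single number to sort the centres by, for example one column of fit$centers.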
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.
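To illustrate what exploiting the sort order can look like, here is a deliberately simple sketch that sorts the values and cuts at the k-1 largest gaps. This is not k-means and not one of the optimal 1-dimensional algorithms alluded to above; it is only an assumption-laden illustration, and all variable names are made up. A side effect of working on sorted data is that the labels come out ordered by value for free.

```r
x <- c(1.5, 0.2, 1.45, 1.4, 0.3, 0.3)
k <- 2

# Sort descending so that cluster 1 holds the largest values
sorted_x <- sort(x, decreasing = TRUE)

# Distances between neighbouring sorted values (all non-negative)
gaps <- -diff(sorted_x)

# Cut after the positions of the k-1 largest gaps
cut_after <- sort(order(gaps, decreasing = TRUE)[seq_len(k - 1)])

# Label of each sorted value: 1 plus the number of cuts before it
sorted_label <- findInterval(seq_along(sorted_x), cut_after + 1) + 1

# Put the labels back into the original order of x
label <- integer(length(x))
label[order(x, decreasing = TRUE)] <- sorted_label

label
```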
Source: https://stackoverflow.com/questions/17685327/get-ordered-kmeans-cluster-labels