Understanding kmeans clustering in R [closed]

Posted by 核能气质少年 on 2019-12-25 21:06:43

Question


The code below (minus my questions) generates this graph:

I have marked 4 areas of confusion with "->"

> m <- matrix(c(1,1,1) , ncol=3)
> 
> x <- rbind(matrix(c(1,0,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(1,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(1,0,1) , ncol=3),
+            matrix(c(0,1,0) , ncol=3))
> colnames(x) <- c("google", "stackoverflow", "tester")
> (cl <- kmeans(x, 3))

K-means clustering with 3 clusters of sizes 3, 10, 3
-> Where do the sizes 3, 10, 3 come from?

Cluster means:
     google stackoverflow tester
1 0.6666667           1.0      0
2 0.5000000           0.5      1
3 0.3333333           0.0      0

-> There are three clusters, but what does each number signify?

Clustering vector:
 [1] 2 2 1 2 2 3 2 2 1 3 2 3 2 2 2 1

-> This looks like it was created by summing the values in each row of 'x', but then it seems to be out of order: the second element of this vector is '2', while the second row of 'x' is matrix(c(1,1,1) , ncol=3), which sums to '3'.

Within cluster sum of squares by cluster:
[1] 0.6666667 5.0000000 0.6666667
 (between_SS / total_SS =  46.1 %)

-> What are between_SS and total_SS?

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"        
> plot(x, col = cl$cluster)
> points(cl$centers, col = 1:5, pch = 8, cex = 2)
> 

Can anyone answer these questions? From reading the description of this algorithm (http://en.wikipedia.org/wiki/K-means_clustering), I fail to see how R is computing these values.


Answer 1:


1. What do the cluster sizes mean?

You provided 16 records and told kmeans to find 3 clusters. It split those 16 records into 3 groups: A with 3 records, B with 10 records, and C with 3 records.

2. What are the cluster means?

These numbers signify the location in N-dimensional space of the centroid (the "mean") of each cluster. You have three clusters, so you have three means. You have three dimensions ("google", "stackoverflow", "tester"), so you get a value in each dimension. Reading the numbers across a row gives the location of a single centroid. For example, the centroid of cluster 2 sits at google = 0.5, stackoverflow = 0.5, tester = 1.
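To make the link between records and centroids concrete, here is a minimal sketch that recomputes the "Cluster means" table by hand. It assumes the data from the question and reuses the clustering vector printed in the question's output rather than calling kmeans again, because kmeans starts from random centers and may number the clusters differently on another run. The variable names (`cluster`, `centers`) are illustrative only.

```r
# Data from the question, one record per row.
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Cluster labels copied from the "Clustering vector" in the question's output.
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# A centroid is the column-wise mean of the records carrying that label.
centers <- t(sapply(1:3, function(k) colMeans(x[cluster == k, , drop = FALSE])))
rownames(centers) <- 1:3

print(centers)  # should reproduce the "Cluster means" table
```

Each row of `centers` is the average of the rows of `x` that carry that label, which is all the "Cluster means" table reports.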

3. What is the Clustering vector?

This is the cluster label the algorithm assigned to each record you passed it. The values are labels, not row sums, and they are in the same order as the rows of 'x': element i of the vector is the cluster of row i. Remember how earlier I said there were 3 clusters of size 3, 10, and 3? These clusters are labeled 1, 2 and 3, and the algorithm stores the label for each record in this vector. Here, you can see that there are 3 "1"s, 10 "2"s, and 3 "3"s. Does that make sense?
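As a quick check of how the sizes follow from the clustering vector, the sketch below counts the labels and lists which row numbers fall into each cluster. It uses the vector printed in the question's output; the names `sizes` and `members` are illustrative and not part of the kmeans result.

```r
# Cluster labels copied from the "Clustering vector" in the question's output.
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Counting each label gives the cluster sizes: 3, 10, 3.
sizes <- table(cluster)
print(sizes)

# Row numbers of the records in each cluster
# (position i in the vector corresponds to row i of the data).
members <- split(seq_along(cluster), cluster)
print(members)
```

On a fitted object, `table(cl$cluster)` should give the same counts as `cl$size`.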

4. What are between_SS & total_SS?

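This is notation generally used in ANOVA. total_SS is the sum of squared distances from every record to the overall mean of the data. The within-cluster sum of squares is the same quantity measured from each record to its own cluster centroid, and between_SS is what remains: total_SS minus the sum of the within-cluster values. The ratio between_SS / total_SS is therefore the share of the total variation that is accounted for by the clustering. You might find this helpful: http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HrandBlock/randBlock7.html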
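The sums of squares in the output can be recomputed by hand. The following is a rough sketch, assuming the data and the clustering vector shown in the question; the variable names mirror the component names of the kmeans result (`totss`, `withinss`, `betweenss`) but are computed here from scratch.

```r
# Data from the question, one record per row.
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE)

# Cluster labels copied from the "Clustering vector" in the question's output.
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Sum of squared deviations of the rows of m from their column means.
ss <- function(m) sum(scale(m, scale = FALSE)^2)

totss     <- ss(x)                                    # total_SS
withinss  <- sapply(1:3, function(k) ss(x[cluster == k, , drop = FALSE]))
betweenss <- totss - sum(withinss)                    # between_SS

print(withinss)                           # 0.6666667 5.0000000 0.6666667
print(round(100 * betweenss / totss, 1))  # 46.1
```

Here total_SS works out to 11.75 and the within-cluster values sum to about 6.33, so between_SS is about 5.42, which is 46.1 % of the total.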



Source: https://stackoverflow.com/questions/17531114/understanding-kmeans-clustering-in-r
