On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.
A simple cluster measure:
1) draw "sunburst" rays from each point to its nearest cluster centre,
2) look at the lengths — distance( point, centre, metric=... ) — of all the rays.
For metric="sqeuclidean"
and 1 cluster,
the average length-squared is the total variance X.var()
; for 2 clusters, it's less ... down to N clusters, lengths all 0.
"Percent of variance explained" is 100 % - this average.
Code for this (posted under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means):
from scipy.spatial.distance import cdist

def distancestocentres( X, centres, metric="euclidean", p=2 ):
    """ all distances X -> nearest centre, any metric
        euclidean2 (~ withinss) is more sensitive to outliers,
        cityblock (manhattan, L1) less sensitive
    """
    kwargs = dict(p=p) if metric == "minkowski" else {}  # cdist accepts p for minkowski only
    D = cdist( X, centres, metric=metric, **kwargs )  # |X| x |centres|
    return D.min(axis=1)  # distance from each point to its nearest centre
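A minimal sketch of putting the two together, not from the original post: the toy two-blob data, the k range 1 .. 7 and the call to scipy.cluster.vq.kmeans are my assumptions, chosen just to show the percent-of-variance-explained curve flattening past the "right" k:

import numpy as np
from scipy.cluster.vq import kmeans

np.random.seed(0)
X = np.r_[np.random.randn(500, 2), np.random.randn(500, 2) + 5]  # made-up data: two blobs

totalvar = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()  # average squared ray length for 1 cluster
for k in range(1, 8):
    centres, _ = kmeans(X, k)                                   # scipy k-means, k centres
    d2 = distancestocentres(X, centres, metric="sqeuclidean")   # squared ray lengths
    print("k %d: %.0f %% of variance explained" % (k, 100 * (1 - d2.mean() / totalvar)))

The k after which that percentage stops improving much is the elbow.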
Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting and visualization are not easy.
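For instance, continuing the sketch above (reusing X and the last centres; the bin count and the crude text histogram are my own choices, not from the original answer):

import numpy as np

d = distancestocentres(X, centres)            # plain euclidean ray lengths
print("mean ray length:", np.mean(d))
counts, edges = np.histogram(d, bins=20)      # how the lengths are distributed
for c, lo in zip(counts, edges[:-1]):
    print("%6.2f  %s" % (lo, "*" * int(40 * c / counts.max())))  # rough text histogram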
See also stats.stackexchange.com/questions/tagged/clustering, in particular
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?