Calculating the percentage of variance measure for k-means?

终归单人心 2020-12-07 12:39

On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.

2 Answers
  •  抹茶落季
    2020-12-07 13:32

    A simple cluster measure:
    1) draw "sunburst" rays from each point to its nearest cluster centre,
    2) look at the lengths — distance( point, centre, metric=... ) — of all the rays.

    For metric="sqeuclidean" and 1 cluster, the average squared ray length is the total variance X.var(); for 2 clusters it is less ... down to N clusters, where all lengths are 0. "Percent of variance explained" is then 100 % times (1 - this average / the total variance).

    Code for this, posted under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means:

    from scipy.spatial.distance import cdist

    def distancestocentres( X, centres, metric="euclidean", p=2 ):
        """ all distances X -> nearest centre, for any cdist metric
                sqeuclidean (~ withinss) is more sensitive to outliers,
                cityblock (manhattan, L1) less sensitive
        """
        kw = dict(p=p) if metric == "minkowski" else {}  # p applies to minkowski only
        D = cdist( X, centres, metric=metric, **kw )  # |X| x |centres| distance matrix
        return D.min(axis=1)  # distance from each point to its nearest centre
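
    A minimal usage sketch of the percent-of-variance / elbow curve, assuming scipy.cluster.vq.kmeans and random stand-in data (replace X with your own points; distancestocentres is the helper above):

    import numpy as np
    from scipy.cluster.vq import kmeans

    np.random.seed(0)
    X = np.random.randn(1000, 4)        # stand-in data: |X| points x dim
    totalvar = X.var(axis=0).sum()      # mean squared distance to the grand mean

    for k in range(1, 8):
        centres, _ = kmeans(X, k)       # scipy k-means: centres is ~ k x dim
        d2 = distancestocentres(X, centres, metric="sqeuclidean")
        explained = 100 * (1 - d2.mean() / totalvar)
        print("k %d: %.1f %% of variance explained" % (k, explained))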
    

    Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting and visualizing them is not easy.
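
    For instance, a quick numeric summary (a hypothetical follow-on to the sketch above):

    d = distancestocentres(X, centres)        # plain euclidean ray lengths
    print("mean ray length: %.3g" % d.mean())
    counts, edges = np.histogram(d, bins=10)  # coarse distribution of the lengths
    print(counts)
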
    See also stats.stackexchange.com/questions/tagged/clustering, in particular
    How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?
