On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.
A simple cluster measure:
1) draw "sunburst" rays from each point to its nearest cluster centre,
2) look at the lengths — distance( point, centre, metric=... ) — of all the rays.
For metric="sqeuclidean"
and 1 cluster,
the average length-squared is the total variance X.var()
; for 2 clusters, it's less ... down to N clusters, lengths all 0.
"Percent of variance explained" is 100 % - this average.
Code for this (posted under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means):
from scipy.spatial.distance import cdist

def distancestocentres( X, centres, metric="euclidean", p=2 ):
    """ all distances X -> nearest centre, any metric
        euclidean2 (~ withinss) is more sensitive to outliers,
        cityblock (manhattan, L1) less sensitive
    """
    kwargs = dict(p=p) if metric == "minkowski" else {}  # cdist accepts p for minkowski only
    D = cdist( X, centres, metric=metric, **kwargs )  # |X| x |centres|
    return D.min(axis=1)  # distance from each point to its nearest centre
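A minimal sketch of putting the two together, not from the original post: the toy two-blob data, the k range 1 .. 7 and the call to scipy.cluster.vq.kmeans are my assumptions, chosen just to show the percent-of-variance-explained curve flattening past the "right" k:

import numpy as np
from scipy.cluster.vq import kmeans

np.random.seed(0)
X = np.r_[np.random.randn(500, 2), np.random.randn(500, 2) + 5]  # made-up data: two blobs

totalvar = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()  # average squared ray length for 1 cluster
for k in range(1, 8):
    centres, _ = kmeans(X, k)                                   # scipy k-means, k centres
    d2 = distancestocentres(X, centres, metric="sqeuclidean")   # squared ray lengths
    print("k %d: %.0f %% of variance explained" % (k, 100 * (1 - d2.mean() / totalvar)))

The k after which that percentage stops improving much is the elbow.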
Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting and visualization are not easy.
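For instance, continuing the sketch above (reusing X and the last centres; the bin count and the crude text histogram are my own choices, not from the original answer):

import numpy as np

d = distancestocentres(X, centres)            # plain euclidean ray lengths
print("mean ray length:", np.mean(d))
counts, edges = np.histogram(d, bins=20)      # how the lengths are distributed
for c, lo in zip(counts, edges[:-1]):
    print("%6.2f  %s" % (lo, "*" * int(40 * c / counts.max())))  # rough text histogram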
See also stats.stackexchange.com/questions/tagged/clustering, in particular
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?