Calculating the percentage of variance measure for k-means?


Question:

On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.

More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.

Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?

    points = numpy.array([[ 0,  0],
                          [ 0,  1],
                          [ 0, -1],
                          [ 1,  0],
                          [-1,  0],
                          [ 9,  9],
                          [ 9, 10],
                          [ 9,  8],
                          [10,  9],
                          [10,  8]])

    kmeans(points, 2)
    (array([[9, 8],
            [0, 0]]), 0.9414213562373096)

I am specifically looking at computing the 0.94.. measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?

In short, my questions (all related) are the following:

  • Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot?
  • How would the methodology change if a different distance function such as cosine similarity is used?

EDIT 2: Distortion

    from scipy.spatial.distance import cdist

    D = cdist(points, centroids, 'euclidean')
    sum(numpy.min(D, axis=1))

The output for the first set of points is accurate. However, when I try a different set:

    >>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
    >>> kmeans(pp, 2)
    (array([[6, 7],
            [1, 2]]), 1.1330618877807475)
    >>> centroids = numpy.array([[6,7], [1,2]])
    >>> D = cdist(pp, centroids, 'euclidean')
    >>> sum(numpy.min(D, axis=1))
    9.0644951022459797

I guess the last value does not match because kmeans seems to be dividing the value by the total number of points in the dataset.
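A quick check supports this: dividing that cdist-based sum by the number of points in pp reproduces (up to floating-point noise) the value kmeans returned.

    >>> sum(numpy.min(D, axis=1)) / len(pp)   # ~1.13306188778, the value kmeans reported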

EDIT 1: Percent Variance

My code so far (should be added to Denis's K-means implementation):

    centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

    print "Unique clusters: ", set(xtoc)
    print ""
    cluster_vars = []
    for cluster in set(xtoc):
        print "Cluster: ", cluster

        truthcondition = ([x == cluster for x in xtoc])
        distances_inside_cluster = (truthcondition * dist)

        indices = [i for i, x in enumerate(truthcondition) if x == True]
        final_distances = [distances_inside_cluster[k] for k in indices]

        print final_distances
        print np.array(final_distances).var()
        cluster_vars.append(np.array(final_distances).var())
        print ""

    print "Sum of variances: ", sum(cluster_vars)
    print "Total Variance: ", points.var()
    print "Percent: ", (100 * sum(cluster_vars) / points.var())

And the following is the output for k=2:

    Unique clusters:  set([0, 1])

    Cluster:  0
    [1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
    0.427451660041

    Cluster:  1
    [0.0, 1.0, 1.0, 1.0, 1.0]
    0.16

    Sum of variances:  0.587451660041
    Total Variance:  21.1475
    Percent:  2.77787757437

On my real dataset (does not look right to me!):

    Sum of variances:  0.0188124746402
    Total Variance:  0.00313754329764
    Percent:  599.592510943
    Unique clusters:  set([0, 1, 2, 3])

    Sum of variances:  0.0255808508714
    Total Variance:  0.00313754329764
    Percent:  815.314672809
    Unique clusters:  set([0, 1, 2, 3, 4])

    Sum of variances:  0.0588210052519
    Total Variance:  0.00313754329764
    Percent:  1874.74720416
    Unique clusters:  set([0, 1, 2, 3, 4, 5])

    Sum of variances:  0.0672406353655
    Total Variance:  0.00313754329764
    Percent:  2143.09824556
    Unique clusters:  set([0, 1, 2, 3, 4, 5, 6])

    Sum of variances:  0.0646291452839
    Total Variance:  0.00313754329764
    Percent:  2059.86465055
    Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7])

    Sum of variances:  0.0817517362176
    Total Variance:  0.00313754329764
    Percent:  2605.5970695
    Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8])

    Sum of variances:  0.0912820650486
    Total Variance:  0.00313754329764
    Percent:  2909.34837831
    Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

    Sum of variances:  0.102119601368
    Total Variance:  0.00313754329764
    Percent:  3254.76309585
    Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

    Sum of variances:  0.125549475536
    Total Variance:  0.00313754329764
    Percent:  4001.52168834
    Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

    Sum of variances:  0.138469402779
    Total Variance:  0.00313754329764
    Percent:  4413.30651542
    Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

Answer 1:

The distortion, as far as k-means is concerned, is used as a stopping criterion: if the change between two iterations is less than some threshold, we assume convergence.
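For illustration only, here is a rough sketch of that stopping rule (this is not scipy's actual implementation; the data, k and initialization are made up, though thresh mirrors the argument of the same name in scipy.cluster.vq.kmeans):

    import numpy as np
    from scipy.cluster.vq import vq

    X = np.random.rand(100, 2)                     # stand-in data
    k, thresh = 3, 1e-5                            # thresh plays the role of kmeans' thresh argument
    centroids = X[np.random.choice(len(X), k, replace=False)]

    prev = np.inf
    while True:
        labels, dists = vq(X, centroids)           # assign each point to its nearest centroid
        distortion = dists.mean()                  # mean distance to the assigned centroid
        if prev - distortion <= thresh:            # change between iterations below threshold?
            break
        prev = distortion
        # recompute centroids (ignoring the empty-cluster edge case for brevity)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])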

If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB, using the pdist2 function, but it should be straightforward to rewrite in Python/NumPy/SciPy):

    % data
    X = [0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];

    % centroids
    C = [9 8 ; 0 0];

    % euclidean distance from each point to each cluster centroid
    D = pdist2(X, C, 'euclidean');

    % find closest centroid to each point, and the corresponding distance
    [distortions, idx] = min(D, [], 2);

the result:

    % total distortion
    >> sum(distortions)
    ans =
              9.4142135623731
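A rough NumPy/SciPy equivalent of the MATLAB snippet above (a sketch using scipy.spatial.distance.cdist in place of pdist2, with the points and centroids from the question):

    import numpy as np
    from scipy.spatial.distance import cdist

    X = np.array([[ 0,  0], [ 0,  1], [ 0, -1], [ 1,  0], [-1,  0],
                  [ 9,  9], [ 9, 10], [ 9,  8], [10,  9], [10,  8]])
    C = np.array([[9, 8], [0, 0]])

    D = cdist(X, C, 'euclidean')     # distance from each point to each centroid
    distortions = D.min(axis=1)      # distance to the closest centroid
    idx = D.argmin(axis=1)           # index of the closest centroid

    print(distortions.sum())         # total distortion, ~9.4142135623731 as above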

EDIT#1:

I had some time to play around with this. Here is an example of KMeans clustering applied to the Fisher Iris dataset (4 features, 150 instances). We iterate over k=1..10, plot the elbow curve, pick K=3 as the number of clusters, and show a scatter plot of the result.

Note that I included a number of ways to compute the within-cluster variances (distortions), given the points and the centroids. The scipy.cluster.vq.kmeans function returns this measure by default (computed with Euclidean as the distance measure). You can also use the scipy.spatial.distance.cdist function to calculate the distances with the distance function of your choice (provided you obtained the cluster centroids using the same distance measure: @Denis has a solution for that), and then compute the distortion from that.

    import numpy as np
    from scipy.cluster.vq import kmeans, vq
    from scipy.spatial.distance import cdist
    import matplotlib.pyplot as plt

    # load the iris dataset
    fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
    fp = open(fName)
    X = np.loadtxt(fp)
    fp.close()

    ##### cluster data into K=1..10 clusters #####
    K = range(1,10)

    # scipy.cluster.vq.kmeans
    KM = [kmeans(X,k) for k in K]
    centroids = [cent for (cent,var) in KM]   # cluster centroids
    #avgWithinSS = [var for (cent,var) in KM] # mean within-cluster sum of squares

    # alternative: scipy.cluster.vq.vq
    #Z = [vq(X,cent) for cent in centroids]
    #avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]

    # alternative: scipy.spatial.distance.cdist
    D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
    cIdx = [np.argmin(D,axis=1) for D in D_k]
    dist = [np.min(D,axis=1) for D in D_k]
    avgWithinSS = [sum(d)/X.shape[0] for d in dist]

    ##### plot #####
    kIdx = 2

    # elbow curve
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(K, avgWithinSS, 'b*-')
    ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12,
            markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
    plt.grid(True)
    plt.xlabel('Number of clusters')
    plt.ylabel('Average within-cluster sum of squares')
    plt.title('Elbow for KMeans clustering')

    # scatter plot
    fig = plt.figure()
    ax = fig.add_subplot(111)
    #ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
    clr = ['b','g','r','c','m','y','k']
    for i in range(K[kIdx]):
        ind = (cIdx[kIdx]==i)
        ax.scatter(X[ind,2],X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
    plt.xlabel('Petal Length')
    plt.ylabel('Sepal Width')
    plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
    plt.legend()

    plt.show()


EDIT#2:

In response to the comments, I give below another complete example using the NIST hand-written digits dataset: it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above, slightly modified: Principal Component Analysis is applied to reduce the dimensionality from 64 down to 2:

    import numpy as np
    from scipy.cluster.vq import kmeans
    from scipy.spatial.distance import cdist, pdist
    from sklearn import datasets
    from sklearn.decomposition import RandomizedPCA
    from matplotlib import pyplot as plt
    from matplotlib import cm

    ##### data #####
    # load digits dataset
    data = datasets.load_digits()
    t = data['target']

    # perform PCA dimensionality reduction
    pca = RandomizedPCA(n_components=2).fit(data['data'])
    X = pca.transform(data['data'])

    ##### cluster data into K=1..20 clusters #####
    K_MAX = 20
    KK = range(1,K_MAX+1)

    KM = [kmeans(X,k) for k in KK]
    centroids = [cent for (cent,var) in KM]
    D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
    cIdx = [np.argmin(D,axis=1) for D in D_k]
    dist = [np.min(D,axis=1) for D in D_k]

    tot_withinss = [sum(d**2) for d in dist]  # Total within-cluster sum of squares
    totss = sum(pdist(X)**2)/X.shape[0]       # The total sum of squares
    betweenss = totss - tot_withinss          # The between-cluster sum of squares

    ##### plots #####
    kIdx = 9        # K=10
    clr = cm.spectral( np.linspace(0,1,10) ).tolist()
    mrk = 'os^p<dvh+x'  # ten distinct markers, one per cluster/digit

    # elbow curve
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(KK, betweenss/totss*100, 'b*-')
    ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12,
            markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
    ax.set_ylim((0,100))
    plt.grid(True)
    plt.xlabel('Number of clusters')
    plt.ylabel('Percentage of variance explained (%)')
    plt.title('Elbow for KMeans clustering')

    # show centroids for K=10 clusters
    plt.figure()
    for i in range(kIdx+1):
        img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
        ax = plt.subplot(3,4,i+1)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.imshow(img, cmap=cm.gray)
        plt.title( 'Cluster %d' % i )

    # compare K=10 clustering vs. actual digits (PCA projections)
    fig = plt.figure()
    ax = fig.add_subplot(121)
    for i in range(10):
        ind = (t==i)
        ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
    plt.legend()
    plt.title('Actual Digits')
    ax = fig.add_subplot(122)
    for i in range(kIdx+1):
        ind = (cIdx[kIdx]==i)
        ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
    plt.legend()
    plt.title('K=%d clusters'%KK[kIdx])

    plt.show()

You can see how some clusters actually correspond to distinguishable digits, while others don't match a single number.

Note: An implementation of K-means is included in scikit-learn (as well as many other clustering algorithms and various clustering metrics). Here is another similar example.
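If you go the scikit-learn route, here is a minimal sketch of the same elbow measure using sklearn.cluster.KMeans (its inertia_ attribute is the within-cluster sum of squared distances, so the ratio below plays the role of betweenss/totss above; the chosen k range and random_state are arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits

    X = load_digits().data

    ks = range(1, 21)
    within = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        within.append(km.inertia_)              # within-cluster sum of squared distances

    totss = ((X - X.mean(axis=0)) ** 2).sum()   # total sum of squares about the overall mean
    pct_explained = [100.0 * (1 - w / totss) for w in within]
    # plot list(ks) vs. pct_explained and look for the elbow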



Answer 2:

A simple cluster measure:
1) draw "sunburst" rays from each point to its nearest cluster centre,
2) look at the lengths ― distance( point, centre, metric=... ) ― of all the rays.

For metric="sqeuclidean" and 1 cluster, the average squared length is the total variance X.var(); for 2 clusters it's less, ... down to N clusters, where all the lengths are 0. "Percent of variance explained" is then 100 % minus this average, taken as a fraction of the total variance.

Code for this, under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means:

    from scipy.spatial.distance import cdist

    def distancestocentres( X, centres, metric="euclidean", p=2 ):
        """ all distances X -> nearest centre, any metric
                euclidean2 (~ withinss) is more sensitive to outliers,
                cityblock (manhattan, L1) less sensitive
        """
        D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
        return D.min(axis=1)  # all the distances

Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting, visualization, is not easy.
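For the elbow plot specifically, one simple recipe is to track np.mean of these distances for each k. A sketch, assuming distancestocentres from above is in scope (scipy's kmeans supplies the centres here, and the data is a stand-in; any routine that returns centres, e.g. kmeanssample, would do):

    import numpy as np
    from scipy.cluster.vq import kmeans

    X = np.random.rand(200, 2)                  # stand-in data
    elbow = []
    for k in range(1, 11):
        centres, _ = kmeans(X, k)               # or kmeanssample(...), etc.
        elbow.append(distancestocentres(X, centres).mean())   # one number per k
    # plot range(1, 11) vs. elbow and look for the knee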
See also stats.stackexchange.com/questions/tagged/clustering, in particular
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?


