Question
I have two distance matrices, each 232×232, where the column and row labels are identical. This is an abridged version of the two, where A, B, C and D are the names of the points between which the distances are measured:
Matrix 1                 Matrix 2
    A  B  C  D  ...          A  B  C  D  ...
A   0  1  5  3           A   0  5  3  9
B   4  0  4  1           B   2  0  7  8
C   2  6  0  3           C   2  6  0  1
D   2  7  1  0           D   5  2  5  0
...                      ...
The two matrices therefore represent the distances between pairs of points in two different networks. I want to identify clusters of pairs that are close together in one network and far apart in the other. I attempted this by first normalizing the distances in each matrix, dividing every distance by the largest distance in that matrix. I then subtracted one matrix from the other and applied a clustering algorithm to the resulting matrix. The algorithm I was advised to use for this was k-means. The hope was that I could identify clusters of positive numbers, corresponding to pairs that are very close in matrix one and far apart in matrix two, and vice versa for clusters of negative numbers.
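For concreteness, here is a minimal sketch of that normalization and subtraction step (the input file names are hypothetical, and the subtraction order assumes positive values should mean "close in network 1, far apart in network 2"):

import numpy as np

# Hypothetical file names for the two 232x232 distance matrices.
m1 = np.load('network1_distances.npy')
m2 = np.load('network2_distances.npy')

# Scale each matrix by its own largest distance so both lie in [0, 1].
m1_scaled = m1 / m1.max()
m2_scaled = m2 / m2.max()

# With this subtraction order, positive entries are pairs that are
# close in network 1 but far apart in network 2, and vice versa.
difference_matrix = m2_scaled - m1_scaled
np.save('difference_matrix_file.npy', difference_matrix)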
Firstly, I've read quite a bit about how to implement k-means in Python, and I'm aware that there are multiple modules that can be used. I've tried these three:
1.
import sklearn.cluster
import numpy as np
data = np.load('difference_matrix_file.npy') #loads difference matrix from file
a = np.array([x[0:] for x in data])
clust_centers = 3
model = sklearn.cluster.k_means(a, clust_centers)
print(model)
2.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
data = pd.DataFrame(difference_matrix)
model = KMeans(n_clusters=3)
print(model.fit(data))
3.
import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten
np.set_printoptions(threshold=np.inf) #print full arrays without truncation
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
whitened = whiten(difference_matrix)
centroids = kmeans(whitened, 3)
print(centroids)
What I'm struggling with is how to interpret the output from these scripts. (I might add at this point that I'm neither a mathematician nor a computer scientist, if the reader hadn't already guessed.) I was expecting the output of the algorithm to be lists of coordinates of clustered pairs, one for each cluster (so three in this case), that I could then trace back to my two original matrices and identify the names of the pairs of interest.
However, what I get is an array containing a list of numbers (one for each cluster), but I don't really understand what these numbers are. They don't obviously correspond to what I had in my input matrix, other than that there are 232 items in each list, which is the same as the number of rows and columns in the input matrix. The final item in the array is another single number, which I presume must be the centroid of the clusters, but there isn't one for each cluster, just one for the whole array.
I've been trying to figure this out for quite a while now, but I'm struggling to get anywhere. Whenever I search for how to interpret the output of k-means, I just get explanations of how to plot clusters on a graph, which isn't what I want to do. Can someone please explain what I'm seeing in my output and how I can get from this to the coordinates of the items in each cluster?
Answer 1:
You have two issues here, and the recommendation to use k-means probably was not a very good one...
K-means expects a coordinate data matrix, not a distance matrix.
In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.
If you compute the difference of two distance matrices, small values correspond to points that have a similar distance in both. These could still be very far away from each other! So if you use this matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0 and will thus be considered identical now.
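To make that concrete, a tiny toy example (numbers invented) where A and B are the farthest-apart pair in both networks:

import numpy as np

# A and B are maximally far apart in both toy networks.
d1 = np.array([[0., 10.],
               [10., 0.]])
d2 = np.array([[0., 4.],
               [4., 0.]])

# After scaling each matrix by its own maximum, the A-B entry is 1.0
# in both, so the difference is 0 -- the pair now looks "identical"
# even though it is the most distant pair in both networks.
diff = d1 / d1.max() - d2 / d2.max()
print(diff[0, 1])  # 0.0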
So you haven't understood the input of k-means; no wonder you do not understand the output.
I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation that accepts a similarity matrix; the usual implementations, which expect a distance matrix, will not work as-is (see the sketch below).
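As one possible workaround (my assumption, not the only option): convert the similarity matrix into a dissimilarity matrix and feed that to scipy's hierarchical clustering, which accepts a precomputed condensed distance vector:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

diff = np.load('difference_matrix_file.npy')

# Treat the absolute difference as a similarity: a large value means the
# pair behaves very differently in the two networks.
sim = np.abs(diff)
sim = (sim + sim.T) / 2.0  # symmetrize, in case of numerical asymmetry

# Convert similarity to dissimilarity so a distance-based implementation
# applies; zero the diagonal as a proper dissimilarity requires.
dissim = sim.max() - sim
np.fill_diagonal(dissim, 0.0)

# linkage() wants a condensed distance vector, hence squareform().
Z = linkage(squareform(dissim, checks=False), method='average')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(labels)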
Answer 2:
Disclaimer: below, I tried to answer your question about how to interpret what the functions return and how to get the points in a cluster from that. I agree with @Anony-Mousse in that if you have a distance / similarity matrix (as opposed to a feature matrix), you will want to use different techniques, such as spectral clustering.
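For illustration, a minimal sketch of that alternative using scikit-learn's SpectralClustering, which accepts a precomputed affinity (similarity) matrix; using the absolute difference as the affinity is my assumption:

import numpy as np
from sklearn.cluster import SpectralClustering

diff = np.load('difference_matrix_file.npy')

# Build a symmetric, non-negative affinity matrix from the differences.
affinity = np.abs(diff)
affinity = (affinity + affinity.T) / 2.0

model = SpectralClustering(n_clusters=3, affinity='precomputed')
labels = model.fit_predict(affinity)  # one cluster label per point
print(labels)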
Sorry for being blunt; I also hate "RTFM"-type answers, but the functions you used are well documented at:
- sklearn.cluster
- scipy.cluster.vq
In short,
- sklearn.cluster.k_means() returns a tuple with three fields:
  - an array with the centroids (that should be 3x232 for you),
  - the label assignment for each point (i.e. a 232-long array with values 0-2),
  - and "inertia", a measure of how good the clustering is; there are several measures for that, so you might be better off not paying too much attention to this;
- scipy.cluster.vq.kmeans2() returns a tuple with two fields:
  - the cluster centroids (as above),
  - the label assignment (as above);
- kmeans() returns a "distortion" value instead of the label assignment, so I would definitely use kmeans2().
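To make those return values concrete, a small sketch unpacking both (reusing the file name from the question, with n_clusters=3 as in your scripts):

import numpy as np
from sklearn.cluster import k_means
from scipy.cluster.vq import kmeans2

data = np.load('difference_matrix_file.npy')

# sklearn: three fields -- centroids, labels, inertia.
centroids, labels, inertia = k_means(data, n_clusters=3)
print(centroids.shape)  # (3, 232)
print(labels.shape)     # (232,) with values in {0, 1, 2}

# scipy: kmeans2 returns centroids and labels only.
centroids2, labels2 = kmeans2(data, 3)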
As for how to get to the coordinates of the points in each cluster, you could:
for cc in range(clust_centers):
    print('Points for cluster {}:\n{}'.format(cc, data[model[1] == cc]))
where model is the tuple returned by either sklearn.cluster.k_means or scipy.cluster.vq.kmeans2, and data is a points x coordinates array (difference_matrix in your case).
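If you keep the point names in a list ordered like the matrix rows (the names below are hypothetical placeholders), you can map each cluster back to names rather than coordinates:

import numpy as np
from sklearn.cluster import k_means

data = np.load('difference_matrix_file.npy')
point_names = ['A', 'B', 'C', 'D']  # hypothetical; your 232 names, in row order

centroids, labels, inertia = k_means(data, n_clusters=3)

# Group the original row names by their assigned cluster.
for cc in range(3):
    members = [name for name, lab in zip(point_names, labels) if lab == cc]
    print('Names in cluster {}: {}'.format(cc, members))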
Source: https://stackoverflow.com/questions/43228355/understanding-output-from-kmeans-clustering-in-python