问题
I have searched here and googled, but to no avail. When clustering in Weka there is a handy option, classes to clusters, which matches up the clusters produced by the algorithm e.g. simple k-means, to the 'ground truth' class labels you supply as the class attribute. So that we can see cluster accuracy (% incorrect).
Now, how can I achieve this in Matlab, i.e. translate my clusterClasses vector e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1] into the same index as the supplied ground truth labels vector e.g. [2, 2, 2, 3, 1, 3]?
I think it is probably based on cluster centres and label centres, but I'm not sure how to implement!
Any help would be greatly appreciated.
Vincent
回答1:
I stumbled on a similar problem a couple of months ago while doing clustering. I did not search for built in solutions very long (although I am sure they must exist) and ended up writing my own little script for matching my found labels the best with the ground truth. The code is very crude, but it should get you started.
It is based on trying all possible rearrangements of the labels to see witch best fit the truth vector. That means that given a clustering result yte = [3 3 2 1] with ground truth y = [1 1 2 3], the script will try to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] with y to find the best match.
This is based on using the built in script perms() witch can not handle more than 10 unique clusters. The code can also tend to be slow for 7-10 unique clusters, as the complexity grows as a factorial. 
function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuray and matching found 
%# labels with true labels. Assumes yte and y both are Nx1 vectors with
%# clustering labels. Does not support fuzzy clustering.
%#
%# Algorithm is based on trying out all reorderings of cluster labels, 
%# e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] so see witch fit 
%# the truth vector the best. Since this approach makes use of perms(),
%# the code will not run for unique(yte) greater than 10, and it will slow
%# down significantly for number of clusters greater than 7.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    -   Overall accuracy for entire clustering (OA). For
%#                   overall error, use OE = 1 - OA.
%#   true_labels -   Vector giving the label rearangement witch best 
%#                   match the truth vector (y).
%#   CM          -   Confusion matrix. If unique(yte) = 4, produce a
%#                   4x4 matrix of the number of different errors and  
%#                   correct clusterings done.
N = length(y);
cluster_names = unique(yte);
accuracy = 0;
maxInd = 1;
perm = perms(unique(y));
[pN pM] = size(perm);
true_labels = y;
for i=1:pN
    flipped_labels = zeros(1,N);
    for cl = 1 : pM
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end
    testAcc = sum(flipped_labels == y')/N;
    if testAcc > accuracy
        accuracy = testAcc;
        maxInd = i;
        true_labels = flipped_labels;
    end
end
CM = zeros(pM,pM);
for rc = 1 : pM
    for cc = 1 : pM
        CM(rc,cc) = sum( ((y'==rc) .* (true_labels==cc)) );
    end
end
Example:
[acc newLabels CM] = calculateAccuracy([3 2 2 1 2 3]',[1 2 2 3 3 3]')
acc =
0.6667
newLabels =
 1     2     2     3     2     1
CM =
 1     0     0
 0     2     0
 1     1     1
回答2:
You might want to look into more flexible way of evaluating clusters. For example pair counting metrics.
The "class = cluster" assumption is typical for people coming from machine learning into clustering. But instead you should assume that some classes may consist of multiple clusters, or that multiple classes actually cluster. These are the interesting situations a clustering algorithm should actually detect.
回答3:
I needed this exact thing for Python and converted the code posted by Vidar (accepted answer). I share the code to anyone interested. I renamed the variables and removed the confusion matrix (most libraries for machine learnine have built in functions for that anyway). I noticed the faster implementation linked by Vincent (http://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement) too late. Probably better adapt that one for Python.
#tested with python 3.6
def remap_labels(pred_labels, true_labels):
    """Rename prediction labels (clustered output) to best match true labels."""
    # from itertools import permutations # import this into script.
    pred_labels, true_labels = np.array(pred_labels), np.array(true_labels)
    assert pred_labels.ndim == 1 == true_labels.ndim
    assert len(pred_labels) == len(true_labels)
    cluster_names = np.unique(pred_labels)
    accuracy = 0
    perms = np.array(list(permutations(np.unique(true_labels))))
    remapped_labels = true_labels
    for perm in perms:
        flipped_labels = np.zeros(len(true_labels))
        for label_index, label in enumerate(cluster_names):
            flipped_labels[pred_labels == label] = perm[label_index]
        testAcc = np.sum(flipped_labels == true_labels) / len(true_labels)
        if testAcc > accuracy:
            accuracy = testAcc
            remapped_labels = flipped_labels
    return accuracy, remapped_labels
来源:https://stackoverflow.com/questions/11683785/how-can-i-match-up-cluster-labels-to-my-ground-truth-labels-in-matlab