Evaluating K-means accuracy

猫巷女王i 2020-12-10 00:10

I created a 3-dimensional random data set with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm to the data to see how well K-means can classify my sa

2 answers
  •  既然无缘
    2020-12-10 00:51

    In addition to purity scores, consider using the following clustering metrics: Normalized Mutual Information (NMI), Variation of Information (VI) and Adjusted Rand Index (ARI). Given the predicted label assignments X and the ground truth labels Y, the NMI is defined as:

    NMI(X;Y) = I(X;Y) / ((H(X) + H(Y)) / 2)
    

    where H(X) is the entropy of X and I(X;Y) is the mutual information between X and Y. As the overlap between X and Y increases, the NMI approaches 1. See MATLAB implementation here. Variation of Information is defined as:

    VI(X;Y) = H(X)+H(Y)-2I(X;Y) = H(X|Y) + H(Y|X)
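
The entropy-based definitions of NMI and VI above can be sketched in plain Python. This is a minimal illustration, not the linked MATLAB code: the function names are my own, entropies use the natural logarithm (so VI is in nats), and returning 1.0 when both labelings have zero entropy is an assumed convention.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H (in nats) of a sequence of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information I(X;Y) (in nats) between two labelings."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    # p(a,b) * log(p(a,b) / (p(a) * p(b))), written with raw counts
    return sum(
        (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def nmi(x, y):
    """NMI(X;Y) = I(X;Y) / ((H(X) + H(Y)) / 2)."""
    denom = (entropy(x) + entropy(y)) / 2
    # Assumed convention: two single-cluster labelings count as identical.
    return mutual_information(x, y) / denom if denom > 0 else 1.0


def variation_of_information(x, y):
    """VI(X;Y) = H(X) + H(Y) - 2 I(X;Y)."""
    return entropy(x) + entropy(y) - 2 * mutual_information(x, y)
```

Both scores ignore the names of the cluster IDs, so a prediction that matches the ground truth up to a relabeling gives an NMI of 1 and a VI of 0, up to floating-point rounding.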
    

    Thus, VI decreases as the overlap between label assignments X and Y increases. See Matlab implementation here. Finally, the adjusted Rand index is defined as:

    ARI = (RI - E[RI]) / (max(RI) - E[RI])
    RI = (TP + TN) / (TP + FP + FN + TN)
    

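    Here TP, TN, FP and FN count pairs of samples: a pair is a true positive if both labelings put its two samples in the same cluster, and a true negative if both put them in different clusters. Thus, ARI approaches 1 for cluster assignments that are similar to each other, and is close to 0 for random assignments. See Python implementation here.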
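
A minimal sketch of the Rand index and its adjusted form, using only the standard library. It is not the linked Python implementation. It counts pairs of samples (a true positive is a pair placed together by both labelings), uses the standard pair-counting form of E[RI] and max(RI), and assumes at least two samples. The function names are my own.

```python
from collections import Counter
from math import comb


def pair_counts(pred, truth):
    """Return (TP, FP, FN, TN) counted over all pairs of samples."""
    total = comb(len(pred), 2)
    # Pairs grouped together by both labelings.
    tp = sum(comb(c, 2) for c in Counter(zip(pred, truth)).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())    # TP + FP
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())  # TP + FN
    fp = same_pred - tp
    fn = same_truth - tp
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


def rand_index(pred, truth):
    """RI = (TP + TN) / (TP + FP + FN + TN)."""
    tp, fp, fn, tn = pair_counts(pred, truth)
    return (tp + tn) / (tp + fp + fn + tn)


def adjusted_rand_index(pred, truth):
    """Chance-corrected Rand index, written in terms of pair counts."""
    tp, fp, fn, tn = pair_counts(pred, truth)
    total = tp + fp + fn + tn
    same_pred, same_truth = tp + fp, tp + fn
    expected = same_pred * same_truth / total  # expected agreement by chance
    maximum = (same_pred + same_truth) / 2     # largest attainable agreement
    if maximum == expected:
        # Assumed convention for degenerate labelings (e.g. one cluster each).
        return 1.0
    return (tp - expected) / (maximum - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])` evaluates to 4/7 (about 0.57), while the unadjusted `rand_index` for the same labelings is 5/6.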

    If you are interested in choosing the number of clusters K automatically based on data, consider using Dirichlet Process (DP) K-means. See paper and code for more information.
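
As a rough illustration of how such a method can pick K from the data, here is a sketch of the DP-means idea as I understand it: run K-means-style updates, but open a new cluster whenever a point is farther than a penalty from every existing centroid. This is not the linked code. The name `dp_means`, the parameter `lam` (a threshold on squared Euclidean distance) and the `max_iter` cap are assumptions made for the example.

```python
def dp_means(points, lam, max_iter=100):
    """Cluster equal-length tuples; returns (labels, centroids).

    A point whose squared distance to every centroid exceeds `lam`
    starts a new cluster, so the number of clusters depends on `lam`.
    """
    dim = len(points[0])
    # Start with a single cluster at the global mean.
    centroids = [tuple(sum(p[d] for p in points) / len(points) for d in range(dim))]
    labels = [0] * len(points)

    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            k = min(range(len(dists)), key=dists.__getitem__)
            if dists[k] > lam:
                # Too far from everything: open a new cluster at this point.
                centroids.append(tuple(p))
                k = len(centroids) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True

        # Recompute centroids as cluster means and drop empty clusters.
        members = {}
        for lab, p in zip(labels, points):
            members.setdefault(lab, []).append(p)
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centroids = [
            tuple(sum(p[d] for p in members[old]) / len(members[old]) for d in range(dim))
            for old in kept
        ]
        labels = [remap[lab] for lab in labels]

        if not changed:
            break
    return labels, centroids
```

A small `lam` yields many clusters and a large `lam` yields few, so the penalty takes over the role that K plays in ordinary K-means and still has to be chosen, for example by scoring the resulting clusterings with the metrics above.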
