I created a 3-dimensional random data set with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm to the data to see how well K-means can classify my sa
In addition to purity scores, consider using the following clustering metrics: Normalized Mutual Information (NMI), Variation of Information (VI) and Adjusted Rand Index (ARI). Given the predicted label assignments X and the ground truth labels Y, the NMI is defined as:
NMI(X;Y) = I(X;Y) / ((H(X)+H(Y))/2)
where H(X) is the entropy of X and I(X;Y) is the mutual information between X and Y. As the overlap between X and Y increases, the NMI approaches 1. See Matlab implementation here. Variation of Information is defined as:
VI(X;Y) = H(X)+H(Y)-2I(X;Y) = H(X|Y) + H(Y|X)
Thus, VI decreases as the overlap between label assignments X and Y increases. See Matlab implementation here. Finally, the adjusted Rand index is defined as:
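ARI = (RI - E[RI]) / (max(RI) - E[RI])
where the Rand index RI is computed over all pairs of points:
RI = (TP + TN) / (TP + FP + FN + TN)
Here TP counts the pairs placed in the same cluster in both X and Y, TN counts the pairs placed in different clusters in both, and FP and FN count the pairs on which the two assignments disagree.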
Thus, ARI approaches 1 for cluster assignments that are similar to each other. See Python implementation here.
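To make the three definitions concrete, here is a minimal NumPy sketch that computes NMI, VI and ARI from the contingency table of X and Y. It is not one of the linked implementations; the function names (contingency, nmi_vi, adjusted_rand_index) are made up for illustration, entropies use the natural logarithm, and the ARI is assumed to be the usual pair-counting form of the formula above.

```python
import numpy as np

def contingency(x, y):
    """Count how often each predicted label in x co-occurs with each true label in y."""
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)
    return table

def entropy(p):
    """Shannon entropy (natural log) of a probability vector, skipping zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def nmi_vi(x, y):
    """Return (NMI, VI) for two label assignments of the same points."""
    joint = contingency(x, y) / len(x)
    hx = entropy(joint.sum(axis=1))
    hy = entropy(joint.sum(axis=0))
    mi = hx + hy - entropy(joint.ravel())      # I(X;Y) = H(X) + H(Y) - H(X,Y)
    # Assumption: if both assignments are a single cluster, treat NMI as 1.
    nmi = mi / ((hx + hy) / 2) if hx + hy > 0 else 1.0
    vi = hx + hy - 2 * mi
    return nmi, vi

def adjusted_rand_index(x, y):
    """Pair-counting ARI: (index - expected index) / (max index - expected index)."""
    table = contingency(x, y)
    pairs = lambda n: n * (n - 1) / 2          # number of pairs among n items
    index = pairs(table).sum()                 # pairs together in both x and y
    rows = pairs(table.sum(axis=1)).sum()
    cols = pairs(table.sum(axis=0)).sum()
    expected = rows * cols / pairs(len(x))
    maximum = (rows + cols) / 2
    if maximum == expected:                    # degenerate case, e.g. one cluster each
        return 1.0
    return (index - expected) / (maximum - expected)

# Cluster labels are arbitrary, so a pure relabeling should score as a perfect match.
print(nmi_vi([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

All three scores depend only on how the points are grouped, not on the label values themselves, which is why they are convenient for comparing K-means output with known classes.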
If you are interested in choosing the number of clusters K automatically based on data, consider using Dirichlet Process (DP) K-means. See paper and code for more information.
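As a rough illustration of the idea, the sketch below follows the DP-means scheme as I understand it: run K-means-style updates, but open a new cluster whenever a point's squared distance to every existing center exceeds a penalty. It is not the linked code; the function name dp_means and the parameters lam and n_iter are assumptions chosen for this example.

```python
import numpy as np

def dp_means(data, lam, n_iter=100):
    """Hypothetical DP-means sketch; lam is the squared-distance penalty for a new cluster."""
    data = np.asarray(data, dtype=float)
    centers = data.mean(axis=0, keepdims=True)   # start with one global cluster
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iter):
        old_labels = labels.copy()
        for i, point in enumerate(data):
            dist2 = ((centers - point) ** 2).sum(axis=1)
            if dist2.min() > lam:
                # Too far from every center: this point seeds a new cluster.
                centers = np.vstack([centers, point])
                labels[i] = len(centers) - 1
            else:
                labels[i] = dist2.argmin()
        # Drop empty clusters, recompute means, and renumber labels as 0..K-1.
        used = np.unique(labels)
        centers = np.array([data[labels == k].mean(axis=0) for k in used])
        labels = np.searchsorted(used, labels)
        if np.array_equal(labels, old_labels):
            break
    return labels, centers

# Two well-separated groups; with lam = 4 the sketch should open a second cluster.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = dp_means(points, lam=4.0)
print(len(centers), labels)
```

A larger lam makes new clusters more expensive and so yields fewer of them; choosing lam replaces choosing K directly.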