Return the furthermost outlier in kmeans clustering? [closed]

Question

Is there any easy way to return the furthermost outlier after sklearn kmeans clustering?

Essentially I want to make a list of the biggest outliers for a load of clusters. Unfortunately I need to use sklearn.cluster.KMeans due to the assignment.

Answer 1:

Sascha basically gives it away in the comments, but if X denotes your data and model is the fitted KMeans instance, you can sort the rows of X by the distance to their assigned cluster centers (ascending, so the furthest points come last) through

X[np.argsort(np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1))]
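To get the furthest point of each cluster rather than one global ordering, the same distance vector can be grouped by label. What follows is a minimal sketch on synthetic data; the blob layout, the two planted far-away points, and the names dist, order, and furthest_per_cluster are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data (assumed for illustration): two tight blobs plus one
# far-away point near each blob.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
    [[0.0, 4.0]],    # planted outlier, row index 100
    [[10.0, 14.0]],  # planted outlier, row index 101
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to the center of the cluster it was assigned to.
dist = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

# argsort is ascending, so reverse it to list the furthest points first.
order = np.argsort(dist)[::-1]
top_outliers = X[order[:5]]

# Row index of the furthest point within each cluster.
furthest_per_cluster = {}
for k in range(model.n_clusters):
    members = np.flatnonzero(model.labels_ == k)
    furthest_per_cluster[k] = int(members[np.argmax(dist[members])])
```

Here furthest_per_cluster maps each cluster label to the row index of its most distant member, and top_outliers holds the five most distant rows overall.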

Alternatively, since you know that each point is assigned to the cluster whose center minimizes Euclidean distance to the point, you can fit and sort in one step through

X[np.argsort(np.min(KMeans(n_clusters=2).fit_transform(X), axis=1))]
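The two expressions rank points by the same quantity, because fit_transform returns the distance from each point to every center and the smallest entry in a row belongs to the assigned cluster. A small check of that equivalence, again on made-up data (the array X and the variable names are assumptions for this sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic data, assumed for illustration only.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(40, 2)),
    rng.normal(loc=(8.0, 8.0), scale=1.0, size=(40, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_transform fits the model and returns an (n_samples, n_clusters)
# matrix of distances to every cluster center.
D = model.fit_transform(X)
d_min = D.min(axis=1)

# Distance to the assigned center, computed explicitly.
d_assigned = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

same_distances = np.allclose(d_min, d_assigned, atol=1e-6)

# Furthest points first.
X_sorted = X[np.argsort(d_min)[::-1]]
```

Keeping the fitted model around, rather than constructing KMeans inline, also leaves labels_ and cluster_centers_ available afterwards.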

Answer 2:

K-means is not well suited for "outlier" detection.

K-means tends to put an outlier into a one-element cluster of its own. The outlier then sits on its own center, has the smallest possible distance (zero), and will not be detected.
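This effect is easy to reproduce. The sketch below uses synthetic data with one deliberately extreme point and one more cluster than there are blobs; the data, the choice of n_clusters=3, and the variable names are assumptions chosen to make the behaviour visible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight blobs and a single extreme point (synthetic, for illustration).
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 0.0), scale=0.5, size=(50, 2)),
    [[1000.0, 1000.0]],  # extreme outlier, row index 100
])

# One cluster more than there are blobs: the spare center lands on the outlier.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

dist = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

# Number of points sharing a cluster with the outlier (including itself).
outlier_cluster_size = int(np.sum(model.labels_ == model.labels_[100]))
outlier_distance = float(dist[100])
```

With this setup the extreme point forms a cluster by itself, so its distance to its center is essentially zero and a distance-based ranking places it last instead of first.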

K-means is not robust enough when there are outliers in your data. You may actually want to remove outliers prior to using k-means.

Use something like kNN, LOF, or LoOP instead.
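As one example of such an alternative, scikit-learn ships LOF as sklearn.neighbors.LocalOutlierFactor. The following is a rough sketch on the same kind of synthetic data; n_neighbors=20 and the variable names are assumptions, and a real data set would need its own tuning.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Two tight blobs and a single extreme point (synthetic, for illustration).
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 0.0), scale=0.5, size=(50, 2)),
    [[1000.0, 1000.0]],  # extreme outlier, row index 100
])

lof = LocalOutlierFactor(n_neighbors=20)

# fit_predict returns 1 for inliers and -1 for outliers.
flags = lof.fit_predict(X)

# negative_outlier_factor_ is the negated LOF score; flip the sign so that
# larger values mean "more outlying".
scores = -lof.negative_outlier_factor_

# Row indices ordered from most to least outlying.
ranking = np.argsort(scores)[::-1]
```

Unlike the k-means distance ranking, the score here does not depend on a cluster count, so the extreme point cannot hide in a cluster of its own.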

Source: https://stackoverflow.com/questions/47489705/return-the-furthermost-outlier-in-kmeans-clustering

Tags

python

scikit-learn

cluster-analysis