Question
Is there any easy way to return the furthermost outlier after sklearn kmeans clustering?
Essentially I want to make a list of the biggest outliers for a load of clusters. Unfortunately I need to use sklearn.cluster.KMeans due to the assignment.
Answer 1:
Sascha basically gives it away in the comments, but if X denotes your data and model the fitted instance of KMeans, you can sort the rows of X by the distance to their assigned centers (ascending, so the furthest points come last) through
X[np.argsort(np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1))]
Alternatively, since you know that each point is assigned to the cluster whose center minimizes Euclidean distance to the point, you can fit and sort in one step through
X[np.argsort(np.min(KMeans(n_clusters=2).fit_transform(X), axis=1))]
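If you want the single furthest point in each cluster rather than one global ordering, the same distance array can be grouped by label. The following is only a rough sketch: the helper name furthest_point_per_cluster and the toy data are made up for illustration and are not part of sklearn.

```python
import numpy as np
from sklearn.cluster import KMeans


def furthest_point_per_cluster(X, model):
    """Return {cluster label: (row index in X, distance to its center)}.

    Hypothetical helper, not an sklearn function.
    """
    # Distance from every point to the center of the cluster it was assigned to.
    dists = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)
    result = {}
    for label in np.unique(model.labels_):
        # Row indices of the points that belong to this cluster.
        members = np.flatnonzero(model.labels_ == label)
        # The member with the largest distance to its own center.
        worst = members[np.argmax(dists[members])]
        result[int(label)] = (int(worst), float(dists[worst]))
    return result


# Toy data (made up): two tight blobs plus one stray point at row 40.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.1, size=(20, 2)),
    [[12.0, 12.0]],
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
outliers = furthest_point_per_cluster(X, model)
```

Here outliers maps each cluster label to the row index and distance of its furthest member, so X[idx] recovers the point itself.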
Answer 2:
K-means is not well suited for "outlier" detection.
K-means has a tendency to turn an outlier into a one-element cluster. Such an outlier then sits exactly on its own center, at distance zero, and will not be detected by ranking points on distance to their center.
K-means is not robust enough when there are outliers in your data. You may actually want to remove outliers prior to using k-means.
Use something like kNN, LOF or LoOP instead.
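For what it's worth, LOF is available in sklearn as sklearn.neighbors.LocalOutlierFactor. A minimal sketch of ranking points with it, assuming made-up toy data and an arbitrary n_neighbors=10 that you would tune for real data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (made up): one tight blob plus a stray point at row 30.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(30, 2)),
    [[3.0, 3.0]],
])

# n_neighbors=10 is an arbitrary choice for this small example.
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)  # -1 marks outliers, 1 marks inliers

# negative_outlier_factor_ is the negated LOF score, so flip the sign:
# larger values then mean "more anomalous".
scores = -lof.negative_outlier_factor_
ranking = np.argsort(scores)[::-1]  # most anomalous rows first
```

X[ranking] then lists the points from most to least anomalous, without relying on cluster centers at all.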
Source: https://stackoverflow.com/questions/47489705/return-the-furthermost-outlier-in-kmeans-clustering