Outlier detection with k-means algorithm

Anonymous (unverified), submitted 2019-12-03 02:56:01

Question:

I am hoping you can help me with my problem. I am trying to detect outliers using the k-means algorithm. First I run the algorithm and flag as possible outliers those objects that lie far from their cluster center. Instead of the absolute distance I want to use the relative distance, i.e. the ratio of an object's absolute distance to its cluster center to the average distance of all objects in that cluster to the center. The code for outlier detection based on absolute distance is the following:

    # remove species from the data to cluster
    iris2 <- iris[,1:4]
    kmeans.result <- kmeans(iris2, centers=3)
    # cluster centers
    kmeans.result$centers
    # calculate distances between objects and cluster centers
    centers <- kmeans.result$centers[kmeans.result$cluster, ]
    distances <- sqrt(rowSums((iris2 - centers)^2))
    # pick top 5 largest distances
    outliers <- order(distances, decreasing=TRUE)[1:5]
    # who are outliers
    print(outliers)

But how can I use the relative instead of the absolute distance to find outliers?

Answer 1:

You just need the mean distance of the observations in each cluster from their cluster center. You already have the individual distances, so you only need to average them within each cluster. The rest is simple indexed division:

    # calculate mean distances by cluster:
    m <- tapply(distances, kmeans.result$cluster, mean)

    # divide each distance by the mean for its cluster:
    d <- distances/(m[kmeans.result$cluster])
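In symbols, the quantity computed above can be written as follows. The notation is introduced here for illustration only and is not from the original post: d_i is the Euclidean distance of observation i to its assigned center, k(i) is the cluster of observation i, and C_k is the set of observations in cluster k.

```latex
r_i = \frac{d_i}{\dfrac{1}{|C_{k(i)}|} \sum_{j \in C_{k(i)}} d_j}
```

A value of r_i above 1 means the observation lies farther from its center than the average member of its cluster.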

Your outliers:

    > d[order(d, decreasing=TRUE)][1:5]
           2        3        3        1        3
    2.706694 2.485078 2.462511 2.388035 2.354807
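The names printed above the values look like cluster labels rather than row numbers, so that output does not say which observations are the outliers. A minimal end-to-end sketch that returns the row indices instead could look like the following. The fixed seed and the stripping of names with as.numeric() are assumptions added here, not part of the original answer, and since k-means starts from random centers the exact rows can differ between runs and seeds.

```r
# Sketch: rank observations by relative distance to their k-means center.
set.seed(42)  # assumption: fixed seed so the clustering is repeatable

iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3)

# Center of the cluster each observation belongs to (one row per observation)
centers <- kmeans.result$centers[kmeans.result$cluster, ]

# Absolute Euclidean distance of each observation to its own center
distances <- as.numeric(sqrt(rowSums((as.matrix(iris2) - centers)^2)))

# Mean distance within each cluster, indexed by cluster label
m <- tapply(distances, kmeans.result$cluster, mean)

# Relative distance: absolute distance divided by the cluster's mean distance
d <- distances / as.numeric(m[as.character(kmeans.result$cluster)])

# Row indices of the 5 largest relative distances, and the matching rows
outliers <- order(d, decreasing = TRUE)[1:5]
print(outliers)
print(iris[outliers, ])
```

Indexing m by the cluster label as a character string avoids relying on the labels happening to match positions in m.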

