Matlab k-means cosine assigns everything to one cluster

Posted by 笑着哭i on 2019-12-11 11:09:14

Question


I'm using Matlab's built-in kmeans with 'Distance','cosine','EmptyAction','drop' on an L2-normalized feature matrix, and I have a problem: the output assigns EVERY data point to cluster 1, even with k=20, and all centroids in C are NaN. Does anyone have any suggestions as to what might be causing this?

The layout of the matrix is ([0,1,...,1,0,1],[...],[0,1,...,1,0,1]). I've done the L2-normalization using Python's numpy.linalg.norm before I passed the file to Matlab. This is the exact way I am running kmeans:

m=importdata('matrix.txt');
data=m'; % transpose so that rows are observations and columns are features, as kmeans expects
[L, C]=kmeans(data, 20, 'Distance', 'cosine', 'EmptyAction', 'drop')

Here is a sample of my normalized dataset:

10.3440804328
12.6885775404
15.5884572681
15.9059737206
17.4355957742
17.0
17.3493515729
17.3205080757
18.6279360102
19.7230829233
21.400934559
22.0
22.5831795813
23.0
24.0416305603
25.2388589282
26.8141753556
22.5388553392
9.2736184955
13.5277492585
15.2970585408

Any help or suggestions would be greatly appreciated. If you need more information let me know!


Answer 1:


The cosine distance is what makes it fail; it works with sqEuclidean. I think the cosine distance needs more information, or else it doesn't make sense on your data set.

Edit: I agree that the documentation is a little vague here, but the definition of cosine distance in Matlab's pdist function is: "One minus the cosine of the included angle between points (treated as vectors)."
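
The quoted definition can be sketched in a few lines. This is a minimal pure-Python illustration, not Matlab's implementation; the function name cosine_distance and the 2-D points are made up for the example, while the 1-D values are taken from the sample in the question. It suggests why one-column data is a problem: every positive 1-D point lies in the same direction, so the distance between any two of them is zero.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 2-D points: orthogonal directions give a nonzero distance.
d_2d = cosine_distance([1.0, 0.0], [0.0, 1.0])

# One-column data: each point is a 1-D vector (values from the question's sample).
sample = [10.3440804328, 17.0, 22.0, 26.8141753556]
d_1d = [cosine_distance([a], [b]) for a in sample for b in sample]

print(d_2d)                        # distance between the two orthogonal vectors
print(max(abs(d) for d in d_1d))   # effectively zero for every 1-D pair
```

If all pairwise distances are zero, no point is closer to one centroid than to another, which would be consistent with everything ending up in a single cluster and the remaining clusters being dropped.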

At first I took that to mean that the angle must be included in the data (I am assuming in the next column), but that seems to defeat the purpose of cosine similarity.

Edit again: It is more likely that "included" means "the included angle between 2 vectors". In that case I think cosine expects 2 or more columns to work on.
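
A guess about why the file might have only one column, not something the question confirms: numpy.linalg.norm returns the norm of a vector, not the normalized vector, and the squares of the sample values come out very close to whole numbers (17.0 squared is 289, 22.0 squared is 484), which is what the norms of 0/1 rows would look like. A minimal pure-Python sketch of the difference, where l2_norm, l2_normalize and the 0/1 row are hypothetical names and data for illustration:

```python
import math

def l2_norm(row):
    """Length of the row: a single number."""
    return math.sqrt(sum(x * x for x in row))

def l2_normalize(row):
    """Row divided by its length: same number of columns, unit length."""
    n = l2_norm(row)
    return [x / n for x in row]

# Hypothetical binary feature row with 289 ones and 11 zeros.
row = [1.0] * 289 + [0.0] * 11

norm = l2_norm(row)        # one scalar per row
unit = l2_normalize(row)   # still 300 columns

print(norm)                # 17.0, the same kind of value as in the sample
print(len(unit), l2_norm(unit))

# Squares of a few sample values from the question: all close to integers.
squares = [value ** 2 for value in [10.3440804328, 12.6885775404, 17.0, 22.0]]
for sq in squares:
    print(round(sq, 6))
```

If that guess is right, dividing each row by its norm instead of saving the norm itself would give kmeans a matrix with many columns to work on.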

Also, if you're already into Python, there are some good machine learning tools there as well. Here is one I have used. There is also MILK, but I have never used it myself.



Source: https://stackoverflow.com/questions/10503193/matlab-k-means-cosine-assigns-everything-to-one-cluster
