Question
I'm using Matlab's regular kmeans algorithm with 'Distance','cosine','EmptyAction','drop' on an L2-normalized feature matrix and I have a problem. The output that Matlab generates simply assigns EVERY datapoint to cluster 1.00000, even if k=20, and all centroids in C are NaN. Does anyone have any suggestions as to what might be causing this?
The layout of the matrix is ([0,1,...,1,0,1],[...],[0,1,...,1,0,1]). I did the L2-normalization using Python's numpy.linalg.norm before I passed the file to Matlab. This is the exact way I am running kmeans:
m=importdata('matrix.txt');
data=m'; % transpose, because kmeans treats columns as features instead of rows
[L, C]=kmeans(data, 20, 'Distance', 'cosine', 'EmptyAction', 'drop')
Here is a sample of my normalized dataset:
10.3440804328
12.6885775404
15.5884572681
15.9059737206
17.4355957742
17.0
17.3493515729
17.3205080757
18.6279360102
19.7230829233
21.400934559
22.0
22.5831795813
23.0
24.0416305603
25.2388589282
26.8141753556
22.5388553392
9.2736184955
13.5277492585
15.2970585408
Any help or suggestions would be greatly appreciated. If you need more information let me know!
Answer 1:
It is the cosine distance that is making it fail; it works with sqEuclidean. I think the cosine distance either needs more information or doesn't make sense on your data set.
Edit: I agree that the documentation is a little vague here, but the definition of cosine distance in Matlab's pdist function is: "One minus the cosine of the included angle between points (treated as vectors)."
I took that to mean the angle must be included in the data (I am assuming in the next column), but that seems to defeat the purpose (see cosine similarity). Edit again: I guess it is more likely that "included" means the included angle between two vectors. In that case I think cosine expects 2 or more columns to work on.
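To illustrate the point about needing 2 or more columns, here is a minimal sketch in plain Python. It is not Matlab's implementation: the function name cosine_distance and the two-column values are made up for illustration, and the one-column values are copied from the sample in the question. It only applies the "one minus the cosine of the angle" definition quoted above.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# One-column data: each observation is a single positive number
# (three values taken from the sample in the question).
one_column = [[10.3440804328], [17.0], [26.8141753556]]
d_single = [cosine_distance(one_column[0], p) for p in one_column[1:]]

# Two-column data (hypothetical values): directions can actually differ.
two_columns = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_multi = [cosine_distance(two_columns[0], p) for p in two_columns[1:]]

print(d_single)  # distances between one-column points
print(d_multi)   # distances between two-column points
```

With a single column of positive numbers, every point lies in the same direction, so every pairwise cosine distance comes out as zero (up to floating-point rounding). That leaves nothing for the cosine metric to separate, which would be consistent with all points landing in one cluster. With two or more columns the distances can differ.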
Also, if you're already using Python, there are some good machine learning tools there as well. Here is one I have used. There is also MILK, but I have never used it myself.
Source: https://stackoverflow.com/questions/10503193/matlab-k-means-cosine-assigns-everything-to-one-cluster