I have 2,000,000 points in 100-dimensional space. How can I cluster them into K (e.g., 1000) clusters?

Posted by 旧街凉风 on 2019-12-06 08:15:59

Question


The problem is as follows. I have M images and extract N features from each image, and each feature is a vector of dimensionality L. So I have M*N features (2,000,000 in my case), each of dimensionality L (100 in my case). I need to cluster these M*N features into K clusters. How can I do it? Thanks.


Answer 1:


Do you want 1000 clusters of images, of features, or of (image, feature) pairs?
In any case, it sounds as though you'll have to reduce the data and use simpler methods.

One possibility is two-pass clustering:
a) split the 2 million data points into 32 clusters,
b) split each of these into 32 more.
If this works, the resulting 32^2 = 1024 clusters might be good enough for your purpose.
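
The two-pass idea above can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain Lloyd's k-means written in NumPy, small synthetic data in place of the real features, and hypothetical helper names (`nearest_center`, `kmeans`, `two_pass_kmeans`) that do not come from the answer.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for each row of X (squared Euclidean)."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term does not affect argmin.
    scores = (centers ** 2).sum(axis=1) - 2.0 * X @ centers.T
    return scores.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper). Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster went empty
                centers[j] = members.mean(axis=0)
    return centers, nearest_center(X, centers)

def two_pass_kmeans(X, k1=32, k2=32, seed=0):
    """Pass 1: split X into k1 coarse clusters. Pass 2: split each into up to k2."""
    _, coarse = kmeans(X, k1, seed=seed)
    labels = np.empty(len(X), dtype=int)
    all_centers, offset = [], 0
    for c in range(k1):
        idx = np.flatnonzero(coarse == c)
        if len(idx) == 0:
            continue
        k = min(k2, len(idx))  # a tiny coarse cluster cannot be split k2 ways
        sub_centers, sub_labels = kmeans(X[idx], k, seed=seed + 1 + c)
        labels[idx] = offset + sub_labels  # make fine labels globally unique
        all_centers.append(sub_centers)
        offset += k
    return np.vstack(all_centers), labels

# Small synthetic stand-in for the 2,000,000 x 100 feature matrix.
X = np.random.default_rng(0).normal(size=(2000, 10))
centers, labels = two_pass_kmeans(X, k1=4, k2=4)
print(centers.shape, labels.shape)
```

With `k1=32` and `k2=32` this yields at most 32^2 = 1024 clusters. On the real data, a library implementation such as scikit-learn's `KMeans` or `MiniBatchKMeans` could stand in for the `kmeans` helper.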

Also, do you really need all 100 coordinates? Could you guess the 20 most important ones, or just try random subsets of 20?

There's a huge literature on this: a Google search for +image "dimension reduction" gives about 70,000 hits.
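
As a rough illustration of cutting 100 coordinates down to 20, the sketch below shows three options in NumPy: a random subset, the 20 highest-variance coordinates, and a PCA projection. Two assumptions are mine rather than the answer's: variance is used as a stand-in for "most important", and PCA is picked as one common dimension-reduction method. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: 5000 points, 100 coordinates with varying scales.
X = rng.normal(size=(5000, 100)) * np.linspace(0.1, 3.0, 100)

# Option 1: a random subset of 20 coordinates.
cols = rng.choice(X.shape[1], size=20, replace=False)
X_random = X[:, cols]

# Option 2: the 20 coordinates with the highest variance
# (assumption: variance as a proxy for "importance").
top = np.argsort(X.var(axis=0))[-20:]
X_topvar = X[:, top]

# Option 3: PCA via the 100 x 100 covariance matrix, which stays cheap
# even when the number of points is in the millions.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
components = eigvecs[:, -20:]           # directions of largest variance
X_pca = Xc @ components

print(X_random.shape, X_topvar.shape, X_pca.shape)
```

Any of the three reduced matrices can then be fed to the clustering step in place of the full 100-dimensional data. Whether the clusters remain useful after the reduction has to be checked on the real features.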




Answer 2:


You've tagged the question "k-means". Why can't you use k-means? Is this a question of efficiency? (Personally, I've only used k-means in 2 dimensions.) Or is it a question of how to implement the k-means algorithm?

Are your values discrete (e.g. categories) or continuous (e.g. coordinate values)? If the latter, then k-means should be fine, as far as I understand. Clustering discrete values will require a different algorithm, perhaps hierarchical clustering.




Answer 3:


The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.




Answer 4:


A good trick when clustering millions of points is to sample them, cluster the sample, and then assign each remaining point to the nearest of the existing clusters.
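
A minimal sketch of this sample-then-assign trick, assuming plain Lloyd's k-means on the sample and nearest-center assignment for everything else. The helper names (`nearest_center`, `kmeans`), the sample size, and the chunk size are illustrative choices, and the data is synthetic.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for each row of X (squared Euclidean)."""
    # The ||x||^2 term is constant per row, so it is dropped before argmin.
    scores = (centers ** 2).sum(axis=1) - 2.0 * X @ centers.T
    return scores.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper). Returns the k centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster went empty
                centers[j] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 16))  # synthetic stand-in for the full dataset

# 1) Draw a random sample and cluster only the sample.
sample_idx = rng.choice(len(X), size=2000, replace=False)
centers = kmeans(X[sample_idx], k=10)

# 2) Assign every point to its nearest center, chunk by chunk,
#    so the distance matrix never has to hold all points at once.
labels = np.empty(len(X), dtype=int)
chunk_size = 4096
for start in range(0, len(X), chunk_size):
    chunk = X[start:start + chunk_size]
    labels[start:start + chunk_size] = nearest_center(chunk, centers)

print(centers.shape, np.bincount(labels, minlength=len(centers)))
```

The centers are fitted on the sample only, so the sample needs to be large enough to contain several points from each of the K clusters you hope to find.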



Source: https://stackoverflow.com/questions/4153981/i-have-2-000-000-points-in-100-dimensionality-space-how-can-i-cluster-them-to-k
