I have 2,000,000 points in a 100-dimensional space. How can I cluster them into K (e.g., 1000) clusters?

Posted by 旧巷老猫 on 2019-12-04 12:22:02

Do you want 1000 clusters of images, of features, or of (image, feature) pairs?
In any case, it sounds as though you'll have to reduce the data and use simpler methods.

One possibility is two-pass k-means clustering:
a) split the 2 million data points into 32 clusters,
b) split each of these into 32 more.
If this works, the resulting 32^2 = 1024 clusters might be good enough for your purpose.
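
The two-pass idea above can be sketched roughly as follows. This is only an illustration under assumptions: it uses a bare-bones Lloyd's k-means written in NumPy for both passes, and the names `kmeans`, `assign` and `two_pass_kmeans` are made up for this example. The demo runs on a small random array, not on 2 million points.

```python
import numpy as np

def assign(X, centroids):
    # Nearest centroid per row. ||x||^2 is constant per row, so it is
    # dropped: argmin over (-2 x.c + ||c||^2) gives the same answer.
    d2 = -2.0 * X @ centroids.T + (centroids ** 2).sum(axis=1)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means (assumed here); returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(X, centroids)

def two_pass_kmeans(X, k1=32, k2=32, seed=0):
    """Pass a) k1 coarse clusters; pass b) split each into up to k2 more."""
    X = np.asarray(X, dtype=float)
    _, coarse = kmeans(X, k1, seed=seed)
    labels = np.empty(len(X), dtype=np.int64)
    for c in range(k1):
        idx = np.flatnonzero(coarse == c)
        if len(idx) == 0:
            continue
        k = min(k2, len(idx))  # a tiny coarse cluster cannot be split k2 ways
        _, fine = kmeans(X[idx], k, seed=seed + c + 1)
        labels[idx] = c * k2 + fine  # final label in [0, k1 * k2)
    return labels

# Small demo: 5000 random points in 100 dimensions, 4 x 4 = 16 clusters.
X = np.random.default_rng(0).normal(size=(5000, 100))
labels = two_pass_kmeans(X, k1=4, k2=4)
print(len(np.unique(labels)), "non-empty clusters")
```

Each second-pass run only sees the points of one coarse cluster, so the second pass as a whole compares each point against about 32 centroids instead of 1024.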

Then, do you really need all 100 coordinates? Could you guess the 20 most important ones, or just try random subsets of 20?
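
As a rough sketch of both options: the first helper below picks a random subset of 20 coordinates; the second "guesses" the important ones by keeping the 20 highest-variance coordinates. The variance criterion is an assumption made for this example, not something the answer prescribes, and the function names are hypothetical.

```python
import numpy as np

def random_columns(X, n_keep=20, seed=0):
    """Keep a random subset of n_keep coordinates."""
    rng = np.random.default_rng(seed)
    cols = np.sort(rng.choice(X.shape[1], size=n_keep, replace=False))
    return X[:, cols], cols

def top_variance_columns(X, n_keep=20):
    """Keep the n_keep coordinates with the largest variance
    (one possible heuristic for 'most important')."""
    cols = np.sort(np.argsort(X.var(axis=0))[-n_keep:])
    return X[:, cols], cols

# Demo on random data where the first 20 coordinates are scaled up.
X = np.random.default_rng(0).normal(size=(1000, 100))
X[:, :20] *= 10.0
X_rand, rand_cols = random_columns(X)
X_var, var_cols = top_variance_columns(X)
print(X_rand.shape, X_var.shape)
```

Trying several random subsets (different `seed` values) and comparing the resulting clusterings is a cheap way to see how much the choice of coordinates matters.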

There's a huge literature on this: a Google search for +image "dimension reduction" returns roughly 70,000 hits.

You've tagged the question "k-means". Why can't you use k-means? Is this a question of efficiency? (Personally, I've only used k-means in 2 dimensions.) Or is it a question of how to implement the k-means algorithm?

Are your values discrete (e.g. categories) or continuous (e.g. coordinate values)? If the latter, then k-means should be fine, as far as I understand. Clustering discrete values will require a different algorithm, perhaps hierarchical clustering.

The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.

A good trick when clustering millions of points is to sample them, cluster the sample, and then assign the remaining points to the clusters found on the sample.
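
A minimal sketch of that sampling trick, assuming plain Lloyd's k-means on the sample and nearest-centroid assignment for everything else. The function names and the chunk size are hypothetical choices for this example, and the demo data is small and random.

```python
import numpy as np

def nearest_centroid(X, centroids):
    # argmin over (-2 x.c + ||c||^2); ||x||^2 is constant per row.
    d2 = -2.0 * X @ centroids.T + (centroids ** 2).sum(axis=1)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means (assumed here); returns the centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def cluster_by_sampling(X, k, sample_size, chunk=100_000, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1) cluster only a random sample of the points
    sample_idx = rng.choice(len(X), size=sample_size, replace=False)
    centroids = kmeans(X[sample_idx], k, seed=seed)
    # 2) assign every point to its nearest centroid, chunk by chunk,
    #    so the distance matrix never exceeds chunk x k entries
    labels = np.empty(len(X), dtype=np.int64)
    for start in range(0, len(X), chunk):
        block = X[start:start + chunk]
        labels[start:start + chunk] = nearest_centroid(block, centroids)
    return centroids, labels

# Demo: cluster a 2000-point sample, then label all 20000 points.
X = np.random.default_rng(0).normal(size=(20000, 100))
centroids, labels = cluster_by_sampling(X, k=50, sample_size=2000, chunk=5000)
print(centroids.shape, labels.shape)
```

The expensive iterative part runs only on the sample; the full data set is touched once, in a single pass that could also be streamed from disk.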
