I have 2,000,000 points in a 100-dimensional space. How can I cluster them into K (e.g., 1000) clusters?

The problem arises as follows. I have M images and extract N features from each image, and each feature is an L-dimensional vector. Thus, I have M*N features (2,000,000 in my case), each with L dimensions (100 in my case). I need to cluster these M*N features into K clusters. How can I do it? Thanks.

Do you want 1000 clusters of images, of features, or of (image, feature) pairs?
In any case, it sounds as though you'll have to reduce the data and use simpler methods.

One possibility is two-pass K-cluster:
a) split the 2 million data points into 32 clusters,
b) split each of these into 32 more.
If this works, the resulting 32^2 = 1024 clusters might be good enough for your purpose.
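The two passes above could be sketched roughly as follows. This is only an illustration in NumPy under assumptions of my own: the helper names (`assign`, `kmeans`, `two_pass_kmeans`) are made up, the k-means is a plain Lloyd iteration with random initialisation, and the demo uses small random data rather than the real 2,000,000 x 100 matrix.

```python
import numpy as np

def assign(X, centroids):
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, then nearest centroid.
    d2 = ((X ** 2).sum(axis=1)[:, None]
          - 2.0 * X @ centroids.T
          + (centroids ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    # Plain Lloyd's k-means (hypothetical helper); returns (centroids, labels).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, assign(X, centroids)

def two_pass_kmeans(X, k1, k2, seed=0):
    # Pass a): split everything into k1 coarse clusters.
    _, coarse = kmeans(X, k1, seed=seed)
    labels = np.empty(len(X), dtype=np.int64)
    # Pass b): split each coarse cluster into (at most) k2 sub-clusters.
    for c in range(k1):
        idx = np.flatnonzero(coarse == c)
        if len(idx) == 0:
            continue
        _, fine = kmeans(X[idx], min(k2, len(idx)), seed=seed + 1 + c)
        labels[idx] = c * k2 + fine  # final label in [0, k1 * k2)
    return labels

# Small stand-in for the real data (assumption: 2000 points, 10 dimensions).
X = np.random.default_rng(42).normal(size=(2000, 10))
labels = two_pass_kmeans(X, k1=4, k2=4)
```

With k1 = k2 = 32 this gives up to 1024 final labels. On the full data set the distance matrix in `assign` would probably need to be computed in chunks to keep memory in check.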

Then, do you really need all 100 coordinates? Could you guess the 20 most important ones, or just try random subsets of 20?
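As a rough sketch of both ideas, assuming the features sit in a NumPy array: the first variant uses per-coordinate variance as one possible stand-in for "most important" (that choice is my assumption, not something established above), the second just draws 20 coordinates at random. The array here is random demo data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the real feature matrix (assumption: 5000 points, 100 dimensions).
X = rng.normal(size=(5000, 100))

# Variant 1: keep the 20 coordinates with the largest variance.
# Variance is only a guess at "importance".
top_cols = np.argsort(X.var(axis=0))[-20:]
X_top = X[:, top_cols]

# Variant 2: keep a random subset of 20 distinct coordinates.
rand_cols = rng.choice(X.shape[1], size=20, replace=False)
X_rand = X[:, rand_cols]
```

Either reduced matrix can then be clustered in place of the full one, and trying several random subsets shows how sensitive the clusters are to the choice.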

There's a huge literature: Google +image "dimension reduction" gives ~ 70000 hits.

You've tagged the question "k-means". Why can't you use k-means? Is this a question of efficiency? (Personally, I've only used k-means in 2 dimensions.) Or is it a question of how to implement the k-means algorithm?

Are your values discrete (e.g., categories) or continuous (e.g., a coordinate value)? If the latter, then k-means should be fine, as far as I understand. For clustering discrete values, a different algorithm will be required, perhaps hierarchical clustering?

The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.

A good trick when clustering millions of points is to sample them, cluster the sample, and then assign the remaining points to the resulting clusters.
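A minimal sketch of that trick, assuming plain k-means on the sample and nearest-centroid assignment for everything else. The function names and the chunk size are made up for illustration, and the demo data is random.

```python
import numpy as np

def nearest_centroid(X, centroids, chunk=100_000):
    # Assign each row of X to its nearest centroid, chunk by chunk,
    # so the full distance matrix never has to fit in memory at once.
    labels = np.empty(len(X), dtype=np.int64)
    c2 = (centroids ** 2).sum(axis=1)
    for start in range(0, len(X), chunk):
        block = X[start:start + chunk]
        # ||x||^2 is constant per row, so it can be dropped for the argmin.
        d2 = c2[None, :] - 2.0 * block @ centroids.T
        labels[start:start + chunk] = d2.argmin(axis=1)
    return labels

def kmeans(X, k, n_iter=20, seed=0):
    # Plain Lloyd's k-means (hypothetical helper); returns the centroids.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def cluster_by_sampling(X, k, sample_size, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Draw a random sample, 2) cluster only the sample,
    # 3) assign every point (sampled or not) to the nearest centroid.
    sample_idx = rng.choice(len(X), size=sample_size, replace=False)
    centroids = kmeans(X[sample_idx], k, seed=seed)
    return centroids, nearest_centroid(X, centroids)

# Small stand-in for the real data (assumption: 5000 points, 10 dimensions).
X = np.random.default_rng(1).normal(size=(5000, 10))
centroids, labels = cluster_by_sampling(X, k=8, sample_size=500)
```

The sample has to be large relative to K for this to work well; for K = 1000 a sample of a few hundred points per cluster might be a reasonable starting point, though that figure is a guess.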

Source: https://stackoverflow.com/questions/4153981/i-have-2-000-000-points-in-100-dimensionality-space-how-can-i-cluster-them-to-k

Tags

cluster-analysis

k-means