partially define initial centroid for scikit-learn K-Means clustering

Submitted by 谁都会走 on 2021-02-08 10:56:35

Question


Scikit documentation states that:

Method for initialization:

‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, but the 7th dimension to be initialized freely by k-means++. (In other words, I do not want to fully specify the initial centroids; I want to control 6 of the dimensions and leave only one dimension free to vary in the initial clusters.)

I tried passing a 10x6 array in the hope it would work, but it just throws an error.
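For reference, the failing attempt presumably looked something like this sketch (variable names are mine); KMeans validates that an array-like init has exactly n_features columns, so a (10, 6) array is rejected when the data has 7 features:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(1000, 7)           # placeholder data: 1000 samples, 7 features
partial_init = np.random.randn(10, 6)  # 10 centroids, but only 6 of the 7 features

# raises ValueError: the shape of the initial centers does not match
# the number of features of the data
km = KMeans(n_clusters=10, init=partial_init, n_init=1)
km.fit(X)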


Answer 1:


Sklearn does not allow you to perform this kind of fine-grained operation.

The only possibility is to provide the 7th feature value yourself, either random or similar to what k-means++ would have produced.

So basically you can estimate a good value for this as follows:

import numpy as np
from sklearn.cluster import KMeans

nb_clust = 10
# your data: 1000 samples with 7 features
X = np.random.randn(1000, 7)

# your 6-column (partial) centroids
cent_6cols = np.random.randn(nb_clust, 6)

# artificially fix your centroids
# (assigning cluster_centers_ directly relies on sklearn internals,
#  so this step may not work with every sklearn version)
km = KMeans(n_clusters=nb_clust)
km.cluster_centers_ = cent_6cols

# find the points lying in each cluster given your initialization
initial_prediction = km.predict(X[:, 0:6])

# for the 7th column, use the average value of the points lying
# in the cluster given by your partial centroids
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    cent_7cols[i, 6] = X[initial_prediction == i, 6].mean()

# the 7th column is now initialized in a k-means++-like fashion,
# so you can use cent_7cols as your initial centroids
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
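To complete the picture, fitting then proceeds as usual (a minimal usage sketch; the final centers will generally drift away from cent_7cols as k-means iterates):

truekm.fit(X)
labels = truekm.labels_
print(truekm.cluster_centers_.shape)   # (10, 7)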



Answer 2:


That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.

In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to the arithmetic mean of the assigned data.
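A minimal sketch of that partition-then-average idea, assuming X and cent_6cols as defined in the first answer and using sklearn's pairwise_distances_argmin for the nearest-center step:

import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# assign each point to the nearest partial centroid, ignoring the last column
labels = pairwise_distances_argmin(X[:, :6], cent_6cols)

# fill the last column of each centroid with the mean of its assigned points
# (a cluster that received no points would produce NaN here)
last_col = np.array([X[labels == i, 6].mean() for i in range(cent_6cols.shape[0])])
cent_7cols = np.hstack([cent_6cols, last_col[:, None]])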

However, sklearn is open source.

So get the source code, and modify k-means. Initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way - but it's very hard to design an efficient API that allows such customizations through trivial parameters - use the source code to customize at this level.
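As a rough illustration only (a plain NumPy sketch of the suggested modification, not sklearn's actual implementation), a Lloyd loop that keeps the first six coordinates of every center fixed and re-estimates only the last one could look like this:

import numpy as np
from sklearn.metrics import pairwise_distances_argmin

def kmeans_fixed_6cols(X, fixed_centers_6, n_iter=100, seed=None):
    """Lloyd iterations where columns 0..5 of the centers stay fixed
    and only column 6 is re-estimated in each round."""
    rng = np.random.default_rng(seed)
    k = fixed_centers_6.shape[0]
    # random start for the free 7th coordinate
    centers = np.hstack([fixed_centers_6, rng.standard_normal((k, 1))])
    labels = pairwise_distances_argmin(X, centers)
    for _ in range(n_iter):
        new_last = centers[:, 6].copy()
        for i in range(k):
            if np.any(labels == i):
                new_last[i] = X[labels == i, 6].mean()
        if np.allclose(new_last, centers[:, 6]):
            break
        centers[:, 6] = new_last
        labels = pairwise_distances_argmin(X, centers)
    return centers, labels

centers, labels = kmeans_fixed_6cols(X, cent_6cols, seed=0)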



Source: https://stackoverflow.com/questions/53045859/partially-define-initial-centroid-for-scikit-learn-k-means-clustering
