How to assign a new observation to existing KMeans clusters based on nearest cluster centroid logic in Python?

Submitted by 时光怂恿深爱的人放手 on 2019-12-30 11:17:08

Question


I used the code below to create k-means clusters using scikit-learn.

kmean = KMeans(n_clusters=nclusters,n_jobs=-1,random_state=2376,max_iter=1000,n_init=1000,algorithm='full',init='k-means++')

kmean_fit = kmean.fit(clus_data)

I have also saved the centroids using kmean_fit.cluster_centers_.

I then pickled the KMeans object.

filename = pickle_path+'\\'+'_kmean_fit.sav'
pickle.dump(kmean_fit, open(filename, 'wb'))

This is so that I can load the same pickled KMeans object and apply it to new data as it arrives, using kmean_fit.predict().

Questions:

  1. Will loading the pickled KMeans object and calling kmean_fit.predict() assign the new observation to the existing clusters based on their centroids, or does this approach recluster from scratch on the new data?

  2. If this method won't work, how can I assign the new observation to the existing clusters with efficient Python code, given that I have already saved the cluster centroids?

PS: I know that building a classifier with the existing clusters as the dependent variable is another way, but I don't want to do that because of a time crunch.


Answer 1:


Yes. The predict method assigns each new observation to the nearest of the existing centroids; it does not refit the model or recluster the data. Pickling makes no difference: if you unpickle the sklearn.cluster.KMeans object correctly, you are working with the same fitted model, cluster centers included.

An example:

from sklearn.cluster import KMeans
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package

model = KMeans(n_clusters = 2, random_state = 100)
X = [[0,0,1,0], [1,0,0,1], [0,0,0,1],[1,1,1,0],[0,0,0,0]]
model.fit(X)

Out:

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=100, tol=0.0001,
    verbose=0)

Continue:

joblib.dump(model, 'model.pkl')  
model_loaded = joblib.load('model.pkl')

model_loaded

Out:

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=100, tol=0.0001,
    verbose=0)

See how the n_clusters and random_state parameters are the same between the model and model_loaded objects? You're good to go.
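Matching constructor parameters are only a weak check, since what matters for prediction is the fitted centroids. A stricter comparison, sketched below under the assumption that a recent scikit-learn and NumPy are installed, round-trips the model through pickle in memory and compares cluster_centers_ and the predicted labels directly. The variable names centroids_match and labels_match are illustrative only.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0]])
model = KMeans(n_clusters=2, n_init=10, random_state=100).fit(X)

# Round-trip through pickle in memory; dumping to a file behaves the same way.
model_loaded = pickle.loads(pickle.dumps(model))

# The fitted centroids, not just the constructor parameters, should survive.
centroids_match = np.array_equal(model.cluster_centers_, model_loaded.cluster_centers_)

# Both objects should therefore assign the same labels to the same data.
labels_match = np.array_equal(model.predict(X), model_loaded.predict(X))

print(centroids_match, labels_match)
```

If either value is False, the file was probably written or read by a different model than intended.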

Predict with the "new" model:

model_loaded.predict([[0,0,0,0]])  # predict expects a 2D array: one row per observation

Out[64]: array([0])
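For the second question, where only the centroids were saved, the same nearest-centroid assignment can be done without the KMeans object. The following is a minimal NumPy sketch, not scikit-learn's own implementation: the helper name assign_to_nearest_centroid and the sample centroids are hypothetical, standing in for a saved kmean_fit.cluster_centers_ array, and plain Euclidean distance is assumed.

```python
import numpy as np


def assign_to_nearest_centroid(new_points, centroids):
    """Return, for each point, the row index of the closest centroid."""
    new_points = np.atleast_2d(np.asarray(new_points, dtype=float))
    centroids = np.asarray(centroids, dtype=float)
    # Broadcast to shape (n_points, n_clusters, n_features), then reduce
    # to squared Euclidean distances of shape (n_points, n_clusters).
    diffs = new_points[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)


# Hypothetical centroids; in practice, load the saved cluster_centers_ array.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_to_nearest_centroid([[1.0, 2.0], [9.0, 8.0], [0.0, 0.0]], centroids)
print(labels)  # [0 1 0]
```

The returned indices line up with the labels from predict only if the centroid rows are kept in the same order as in cluster_centers_. Any scaling or other preprocessing applied before fitting has to be applied to the new observations as well.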


Source: https://stackoverflow.com/questions/43257975/how-to-assign-an-new-observation-to-existing-kmeans-clusters-based-on-nearest-cl
