Question
I have built a segmentation model using k-means clustering.
Could anybody describe the process for assigning new data into these segments?
Currently I apply the same transformations, standardisation and outlier handling that I used to build the model, and then calculate the Euclidean distance from each new record to every cluster mean. The record is assigned to the segment with the smallest distance.
However, I am seeing the majority of new records fall into one particular segment, and I am wondering whether I have missed something along the way.
Thanks
Answer 1:
Classifying a new observation by Euclidean distance to the nearest mean may work in some scenarios, but it ignores the shape and size of the original clusters.
One way around this is to use the original cluster data (the records used to build the model, together with their segment labels) to help classify each new observation, e.g. using KNN: http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
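For reference, here is a minimal sketch of the nearest-mean rule described above, using only the Python standard library. The scaling parameters, centroids and record values are made-up numbers for illustration, and the sketch assumes the means and standard deviations were saved when the model was built rather than recomputed on the new data.

```python
import math

def standardise(row, means, stds):
    """Scale a raw record with the means/stds saved from the training data."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def assign_to_nearest(row, centroids):
    """Return (index, distance) of the closest centroid by Euclidean distance."""
    distances = [math.dist(row, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

# Hypothetical values saved at model-building time (not from the original post).
train_means = [50.0, 200.0]
train_stds = [10.0, 40.0]
centroids = [[-1.0, -1.0], [0.0, 0.0], [1.5, 1.0]]  # in standardised space

new_record = [64.0, 245.0]
scaled = standardise(new_record, train_means, train_stds)
segment, distance = assign_to_nearest(scaled, centroids)
print(segment, round(distance, 3))
```

Note that the rule only looks at the distance to each mean, so a tight cluster and a widely spread cluster are treated identically.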
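A rough sketch of the KNN idea, assuming you still have the original records and the segment each one was assigned to. The toy data, the `knn_segment` helper and the choice of `k=3` are illustrative assumptions, not part of the original model.

```python
import math
from collections import Counter

def knn_segment(row, train_rows, train_labels, k=3):
    """Vote among the k training records closest to `row`."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(row, pair[0]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def mean_of(rows):
    """Column-wise mean of a list of records."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy data: segment 0 is tight, segment 1 is widely spread.
segment_0 = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]]
segment_1 = [[2.0, 0.0], [2.2, 0.5], [2.0, -0.5],
             [6.0, 0.0], [6.0, 1.0], [5.8, -1.0]]
train_rows = segment_0 + segment_1
train_labels = [0] * len(segment_0) + [1] * len(segment_1)

new_row = [1.7, 0.0]

# Nearest-mean rule: compares only the distance to each cluster mean.
means = [mean_of(segment_0), mean_of(segment_1)]
nearest_mean_segment = min(range(2), key=lambda i: math.dist(new_row, means[i]))

# KNN rule: compares against the actual training records.
knn_result = knn_segment(new_row, train_rows, train_labels, k=3)
print(nearest_mean_segment, knn_result)
```

In this toy case the new record is closer to the mean of the tight segment, yet its nearest training records all belong to the spread-out segment, so the two rules disagree.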
Alternatively, you might consider a different clustering technique, such as a mixture of Gaussians:
http://en.wikipedia.org/wiki/Mixture_model
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
Using this, you will get not only a mean for each cluster but also a variance. For each new observation, you can then compute the probability that it belongs to each cluster. That probability takes the original cluster size and shape into account. This type of "soft" approach is also nicer to work with because it tells you how strongly each new observation belongs to each cluster, and you can do things like tag as outliers any observations that are more than some number of standard deviations away from all clusters.
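A minimal sketch of the soft assignment step, assuming the mixture has already been fitted (the fitting itself, usually done with EM, is not shown). The weights, means and variances are made-up values, the covariance is assumed diagonal to keep the arithmetic simple, and the helper names and the 3-standard-deviation threshold are illustrative choices.

```python
import math

def log_gaussian(row, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(row, mean, var)
    )

def membership_probabilities(row, weights, means, variances):
    """Posterior probability that `row` belongs to each component."""
    logs = [
        math.log(w) + log_gaussian(row, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def scaled_distance(row, mean, var):
    """Distance from the component mean, in units of standard deviations."""
    return math.sqrt(sum((x - m) ** 2 / v for x, m, v in zip(row, mean, var)))

def is_outlier(row, means, variances, max_sd=3.0):
    """True if `row` is more than max_sd away from every component."""
    return all(scaled_distance(row, m, v) > max_sd
               for m, v in zip(means, variances))

# Hypothetical fitted parameters: a tight component and a wide one.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 0.0]]
variances = [[0.01, 0.01], [4.0, 1.0]]

new_row = [1.7, 0.0]
probs = membership_probabilities(new_row, weights, means, variances)
print(probs, is_outlier(new_row, means, variances))
```

Here the new record is nearer to the first mean in plain Euclidean terms, but the first component is so tight that almost all of the probability goes to the second, wider component.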
Source: https://stackoverflow.com/questions/18131173/how-to-segment-new-data-with-existing-k-means-model