Question
I've been using scipy's k-means for quite some time now, and I'm pretty happy with the way it works in terms of usability and efficiency. However, I now want to explore different k-means variants; more specifically, I'd like to apply spherical k-means to some of my problems.
Do you know any good Python implementation (i.e. similar to scipy's k-means) of spherical k-means? If not, how hard would it be to modify scipy's source code to adapt its k-means algorithm to be spherical?
Thank you.
Answer 1:
In spherical k-means, you aim to guarantee that the centers are on the sphere, so you could adjust the algorithm to use the cosine distance, and should additionally normalize the centroids of the final result.
When using the Euclidean distance, I prefer to think of the algorithm as projecting the cluster centers onto the unit sphere in each iteration, i.e., the centers should be normalized after each maximization step.
Indeed, when the centers and data points are both normalized, there is a one-to-one relationship between the cosine distance and the squared Euclidean distance:
||a - b||_2^2 = 2 * (1 - cos(a, b))
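As a quick sanity check of that relationship, here is a minimal NumPy sketch. The vectors are random and the variable names are only illustrative; the point is that for unit vectors the squared Euclidean distance equals 2 * (1 - cos(a, b)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, projected onto the unit sphere.
a = rng.normal(size=5)
b = rng.normal(size=5)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_ab = float(a @ b)                    # cosine similarity of unit vectors
sq_euclid = float(np.sum((a - b) ** 2))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = bool(np.isclose(sq_euclid, 2.0 * (1.0 - cos_ab)))
print(identity_holds)
```

Because the two quantities are monotonically related, assigning a point to the nearest center gives the same result under either distance once everything is normalized.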
The package jasonlaska/spherecluster modifies scikit-learn's k-means into spherical k-means and also provides another sphere clustering algorithm.
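If you just want to see the idea in code, the procedure described in this answer (assign by cosine similarity, then re-normalize the centers after every update) might look roughly like the sketch below. This is not the scipy or spherecluster implementation; the function name spherical_kmeans, its parameters, and the toy data are all made up for illustration, and it assumes no input row is all zeros.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Illustrative spherical k-means (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Project the data onto the unit sphere (assumes no all-zero rows).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialise the centers from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: highest cosine similarity = smallest cosine distance.
        labels = np.argmax(X @ centers.T, axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old center for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Update, then project the center back onto the sphere.
                new_centers[j] = total / norm
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers.
    labels = np.argmax(X @ centers.T, axis=1)
    return centers, labels

# Toy data: two tight groups pointing along the x and y axes.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[5, 0, 0], scale=0.1, size=(20, 3)),
    rng.normal(loc=[0, 5, 0], scale=0.1, size=(20, 3)),
])
centers, labels = spherical_kmeans(X, k=2)
```

The random initialisation is the simplest thing that works here; a real implementation would typically use something like k-means++ and several restarts.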
Answer 2:
It looks like the salient feature in the spherical k-means is the use of the cosine distance, instead of the standard Euclidean metric. With that being said, there is a nice pure numpy/scipy adaptation here on SO in another answer:
Is it possible to specify your own distance function using Scikits.Learn K-Means Clustering?
If that doesn't meet what you are looking for you might want to try sklearn.cluster.
Answer 3:
Here's how you do it if you have polar coordinates on a 3D sphere, such as (lat, lon) pairs:
If your coordinates are (lat, lon) pairs measured in degrees, you can write a function that converts these points into cartesian coordinates, like:

    import numpy as np

    def cartesian_encoder(coord, r_E=6371):
        """Convert lat/lon to cartesian points on Earth's surface.

        Input
        -----
            coord : numpy 2darray (size=(N, 2))
            r_E : radius of Earth

        Output
        ------
            out : numpy 2darray (size=(N, 3))
        """
        def _to_rad(deg):
            return deg * np.pi / 180.

        theta = _to_rad(coord[:, 0])  # lat [radians]
        phi = _to_rad(coord[:, 1])    # lon [radians]

        x = r_E * np.cos(phi) * np.cos(theta)
        y = r_E * np.sin(phi) * np.cos(theta)
        z = r_E * np.sin(theta)

        return np.concatenate([x.reshape(-1, 1), y.reshape(-1, 1), z.reshape(-1, 1)], axis=1)
If your coordinates are already in radians, skip the _to_rad conversion in that function and use the two columns directly as theta and phi.
Install the spherecluster package with pip. If your polar data, given as rows of (lat, lon) pairs, is called X and you want to find 10 clusters in it, the final code for spherical k-means clustering will be:

    import numpy as np
    from spherecluster import SphericalKMeans

    X_cart = cartesian_encoder(X)
    kmeans_labels = SphericalKMeans(n_clusters=10).fit_predict(X_cart)
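If you also want the cluster centers back as (lat, lon) in degrees, you can invert the conversion above. The sketch below is only an illustration: cartesian_decoder is a hypothetical helper name, and it assumes the centers are available as an (N, 3) array of cartesian points (for a scikit-learn style estimator that would usually be the cluster_centers_ attribute, but check that for the version you have installed).

```python
import numpy as np

def cartesian_decoder(xyz):
    """Hypothetical inverse of cartesian_encoder.

    Converts (N, 3) cartesian points to (N, 2) lat/lon pairs in degrees.
    """
    xyz = np.asarray(xyz, dtype=float)
    r = np.linalg.norm(xyz, axis=1)  # works for any radius, not only r_E
    lat = np.degrees(np.arcsin(xyz[:, 2] / r))
    lon = np.degrees(np.arctan2(xyz[:, 1], xyz[:, 0]))
    return np.column_stack([lat, lon])

# Example with three hand-picked unit vectors (stand-ins for cluster centers).
centers_xyz = np.array([
    [1.0, 0.0, 0.0],  # equator, prime meridian
    [0.0, 0.0, 1.0],  # north pole
    [0.0, 1.0, 0.0],  # equator, 90 degrees east
])
centers_latlon = cartesian_decoder(centers_xyz)
```

Since the decoder divides by the vector length, it does not matter whether the centers are unit vectors or scaled by the Earth's radius.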
Source: https://stackoverflow.com/questions/19226925/spherical-k-means-implementation-in-python