Question
I am using the KMeans class from sklearn.cluster. Once the clustering has finished, how can I find out which values were grouped together?
Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?
Is there a function that takes a cluster id and lists all the data points in that cluster?
Thanks.
Answer 1:
I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the cluster labels as columns.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('filename')
km = KMeans(n_clusters=5).fit(data)
cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_
Once the DataFrame is available, it is quite easy to filter. For example, to select all data points in cluster 3:
cluster_map[cluster_map.cluster == 3]
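The filter above returns the index values and labels, not the original rows. One way to get back to the rows themselves is to pass the stored index values to DataFrame.loc. This is only a minimal sketch: the synthetic data, the column names x and y, and the n_init and random_state settings are assumptions for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for pd.read_csv('filename'): 100 random 2-D points.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x", "y"])

# n_init and random_state are set only to make the sketch reproducible.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)

cluster_map = pd.DataFrame({
    "data_index": data.index.values,
    "cluster": km.labels_,
})

# Index values of the rows assigned to cluster 3.
idx = cluster_map.loc[cluster_map["cluster"] == 3, "data_index"]

# The original rows that belong to cluster 3.
cluster_3_rows = data.loc[idx]
print(len(cluster_3_rows), "rows in cluster 3")
```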
Answer 2:
If you have a large dataset and need to extract clusters on demand, you'll see some speed-up using numpy.where. Here is an example on the iris dataset:
from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
km = KMeans(n_clusters=3)
km.fit(X)
Define a function to extract the indices for the cluster id you provide. (Here are two functions, for benchmarking; they both return the same values.)
def ClusterIndicesNumpy(clustNum, labels_array):  # numpy
    return np.where(labels_array == clustNum)[0]

def ClusterIndicesComp(clustNum, labels_array):  # list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])
Let's say you want all samples that are in cluster 2:
ClusterIndicesNumpy(2, km.labels_)
array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])
Numpy wins the benchmark:
%timeit ClusterIndicesNumpy(2,km.labels_)
100000 loops, best of 3: 4 µs per loop
%timeit ClusterIndicesComp(2,km.labels_)
1000 loops, best of 3: 479 µs per loop
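The %timeit lines above are IPython magics. Outside IPython, a similar comparison can be sketched with the standard timeit module. The snake_case function names, the call count of 1000, and the n_init and random_state settings are assumptions for this sketch. Absolute timings will differ by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data
# n_init and random_state are set only to make the sketch reproducible.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

def cluster_indices_numpy(clust_num, labels_array):
    # Vectorized lookup of the positions whose label equals clust_num.
    return np.where(labels_array == clust_num)[0]

def cluster_indices_comp(clust_num, labels_array):
    # Same result, built with a Python-level loop.
    return np.array([i for i, x in enumerate(labels_array) if x == clust_num])

n_calls = 1000
t_numpy = timeit.timeit(lambda: cluster_indices_numpy(2, km.labels_), number=n_calls)
t_comp = timeit.timeit(lambda: cluster_indices_comp(2, km.labels_), number=n_calls)

print(f"numpy.where:        {t_numpy / n_calls * 1e6:.1f} us per call")
print(f"list comprehension: {t_comp / n_calls * 1e6:.1f} us per call")
```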
Now you can extract all of your cluster 2 data points like so:
X[ClusterIndicesNumpy(2,km.labels_)]
array([[ 6.9, 3.1, 4.9, 1.5],
[ 6.7, 3. , 5. , 1.7],
[ 6.3, 3.3, 6. , 2.5],
... #truncated
Double-check the first three indices from the truncated array above:
print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])
[ 6.9 3.1 4.9 1.5] 2
[ 6.7 3. 5. 1.7] 2
[ 6.3 3.3 6. 2.5] 2
Answer 3:
You can look at the labels_ attribute.
For example:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(km.labels_)
output: array([1, 1, 0], dtype=int32)
As you can see, the first and second points are in cluster 1 and the last point is in cluster 0.
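To turn labels_ into the actual points of one cluster, a boolean mask on the label array is usually enough. This is a small sketch on the same three points; the n_init and random_state settings are assumptions added for reproducibility. The integer assigned to each cluster is arbitrary and can change between runs, so the sketch looks up the label of the last point instead of hard-coding 0.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label assigned to the last point; which integer it gets is arbitrary.
last_label = km.labels_[-1]

# The boolean mask keeps every row of X that carries the same label.
same_cluster = X[km.labels_ == last_label]
print(same_cluster)
```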
Answer 4:
To get the IDs of the points/samples/observations that are inside each cluster, do the following.
Example using the Iris data and a nice Pythonic way:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X,y)
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.items():
    temp = [key, value]
    dictlist.append(temp)
RESULTS
#dict format
{0: array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
1: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
2: array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}
# list format
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
Answer 5:
You can simply store the labels in an array and convert the array to a DataFrame. Then merge the data that you used to fit KMeans with the new DataFrame of cluster labels.
Display the DataFrame. You should now see each row with its corresponding cluster. To list all the data in a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], assuming 2 is the cluster you want.
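The steps described in this answer can be sketched as follows. The answer does not say which data or which merge call it has in mind, so the Iris data, the use of DataFrame.join on the shared index, and the n_init and random_state settings are all assumptions. The column name cluster_label_name is taken from the answer's own example.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

# Iris features as a DataFrame (a stand-in for your own data).
iris = datasets.load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# n_init and random_state are set only to make the sketch reproducible.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Store the labels in their own DataFrame, sharing the index of `data`.
labels = pd.DataFrame({"cluster_label_name": km.labels_}, index=data.index)

# Merge on the index so that each row carries its cluster label.
data = data.join(labels)

# All rows assigned to cluster 2.
cluster_2 = data.loc[data["cluster_label_name"] == 2]
print(cluster_2.head())
```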
Source: https://stackoverflow.com/questions/36195457/python-sklearn-kmeans-how-to-get-the-samples-points-in-each-clusters