k-means

Is the triangle inequality necessary for k-means?

Submitted by 扶醉桌前 on 2019-12-03 08:33:25
I wonder whether the triangle inequality is necessary for the distance measure used in k-means. k-means is designed for Euclidean distance, which happens to satisfy the triangle inequality. Using other distance functions is risky, as the algorithm may stop converging. The reason, however, is not the triangle inequality, but that the mean might not minimize the distance function. (The arithmetic mean minimizes the sum of squared distances, not arbitrary distances!) There are faster variants of k-means that exploit the triangle inequality to avoid recomputations, but if you stick to classic MacQueen or Lloyd k-means, you do not need the triangle inequality.
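A minimal numpy check of that claim (the data values here are made up for illustration): for the same points, the arithmetic mean minimizes the sum of squared distances, while the sum of absolute (Manhattan) distances is minimized by the median instead.

    import numpy as np

    x = np.array([0.0, 1.0, 10.0])           # toy 1-D data
    cands = np.linspace(0, 10, 1001)         # candidate "centers"

    sq = ((x[None, :] - cands[:, None]) ** 2).sum(axis=1)
    l1 = np.abs(x[None, :] - cands[:, None]).sum(axis=1)

    print(cands[sq.argmin()], x.mean())      # both ~3.67: mean minimizes squared error
    print(cands[l1.argmin()], np.median(x))  # both 1.0: median minimizes L1 error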

Python sklearn KMeans: how to get the samples/points in each cluster

Submitted by 一笑奈何 on 2019-12-03 08:04:14
Question: I am using the sklearn.cluster KMeans package. Once I finish the clustering, if I need to know which values were grouped together, how can I do it? Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that? Is there a function that takes the cluster id and lists out all the data points in that cluster? Thanks. Answer 1: I had a similar requirement and I am using pandas to create a new dataframe with the index of the dataset
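A short sketch of the standard way to do this with scikit-learn (random placeholder data; note that labels are numbered from 0, so the question's "cluster 5" is label 4):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 2)                     # 100 placeholder data points
    km = KMeans(n_clusters=5, random_state=0).fit(X)

    # km.labels_[i] is the cluster id of X[i]
    idx_in_cluster = np.where(km.labels_ == 4)[0]  # indices of points in "cluster 5"
    pts_in_cluster = X[km.labels_ == 4]            # the points themselves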

Will scikit-learn utilize GPU?

Submitted by 江枫思渺然 on 2019-12-03 07:27:59
Question: Reading the k-means implementation in TensorFlow (http://learningtensorflow.com/lesson6/) and in scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), I'm struggling to decide which implementation to use. scikit-learn is installed as part of the tensorflow docker container, so I can use either implementation. Reason to use scikit-learn: it contains less boilerplate than the tensorflow implementation. Reason to use tensorflow: If running on
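For what it's worth, scikit-learn's KMeans runs on the CPU only and will not use a GPU. If speed on large data is the deciding factor, scikit-learn's own MiniBatchKMeans is the usual CPU-side alternative; a minimal sketch on placeholder data:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    X = np.random.rand(100_000, 10)   # placeholder large dataset
    mbk = MiniBatchKMeans(n_clusters=8, batch_size=1000, random_state=0).fit(X)
    print(mbk.inertia_)               # same objective as full KMeans, computed faster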

Weka simple K-means clustering assignments

Submitted by ∥☆過路亽.° on 2019-12-03 07:24:27
Question: I have what feels like a simple problem, but I can't seem to find an answer. I'm pretty new to Weka, but I feel like I've done a bit of research on this (at least read through the first couple of pages of Google results) and come up dry. I am using Weka to run clustering with SimpleKMeans. In the results list I have no problem visualizing my output ("Visualize cluster assignments"), and it is clear both from my understanding of the k-means algorithm and from the output of Weka that each of my

Can I use the k-means algorithm on strings?

Submitted by 微笑、不失礼 on 2019-12-03 07:12:31
Question: I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ideal structure and a population that evolves towards that ideal structure. I have implemented everything; however, I would like to add a feature where I can get the "number of buckets", i.e. the k most representative structures in the population at each generation. I was thinking of using the k-means
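k-means itself needs a mean, which is undefined for strings; a common workaround is k-medoids, which only needs pairwise distances and picks an actual population member as each cluster representative. A minimal sketch, assuming all structures have equal length so a Hamming distance applies (the example population is made up):

    import numpy as np

    def hamming(a, b):
        # number of positions where the two structure strings differ
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    def k_medoids(strings, k, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        n = len(strings)
        D = np.array([[hamming(s, t) for t in strings] for s in strings])
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(n_iter):
            labels = D[:, medoids].argmin(axis=1)          # assign to nearest medoid
            new = medoids.copy()
            for j in range(k):
                members = np.where(labels == j)[0]
                if len(members):
                    within = D[np.ix_(members, members)].sum(axis=0)
                    new[j] = members[within.argmin()]      # most central member
            if np.array_equal(new, medoids):
                break
            medoids = new
        return [strings[m] for m in medoids], labels

    pop = ["(((...)))", "((.....))", "(((..)).)", ".((...)).", "(((...)))"]
    reps, labels = k_medoids(pop, k=2)
    print(reps, labels)   # reps are the k most representative structures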

Should we use k-means++ instead of k-means?

Submitted by 不打扰是莪最后的温柔 on 2019-12-03 07:07:51
The k-means++ algorithm helps with the following two weaknesses of the original k-means algorithm: the original algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k); and the approximation found can be unsatisfactory with respect to the objective function compared to the optimal clustering. But are there any drawbacks to k-means++? Should we always use it instead of k-means from now on? Nobody claims k-means++ runs in O(lg k) time; its solution quality is O(lg k)-competitive with the optimal solution. Both k-means++
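A quick way to see the difference in practice with scikit-learn (synthetic blob data; with n_init=1 the quality of the initialization shows up directly in the final inertia):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

    for init in ("k-means++", "random"):
        km = KMeans(n_clusters=10, init=init, n_init=1, random_state=0).fit(X)
        print(init, km.inertia_)   # k-means++ seeding typically yields lower inertia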

Machine Learning: Clustering

Submitted by 天大地大妈咪最大 on 2019-12-03 06:53:30
Clustering algorithms

Introduction: Clustering takes a large set of unlabeled data and, based on features inherent in the data, partitions the dataset into multiple classes, so that data within a class are relatively similar and data in different classes are relatively dissimilar; it is a form of unsupervised learning. The core of a clustering algorithm is computing the similarity between samples, sometimes expressed as a distance between samples.

Similarity / distance

Minkowski distance:
\[ dist(X,Y)=\sqrt[p]{\sum_{i=1}^{n}{|x_i-y_i|^p}} \]
\[ \text{where } X=(x_1,x_2,...,x_n),\; Y=(y_1,y_2,...,y_n) \]
When p = 1 this is the Manhattan (city-block) distance:
\[ dist(X,Y)=\sum_{i=1}^{n}{|x_i-y_i|} \]
When p = 2 it is the Euclidean distance:
\[ E\_dist(X,Y)=\sqrt{\sum_{i=1}^{n}{|x_i-y_i|^2}} \]
As p goes to infinity it becomes the Chebyshev distance:
\[ C\_dist(X,Y)=\max_i(|x_i-y_i|) \]
This follows from the bound
\[ \max_i(|x_i-y_i|)\;\leq\;\sqrt[p]{\sum_{i=1}^{n}{|x_i-y_i|^p}}\;\leq\;\sqrt[p]{n\times\max_i(|x_i-y_i|^p)}=\sqrt[p]{n}\,\max_i(|x_i-y_i|) \]
since \( \sqrt[p]{n}\to 1 \) as \( p\to\infty \).
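A small numpy/scipy check of the limit above (the two vectors are arbitrary examples): the Minkowski distance approaches the Chebyshev distance as p grows.

    import numpy as np
    from scipy.spatial.distance import minkowski, chebyshev

    x = np.array([3.0, 6.0, 8.0])
    y = np.array([5.0, 2.0, 3.0])

    for p in (1, 2, 4, 16, 64):
        print(p, minkowski(x, y, p))       # decreases toward max|x_i - y_i|
    print("chebyshev:", chebyshev(x, y))   # = 5 here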

MATLAB kmeans: “Empty cluster created at iteration 1” error

Submitted by 半腔热情 on 2019-12-03 06:27:12
I'm using this script to cluster a set of 3D points with the MATLAB kmeans function, but I always get the error "Empty cluster created at iteration 1". The script I'm using:

    [G,C] = kmeans(XX, K, 'distance','sqEuclidean', 'start','sample');

XX can be found in this link: XX value, and K is set to 3. So could anyone please advise me why this is happening? Answer (Amro): It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or by the data having fewer inherent clusters than
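To see where the message comes from, here is a numpy sketch of the assign-recompute loop with one common fix: when a cluster loses all its points, reseed it with the point farthest from its nearest centroid. (MATLAB's kmeans exposes a similar remedy via its 'EmptyAction' option, e.g. 'EmptyAction','singleton'.)

    import numpy as np
    from scipy.spatial.distance import cdist

    def lloyd(X, C, n_iter=100):
        # X: (n, d) data; C: (k, d) initial centroids (modified in place)
        for _ in range(n_iter):
            D = cdist(X, C, metric="sqeuclidean")
            labels = D.argmin(axis=1)               # assign step
            for j in range(C.shape[0]):
                members = X[labels == j]
                if len(members) == 0:
                    # empty cluster: reseed with the worst-fit point
                    C[j] = X[D.min(axis=1).argmax()]
                else:
                    C[j] = members.mean(axis=0)     # recompute step
        return labels, C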

How to compute distances between centroids and a data matrix (for the k-means algorithm)

Submitted by 心已入冬 on 2019-12-03 06:13:45
Question: I am a student of clustering and R. In order to get a better grip on both, I would like to compute the distance between the centroids and my xy-matrix for each iteration until it "converges". How can I solve steps 2 and 3 using R?

    library(fields)
    x <- c(3,6,8,1,2,2,6,6,7,7,8,8)
    y <- c(5,2,3,5,4,6,1,8,3,6,1,7)
    df <- data.frame(x,y)     # initial matrix
    a <- c(3,6,8)
    b <- c(5,2,3)
    df1 <- data.frame(a,b)    # initial centroids

Here is what I want to do:

    I0 <- t(rdist(df, df1))   # after zero iterations
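For comparison, the same steps 2 and 3 in Python, using the question's data: scipy's cdist plays the role of rdist, and the loop stops when the centroids no longer move (this sketch assumes no cluster goes empty along the way).

    import numpy as np
    from scipy.spatial.distance import cdist

    X = np.array([[3,5],[6,2],[8,3],[1,5],[2,4],[2,6],
                  [6,1],[6,8],[7,3],[7,6],[8,1],[8,7]], dtype=float)
    C = np.array([[3,5],[6,2],[8,3]], dtype=float)   # initial centroids

    for it in range(10):
        D = cdist(C, X)              # 3 x 12, like t(rdist(df, df1)) in R
        labels = D.argmin(axis=0)    # step 2: nearest centroid per point
        newC = np.array([X[labels == j].mean(axis=0) for j in range(3)])  # step 3
        if np.allclose(newC, C):     # converged: centroids stopped moving
            break
        C = newC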

Python Clustering Algorithms

Submitted by 时光毁灭记忆、已成空白 on 2019-12-03 05:42:18
I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known and, in addition, no a priori linking lengths are known (similar to this question). I've tried k-means, which works well if you know how many clusters you want. I've tried DBSCAN, which does poorly unless you tell it a characteristic length scale at which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles,
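If k is genuinely unknown and no linking length can be supplied, one option worth trying is mean-shift, which estimates its own scale from the data; a minimal sketch on placeholder particle positions:

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    X = np.random.rand(500, 3)                 # placeholder particle positions
    bw = estimate_bandwidth(X, quantile=0.2)   # data-driven scale, no k required
    ms = MeanShift(bandwidth=bw).fit(X)
    print(len(np.unique(ms.labels_)), "clusters found")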