k-means

Is the triangle inequality necessary for k-means?

Submitted by 扶醉桌前 on 2019-12-03 08:33:25
I wonder whether the triangle inequality is necessary for the distance measure used in k-means. k-means is designed for Euclidean distance, which happens to satisfy the triangle inequality. Using other distance functions is risky, as the algorithm may stop converging. The reason, however, is not the triangle inequality, but that the mean might not minimize the distance function. (The arithmetic mean minimizes the sum of squared distances, not arbitrary distances!) There are faster variants of k-means that exploit the triangle inequality to avoid recomputations, but if you stick to classic MacQueen or Lloyd k-means, you do not need the triangle inequality.
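A minimal numpy check of that claim (the data values here are made up for illustration): for the same points, the arithmetic mean minimizes the sum of squared distances, while the sum of absolute (Manhattan) distances is minimized by the median instead.

    import numpy as np

    x = np.array([0.0, 1.0, 10.0])           # toy 1-D data
    cands = np.linspace(0, 10, 1001)         # candidate "centers"

    sq = ((x[None, :] - cands[:, None]) ** 2).sum(axis=1)
    l1 = np.abs(x[None, :] - cands[:, None]).sum(axis=1)

    print(cands[sq.argmin()], x.mean())      # both ~3.67: mean minimizes squared error
    print(cands[l1.argmin()], np.median(x))  # both 1.0: median minimizes L1 error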

Python sklearn KMeans: how to get the samples/points in each cluster

Submitted by 一笑奈何 on 2019-12-03 08:04:14
Question: I am using the sklearn.cluster KMeans package. Once I finish the clustering, if I need to know which values were grouped together, how can I do it? Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that? Is there a function that takes the cluster id and lists out all the data points in that cluster? Thanks. Answer 1: I had a similar requirement and I am using pandas to create a new dataframe with the index of the dataset
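A short sketch of the standard way to do this with scikit-learn (random placeholder data; note that labels are numbered from 0, so the question's "cluster 5" is label 4):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 2)                     # 100 placeholder data points
    km = KMeans(n_clusters=5, random_state=0).fit(X)

    # km.labels_[i] is the cluster id of X[i]
    idx_in_cluster = np.where(km.labels_ == 4)[0]  # indices of points in "cluster 5"
    pts_in_cluster = X[km.labels_ == 4]            # the points themselves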

Will scikit-learn utilize GPU?

Submitted by 江枫思渺然 on 2019-12-03 07:27:59
Question: Reading the k-means implementation in TensorFlow (http://learningtensorflow.com/lesson6/) and in scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), I'm struggling to decide which implementation to use. scikit-learn is installed as part of the tensorflow docker container, so I can use either implementation. Reason to use scikit-learn: it contains less boilerplate than the tensorflow implementation. Reason to use tensorflow: If running on
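For what it's worth, scikit-learn's KMeans runs on the CPU only and will not use a GPU. If speed on large data is the deciding factor, scikit-learn's own MiniBatchKMeans is the usual CPU-side alternative; a minimal sketch on placeholder data:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    X = np.random.rand(100_000, 10)   # placeholder large dataset
    mbk = MiniBatchKMeans(n_clusters=8, batch_size=1000, random_state=0).fit(X)
    print(mbk.inertia_)               # same objective as full KMeans, computed faster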

Weka simple K-means clustering assignments

Submitted by ∥☆過路亽.° on 2019-12-03 07:24:27
Question: I have what feels like a simple problem, but I can't seem to find an answer. I'm pretty new to Weka, but I feel like I've done a bit of research on this (at least read through the first couple of pages of Google results) and come up dry. I am using Weka to run clustering with SimpleKMeans. In the results list I have no problem visualizing my output ("Visualize cluster assignments"), and it is clear both from my understanding of the k-means algorithm and from the output of Weka that each of my

Can I use the k-means algorithm on strings?

Submitted by 微笑、不失礼 on 2019-12-03 07:12:31
Question: I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ideal structure and a population that evolves towards that ideal structure. I have implemented everything; however, I would like to add a feature where I can get the "number of buckets", i.e. the k most representative structures in the population at each generation. I was thinking of using the k-means
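k-means itself needs a mean, which is undefined for strings; a common workaround is k-medoids, which only needs pairwise distances and picks an actual population member as each cluster representative. A minimal sketch, assuming all structures have equal length so a Hamming distance applies (the example population is made up):

    import numpy as np

    def hamming(a, b):
        # number of positions where the two structure strings differ
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    def k_medoids(strings, k, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        n = len(strings)
        D = np.array([[hamming(s, t) for t in strings] for s in strings])
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(n_iter):
            labels = D[:, medoids].argmin(axis=1)          # assign to nearest medoid
            new = medoids.copy()
            for j in range(k):
                members = np.where(labels == j)[0]
                if len(members):
                    within = D[np.ix_(members, members)].sum(axis=0)
                    new[j] = members[within.argmin()]      # most central member
            if np.array_equal(new, medoids):
                break
            medoids = new
        return [strings[m] for m in medoids], labels

    pop = ["(((...)))", "((.....))", "(((..)).)", ".((...)).", "(((...)))"]
    reps, labels = k_medoids(pop, k=2)
    print(reps, labels)   # reps are the k most representative structures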

Should we use k-means++ instead of k-means?

Submitted by 不打扰是莪最后的温柔 on 2019-12-03 07:07:51
The k-means++ algorithm helps with the following two weaknesses of the original k-means algorithm: the original algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k); and the approximation found can be unsatisfactory with respect to the objective function compared to the optimal clustering. But are there any drawbacks to k-means++? Should we always use it instead of k-means from now on? Nobody claims k-means++ runs in O(lg k) time; its solution quality is O(lg k)-competitive with the optimal solution. Both k-means++
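A quick way to see the difference in practice with scikit-learn (synthetic blob data; with n_init=1 the quality of the initialization shows up directly in the final inertia):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

    for init in ("k-means++", "random"):
        km = KMeans(n_clusters=10, init=init, n_init=1, random_state=0).fit(X)
        print(init, km.inertia_)   # k-means++ seeding typically yields lower inertia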

Machine Learning: Clustering

Submitted by 天大地大妈咪最大 on 2019-12-03 06:53:30
Clustering algorithms

Introduction: Clustering takes a large set of unlabeled data and, based on features inherent in the data, partitions the dataset into multiple classes, so that data within a class are relatively similar and data in different classes are relatively dissimilar; it is a form of unsupervised learning. The core of a clustering algorithm is computing the similarity between samples, sometimes expressed as a distance between samples.

Similarity / distance

Minkowski distance:
\[ dist(X,Y)=\sqrt[p]{\sum_{i=1}^{n}{|x_i-y_i|^p}} \]
\[ \text{where } X=(x_1,x_2,...,x_n),\; Y=(y_1,y_2,...,y_n) \]
When p = 1 this is the Manhattan (city-block) distance:
\[ dist(X,Y)=\sum_{i=1}^{n}{|x_i-y_i|} \]
When p = 2 it is the Euclidean distance:
\[ E\_dist(X,Y)=\sqrt{\sum_{i=1}^{n}{|x_i-y_i|^2}} \]
As p goes to infinity it becomes the Chebyshev distance:
\[ C\_dist(X,Y)=\max_i(|x_i-y_i|) \]
This follows from the bound
\[ \max_i(|x_i-y_i|)\;\leq\;\sqrt[p]{\sum_{i=1}^{n}{|x_i-y_i|^p}}\;\leq\;\sqrt[p]{n\times\max_i(|x_i-y_i|^p)}=\sqrt[p]{n}\,\max_i(|x_i-y_i|) \]
since \( \sqrt[p]{n}\to 1 \) as \( p\to\infty \).
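A small numpy/scipy check of the limit above (the two vectors are arbitrary examples): the Minkowski distance approaches the Chebyshev distance as p grows.

    import numpy as np
    from scipy.spatial.distance import minkowski, chebyshev

    x = np.array([3.0, 6.0, 8.0])
    y = np.array([5.0, 2.0, 3.0])

    for p in (1, 2, 4, 16, 64):
        print(p, minkowski(x, y, p))       # decreases toward max|x_i - y_i|
    print("chebyshev:", chebyshev(x, y))   # = 5 here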

MATLAB kmeans: “Empty cluster created at iteration 1” error

Submitted by 半腔热情 on 2019-12-03 06:27:12
I'm using this script to cluster a set of 3D points with the MATLAB kmeans function, but I always get the error "Empty cluster created at iteration 1". The script I'm using:

    [G,C] = kmeans(XX, K, 'distance','sqEuclidean', 'start','sample');

XX can be found in this link: XX value, and K is set to 3. So could anyone please advise me why this is happening? Answer (Amro): It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or by the data having fewer inherent clusters than
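To see where the message comes from, here is a numpy sketch of the assign-recompute loop with one common fix: when a cluster loses all its points, reseed it with the point farthest from its nearest centroid. (MATLAB's kmeans exposes a similar remedy via its 'EmptyAction' option, e.g. 'EmptyAction','singleton'.)

    import numpy as np
    from scipy.spatial.distance import cdist

    def lloyd(X, C, n_iter=100):
        # X: (n, d) data; C: (k, d) initial centroids (modified in place)
        for _ in range(n_iter):
            D = cdist(X, C, metric="sqeuclidean")
            labels = D.argmin(axis=1)               # assign step
            for j in range(C.shape[0]):
                members = X[labels == j]
                if len(members) == 0:
                    # empty cluster: reseed with the worst-fit point
                    C[j] = X[D.min(axis=1).argmax()]
                else:
                    C[j] = members.mean(axis=0)     # recompute step
        return labels, C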

How to compute distances between centroids and a data matrix (for the k-means algorithm)

Submitted by 心已入冬 on 2019-12-03 06:13:45
Question: I am a student of clustering and R. In order to get a better grip on both, I would like to compute the distance between the centroids and my xy-matrix for each iteration until it "converges". How can I solve steps 2 and 3 using R?

    library(fields)
    x <- c(3,6,8,1,2,2,6,6,7,7,8,8)
    y <- c(5,2,3,5,4,6,1,8,3,6,1,7)
    df <- data.frame(x,y)     # initial matrix
    a <- c(3,6,8)
    b <- c(5,2,3)
    df1 <- data.frame(a,b)    # initial centroids

Here is what I want to do:

    I0 <- t(rdist(df, df1))   # after zero iterations
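For comparison, the same steps 2 and 3 in Python, using the question's data: scipy's cdist plays the role of rdist, and the loop stops when the centroids no longer move (this sketch assumes no cluster goes empty along the way).

    import numpy as np
    from scipy.spatial.distance import cdist

    X = np.array([[3,5],[6,2],[8,3],[1,5],[2,4],[2,6],
                  [6,1],[6,8],[7,3],[7,6],[8,1],[8,7]], dtype=float)
    C = np.array([[3,5],[6,2],[8,3]], dtype=float)   # initial centroids

    for it in range(10):
        D = cdist(C, X)              # 3 x 12, like t(rdist(df, df1)) in R
        labels = D.argmin(axis=0)    # step 2: nearest centroid per point
        newC = np.array([X[labels == j].mean(axis=0) for j in range(3)])  # step 3
        if np.allclose(newC, C):     # converged: centroids stopped moving
            break
        C = newC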

Python Clustering Algorithms

Submitted by 时光毁灭记忆、已成空白 on 2019-12-03 05:42:18
I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known and, in addition, no a priori linking lengths are known (similar to this question). I've tried k-means, which works well if you know how many clusters you want. I've tried DBSCAN, which does poorly unless you tell it a characteristic length scale at which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles,
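If k is genuinely unknown and no linking length can be supplied, one option worth trying is mean-shift, which estimates its own scale from the data; a minimal sketch on placeholder particle positions:

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    X = np.random.rand(500, 3)                 # placeholder particle positions
    bw = estimate_bandwidth(X, quantile=0.2)   # data-driven scale, no k required
    ms = MeanShift(bandwidth=bw).fit(X)
    print(len(np.unique(ms.labels_)), "clusters found")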