k-means

How do I obtain the individual centroids of a k-means clustering using nltk (Python)?

[亡魂溺海] submitted on 2020-01-25 07:32:05
Question: I have used nltk to perform k-means clustering because I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters? kclusterer = KMeansClusterer(8, distance = nltk.cluster.util.cosine_distance, repeats = 1) predict = kclusterer.cluster(features, assign_clusters = True) centroids = kclusterer._centroid df_clustering['cluster'] = predict #df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist() df_clustering['centroid'
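As a hedged sketch of one way to get those centroids: NLTK's KMeansClusterer exposes a public means() method, which avoids relying on the private _centroid attribute tried in the excerpt. The `features` array below is a made-up placeholder for the question's real data.

```python
# Hedged sketch: read the centroids via the public means() method of NLTK's
# KMeansClusterer; `features` is placeholder data, not the question's dataset.
import numpy as np
import nltk
from nltk.cluster import KMeansClusterer

features = np.random.default_rng(0).random((200, 4))   # placeholder feature matrix

kclusterer = KMeansClusterer(
    8,
    distance=nltk.cluster.util.cosine_distance,
    repeats=1,
    avoid_empty_clusters=True,   # keeps a small run from failing on an empty cluster
)
assignments = kclusterer.cluster(features, assign_clusters=True)

centroids = kclusterer.means()   # one centroid vector per cluster
print(len(centroids), centroids[0])
```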

The k-means algorithm

五迷三道 submitted on 2020-01-24 02:07:35
k-means is a common clustering algorithm. The basic idea: first specify how many clusters you want (k) and the initial coordinates of the k cluster centers. Scan all points linearly, compute each point's distance to every center, and assign the point to the cluster whose center is closest; then update the coordinates of each cluster center. Iterate these steps until a convergence criterion is met. Pseudocode is given in a figure in the original post. However, the initial value of k and the randomly chosen initial center coordinates can both affect the result. For choosing k we can use the elbow method (figure from Zhihu @是泽哥啊): plot the total distance from every point to its cluster center for each k and take the bend point of the curve (3 in the example). Monte Carlo simulation can also be used to pick k automatically. The problem of randomly chosen initial center coordinates can be addressed with the k-means++ algorithm. However, linearly scanning all points at every step is time-consuming; to improve efficiency we can use a kd-tree. A kd-tree splits n-dimensional space (when the coordinates are n-dimensional) with cuts perpendicular to the coordinate axes, each cut placed at the median of that coordinate, as shown in a figure in the original post. A kd-tree is a binary tree: initially the root node is the whole coordinate space, its left and right subtrees are the two regions created by the split, and splitting is repeated until a region contains no more points. Source: CSDN Author: JLUspring Link: https://blog.csdn.net/qq_37724465/article/details/103834700
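As an illustration of the elbow method described above (not part of the original post), here is a minimal Python sketch using scikit-learn with placeholder data:

```python
# Elbow-method sketch with scikit-learn; X is placeholder data for the real dataset.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # total within-cluster squared distance for this k

# Choose k near the "elbow", where the inertia curve stops dropping sharply.
for k, j in zip(range(1, 10), inertias):
    print(k, round(j, 1))
```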

Initial centroids for scikit-learn kmeans clustering

随声附和 submitted on 2020-01-23 10:59:07
Question: If I already have a numpy array that can serve as the initial centroids, how can I properly initialize the kmeans algorithm? I am using the scikit-learn KMeans class. This post (k-means with selected initial centers) indicates that I only need to set n_init=1 if I am using a numpy array as the initial centroids, but I am not sure if my initialization is working properly. Naftali Harris' excellent visualization page shows what I am trying to do: http://www.naftaliharris.com/blog/visualizing-k
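A minimal sketch of the intended initialization, assuming `X` is the data and `init_centroids` is the user-supplied array (both placeholders here): passing the array as init and setting n_init=1 makes scikit-learn run a single k-means pass starting from exactly those seeds.

```python
# Minimal sketch; X and init_centroids are placeholders for the question's arrays.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))
init_centroids = np.array([[0.2, 0.2], [0.5, 0.5], [0.8, 0.8]])   # hypothetical seeds

# An explicit init array with n_init=1 runs one k-means pass from those seeds
# instead of re-seeding randomly several times.
km = KMeans(n_clusters=init_centroids.shape[0], init=init_centroids, n_init=1).fit(X)
print(km.cluster_centers_)   # final centroids after convergence
```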

Microsoft SQL and R, stored procedure and k-means

你。 submitted on 2020-01-23 09:53:08
Question: I am new here; I hope to help and be helped. I am working with the new Microsoft SQL Server Management Studio (2016), using its new features that integrate with R. First of all, my goal is to create a stored procedure that performs k-means clustering on x and y columns. The problem is that I am stuck in the middle, because I am not able to adapt the online documentation to my case. Here is the script: CREATE TABLE [dbo].[ModelTable] ( column_name1 varchar(8000) ) ; CREATE

Reveal k-modes cluster features

五迷三道 submitted on 2020-01-22 06:00:46
Question: I'm performing a cluster analysis on categorical data, hence using the k-modes approach. My data is shaped like a preference survey: how do you like hair and eyes? The respondent picks an answer from a fixed (multiple-choice) set of 4 possibilities. I therefore get the dummies, apply k-modes, attach the clusters back to the initial df and then plot them in 2D with PCA. My code looks like: import numpy as np import pandas as pd from kmodes import kmodes df_dummy = pd.get_dummies(df) #transform
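A hedged sketch of how the cluster "features" can be revealed with the kmodes package; the survey columns below are invented for illustration, and note that k-modes can also be run on the raw categorical columns rather than on dummies.

```python
# Hedged sketch with the kmodes package; the survey columns are invented examples.
import pandas as pd
from kmodes.kmodes import KModes

df = pd.DataFrame({
    "hair": ["blond", "dark", "dark", "red", "blond", "dark"],
    "eyes": ["blue", "brown", "green", "brown", "blue", "brown"],
})

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(df)   # k-modes can cluster the raw categories directly

# cluster_centroids_ holds the modal category of each column per cluster,
# which is one way to reveal what characterises each cluster.
print(pd.DataFrame(km.cluster_centroids_, columns=df.columns))
print(labels)
```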

How Could One Implement the K-Means++ Algorithm?

末鹿安然 submitted on 2020-01-19 06:33:12
Question: I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means algorithm. Is the probability function used based on distance or Gaussian? At the same time, is the most distant point (from the other centroids) picked as a new centroid? I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. Also a very well
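A step-by-step sketch of k-means++ seeding (a generic illustration, not from any particular library): the first centre is chosen uniformly at random, and each subsequent centre is sampled with probability proportional to the squared distance to its nearest already-chosen centre, i.e. D² weighting rather than a Gaussian, and not simply the single farthest point.

```python
# Generic k-means++ seeding sketch: D^2-weighted sampling, not Gaussian and not
# a deterministic "farthest point" rule.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                    # step 1: uniform random first centre
    for _ in range(k - 1):
        C = np.asarray(centers)
        # squared distance of every point to its nearest chosen centre
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        probs = d2 / d2.sum()                         # probability proportional to D^2
        centers.append(X[rng.choice(n, p=probs)])     # far points are likely, not certain
    return np.asarray(centers)

X = np.random.default_rng(1).normal(size=(200, 2))
print(kmeans_pp_init(X, 3))
```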

Principle of the k-means clustering algorithm and a C++ implementation

孤人 submitted on 2020-01-18 03:56:56
Clustering means grouping data according to the data's own features, without manual labeling; it is a form of unsupervised learning. k-means is one of the simplest clustering algorithms. The k-means algorithm partitions n data objects into k clusters so that objects within the same cluster are highly similar while objects in different clusters are not. Cluster similarity is computed against a "center object" (center of gravity) obtained as the mean of the objects in each cluster.

Based on this assumption, we can derive the objective function that k-means optimizes: suppose we have N data points to be divided into K clusters; what k-means does is minimize the objective function J = Σ_{n=1..N} Σ_{k=1..K} r_nk · ||x_n − μ_k||², where μ_k is the center of the k-th cluster and r_nk is 1 when the n-th point belongs to cluster k and 0 otherwise.

The procedure is as follows: 1. First pick k of the n data objects arbitrarily as the initial cluster centers; each remaining object is then assigned, according to its similarity (distance) to these centers, to the cluster represented by the center it is most similar to. 2. Then recompute the center of each new cluster (the mean of all objects in that cluster); repeat this process until the criterion function converges. The mean squared error is usually used as the criterion function, and the k clusters then have the property that each cluster is as compact as possible while the clusters are as far apart as possible. Every update of the cluster centers decreases the objective function, so the iteration eventually brings J to a local minimum; a global minimum is not guaranteed. k-means is very sensitive to noise.

C++ implementation: class ClusterMethod { private:
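The post's C++ ClusterMethod class is cut off above; purely as an illustration of the two alternating steps that drive J down, here is a short NumPy sketch (not the author's code):

```python
# Illustration only (Python, not the author's C++ ClusterMethod):
# alternating assignment and mean-update steps, each of which lowers J.
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]       # arbitrary initial centres
    for _ in range(iters):
        # assignment step: r_nk = 1 for the nearest centre (minimises J over r)
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        # update step: each centre becomes the mean of its points (minimises J over mu)
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[labels]) ** 2).sum()                  # value of the objective at the end
    return labels, mu, J

X = np.random.default_rng(1).normal(size=(150, 2))
labels, mu, J = kmeans(X, 3)
print(mu)
print(J)
```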

KMeans|| for sentiment analysis on Spark

这一生的挚爱 submitted on 2020-01-15 03:05:08
Question: I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel initialization, the algorithm ran for 3 hours! But with the random initialization strategy it took about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM. K ~= 4000
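A hedged PySpark sketch (the question does not show its code) comparing MLlib's two initialization modes; the RDD of 100-dimensional vectors below is synthetic. The "k-means||" initialization performs extra distributed seeding passes whose cost grows with k and initializationSteps, which is one plausible reason a large k (~4000) is far slower than "random".

```python
# Hedged PySpark sketch; the vectors are synthetic stand-ins for the word2vec output.
import random
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-init-comparison")
random.seed(0)
vectors_rdd = sc.parallelize([[random.random() for _ in range(100)] for _ in range(1000)])

# "k-means||" (the default) runs extra distributed seeding passes; "random" does not.
model_parallel = KMeans.train(vectors_rdd, k=50,
                              initializationMode="k-means||", initializationSteps=2)
model_random = KMeans.train(vectors_rdd, k=50, initializationMode="random")

print(model_parallel.computeCost(vectors_rdd), model_random.computeCost(vectors_rdd))
sc.stop()
```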

Scipy Kmeans exits with TypeError

陌路散爱 submitted on 2020-01-15 01:51:14
Question: When running the code below, I'm getting a TypeError that says: "File "_vq.pyx", line 342, in scipy.cluster._vq.update_cluster_means TypeError: type other than float or double not supported" from PIL import Image import scipy, scipy.misc, scipy.cluster NUM_CLUSTERS = 5 im = Image.open('d:/temp/test.jpg') ar = scipy.misc.fromimage(im) shape = ar.shape ar = ar.reshape(scipy.product(shape[:2]), shape[2]) codes, dist = scipy.cluster.vq.kmeans(ar, NUM_CLUSTERS) vecs, dist = scipy.cluster.vq.vq(ar,
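A likely fix, sketched under the assumption that the TypeError comes from passing uint8 pixel data (scipy.cluster.vq.kmeans expects float observations); np.asarray stands in for scipy.misc.fromimage, which has been removed from recent SciPy versions.

```python
# Sketch of the likely fix: cast the uint8 pixel array to float before kmeans.
# np.asarray stands in for scipy.misc.fromimage (removed from recent SciPy versions).
import numpy as np
from PIL import Image
from scipy.cluster.vq import kmeans, vq

NUM_CLUSTERS = 5
im = Image.open('d:/temp/test.jpg')                   # same file path as in the question
ar = np.asarray(im)
ar = ar.reshape(-1, ar.shape[2]).astype(np.float64)   # float observations avoid the TypeError

codes, dist = kmeans(ar, NUM_CLUSTERS)
labels, dists = vq(ar, codes)
print(codes)
```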