k-means

How do I obtain the individual centroids of a k-means clustering using nltk (Python)?

[亡魂溺海] submitted on 2020-01-25 07:32:05
Question: I have used nltk to perform k-means clustering because I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters? kclusterer = KMeansClusterer(8, distance = nltk.cluster.util.cosine_distance, repeats = 1) predict = kclusterer.cluster(features, assign_clusters = True) centroids = kclusterer._centroid df_clustering['cluster'] = predict #df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist() df_clustering['centroid'
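As a hedged sketch of one way to get those centroids: NLTK's KMeansClusterer exposes a public means() method, which avoids relying on the private _centroid attribute tried in the excerpt. The `features` array below is a made-up placeholder for the question's real data.

```python
# Hedged sketch: read the centroids via the public means() method of NLTK's
# KMeansClusterer; `features` is placeholder data, not the question's dataset.
import numpy as np
import nltk
from nltk.cluster import KMeansClusterer

features = np.random.default_rng(0).random((200, 4))   # placeholder feature matrix

kclusterer = KMeansClusterer(
    8,
    distance=nltk.cluster.util.cosine_distance,
    repeats=1,
    avoid_empty_clusters=True,   # keeps a small run from failing on an empty cluster
)
assignments = kclusterer.cluster(features, assign_clusters=True)

centroids = kclusterer.means()   # one centroid vector per cluster
print(len(centroids), centroids[0])
```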

The k-means algorithm

五迷三道 submitted on 2020-01-24 02:07:35
k-means is a common clustering algorithm. The basic idea: first specify how many clusters you want (k) and the initial coordinates of the k cluster centers. Scan all points linearly, compute each point's distance to every center, and assign the point to the cluster whose center is closest; then update the coordinates of each cluster center. Iterate these steps until a convergence criterion is met. Pseudocode is given in a figure in the original post. However, the initial value of k and the randomly chosen initial center coordinates can both affect the result. For choosing k we can use the elbow method (figure from Zhihu @是泽哥啊): plot the total distance from every point to its cluster center for each k and take the bend point of the curve (3 in the example). Monte Carlo simulation can also be used to pick k automatically. The problem of randomly chosen initial center coordinates can be addressed with the k-means++ algorithm. However, linearly scanning all points at every step is time-consuming; to improve efficiency we can use a kd-tree. A kd-tree splits n-dimensional space (when the coordinates are n-dimensional) with cuts perpendicular to the coordinate axes, each cut placed at the median of that coordinate, as shown in a figure in the original post. A kd-tree is a binary tree: initially the root node is the whole coordinate space, its left and right subtrees are the two regions created by the split, and splitting is repeated until a region contains no more points. Source: CSDN Author: JLUspring Link: https://blog.csdn.net/qq_37724465/article/details/103834700
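As an illustration of the elbow method described above (not part of the original post), here is a minimal Python sketch using scikit-learn with placeholder data:

```python
# Elbow-method sketch with scikit-learn; X is placeholder data for the real dataset.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # total within-cluster squared distance for this k

# Choose k near the "elbow", where the inertia curve stops dropping sharply.
for k, j in zip(range(1, 10), inertias):
    print(k, round(j, 1))
```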

Initial centroids for scikit-learn kmeans clustering

随声附和 submitted on 2020-01-23 10:59:07
Question: If I already have a numpy array that can serve as the initial centroids, how can I properly initialize the kmeans algorithm? I am using the scikit-learn KMeans class. This post (k-means with selected initial centers) indicates that I only need to set n_init=1 if I am using a numpy array as the initial centroids, but I am not sure if my initialization is working properly. Naftali Harris' excellent visualization page shows what I am trying to do: http://www.naftaliharris.com/blog/visualizing-k
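A minimal sketch of the intended initialization, assuming `X` is the data and `init_centroids` is the user-supplied array (both placeholders here): passing the array as init and setting n_init=1 makes scikit-learn run a single k-means pass starting from exactly those seeds.

```python
# Minimal sketch; X and init_centroids are placeholders for the question's arrays.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))
init_centroids = np.array([[0.2, 0.2], [0.5, 0.5], [0.8, 0.8]])   # hypothetical seeds

# An explicit init array with n_init=1 runs one k-means pass from those seeds
# instead of re-seeding randomly several times.
km = KMeans(n_clusters=init_centroids.shape[0], init=init_centroids, n_init=1).fit(X)
print(km.cluster_centers_)   # final centroids after convergence
```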

Microsoft SQL and R, stored procedure and k-means

你。 submitted on 2020-01-23 09:53:08
Question: I am new here; I hope to help and be helped. I am working with the new Microsoft SQL Server Management Studio (2016), using its new features that integrate with R. First of all, my goal is to create a stored procedure that performs k-means clustering on x and y columns. The problem is that I am stuck in the middle, because I am not able to adapt the online documentation to my case. Here is the script: CREATE TABLE [dbo].[ModelTable] ( column_name1 varchar(8000) ) ; CREATE

Reveal k-modes cluster features

五迷三道 submitted on 2020-01-22 06:00:46
Question: I'm performing a cluster analysis on categorical data, hence using the k-modes approach. My data is shaped like a preference survey: how do you like hair and eyes? The respondent picks an answer from a fixed (multiple-choice) set of 4 possibilities. I therefore get the dummies, apply k-modes, attach the clusters back to the initial df and then plot them in 2D with PCA. My code looks like: import numpy as np import pandas as pd from kmodes import kmodes df_dummy = pd.get_dummies(df) #transform
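A hedged sketch of how the cluster "features" can be revealed with the kmodes package; the survey columns below are invented for illustration, and note that k-modes can also be run on the raw categorical columns rather than on dummies.

```python
# Hedged sketch with the kmodes package; the survey columns are invented examples.
import pandas as pd
from kmodes.kmodes import KModes

df = pd.DataFrame({
    "hair": ["blond", "dark", "dark", "red", "blond", "dark"],
    "eyes": ["blue", "brown", "green", "brown", "blue", "brown"],
})

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(df)   # k-modes can cluster the raw categories directly

# cluster_centroids_ holds the modal category of each column per cluster,
# which is one way to reveal what characterises each cluster.
print(pd.DataFrame(km.cluster_centroids_, columns=df.columns))
print(labels)
```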

How Could One Implement the K-Means++ Algorithm?

末鹿安然 submitted on 2020-01-19 06:33:12
Question: I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means algorithm. Is the probability function used based on distance or Gaussian? At the same time, is the most distant point (from the other centroids) picked as a new centroid? I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. Also a very well
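A step-by-step sketch of k-means++ seeding (a generic illustration, not from any particular library): the first centre is chosen uniformly at random, and each subsequent centre is sampled with probability proportional to the squared distance to its nearest already-chosen centre, i.e. D² weighting rather than a Gaussian, and not simply the single farthest point.

```python
# Generic k-means++ seeding sketch: D^2-weighted sampling, not Gaussian and not
# a deterministic "farthest point" rule.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                    # step 1: uniform random first centre
    for _ in range(k - 1):
        C = np.asarray(centers)
        # squared distance of every point to its nearest chosen centre
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        probs = d2 / d2.sum()                         # probability proportional to D^2
        centers.append(X[rng.choice(n, p=probs)])     # far points are likely, not certain
    return np.asarray(centers)

X = np.random.default_rng(1).normal(size=(200, 2))
print(kmeans_pp_init(X, 3))
```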

Principle of the k-means clustering algorithm and a C++ implementation

孤人 submitted on 2020-01-18 03:56:56
Clustering means grouping data according to the data's own features, without manual labeling; it is a form of unsupervised learning. k-means is one of the simplest clustering algorithms. The k-means algorithm partitions n data objects into k clusters so that objects within the same cluster are highly similar while objects in different clusters are not. Cluster similarity is computed against a "center object" (center of gravity) obtained as the mean of the objects in each cluster.

Based on this assumption, we can derive the objective function that k-means optimizes: suppose we have N data points to be divided into K clusters; what k-means does is minimize the objective function J = Σ_{n=1..N} Σ_{k=1..K} r_nk · ||x_n − μ_k||², where μ_k is the center of the k-th cluster and r_nk is 1 when the n-th point belongs to cluster k and 0 otherwise.

The procedure is as follows: 1. First pick k of the n data objects arbitrarily as the initial cluster centers; each remaining object is then assigned, according to its similarity (distance) to these centers, to the cluster represented by the center it is most similar to. 2. Then recompute the center of each new cluster (the mean of all objects in that cluster); repeat this process until the criterion function converges. The mean squared error is usually used as the criterion function, and the k clusters then have the property that each cluster is as compact as possible while the clusters are as far apart as possible. Every update of the cluster centers decreases the objective function, so the iteration eventually brings J to a local minimum; a global minimum is not guaranteed. k-means is very sensitive to noise.

C++ implementation: class ClusterMethod { private:
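The post's C++ ClusterMethod class is cut off above; purely as an illustration of the two alternating steps that drive J down, here is a short NumPy sketch (not the author's code):

```python
# Illustration only (Python, not the author's C++ ClusterMethod):
# alternating assignment and mean-update steps, each of which lowers J.
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]       # arbitrary initial centres
    for _ in range(iters):
        # assignment step: r_nk = 1 for the nearest centre (minimises J over r)
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        # update step: each centre becomes the mean of its points (minimises J over mu)
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[labels]) ** 2).sum()                  # value of the objective at the end
    return labels, mu, J

X = np.random.default_rng(1).normal(size=(150, 2))
labels, mu, J = kmeans(X, 3)
print(mu)
print(J)
```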

KMeans|| for sentiment analysis on Spark

这一生的挚爱 submitted on 2020-01-15 03:05:08
Question: I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel initialization, the algorithm ran for 3 hours! But with the random initialization strategy it took about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM. K ~= 4000
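A hedged PySpark sketch (the question does not show its code) comparing MLlib's two initialization modes; the RDD of 100-dimensional vectors below is synthetic. The "k-means||" initialization performs extra distributed seeding passes whose cost grows with k and initializationSteps, which is one plausible reason a large k (~4000) is far slower than "random".

```python
# Hedged PySpark sketch; the vectors are synthetic stand-ins for the word2vec output.
import random
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-init-comparison")
random.seed(0)
vectors_rdd = sc.parallelize([[random.random() for _ in range(100)] for _ in range(1000)])

# "k-means||" (the default) runs extra distributed seeding passes; "random" does not.
model_parallel = KMeans.train(vectors_rdd, k=50,
                              initializationMode="k-means||", initializationSteps=2)
model_random = KMeans.train(vectors_rdd, k=50, initializationMode="random")

print(model_parallel.computeCost(vectors_rdd), model_random.computeCost(vectors_rdd))
sc.stop()
```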

Scipy Kmeans exits with TypeError

陌路散爱 submitted on 2020-01-15 01:51:14
Question: When running the code below, I'm getting a TypeError that says: "File "_vq.pyx", line 342, in scipy.cluster._vq.update_cluster_means TypeError: type other than float or double not supported" from PIL import Image import scipy, scipy.misc, scipy.cluster NUM_CLUSTERS = 5 im = Image.open('d:/temp/test.jpg') ar = scipy.misc.fromimage(im) shape = ar.shape ar = ar.reshape(scipy.product(shape[:2]), shape[2]) codes, dist = scipy.cluster.vq.kmeans(ar, NUM_CLUSTERS) vecs, dist = scipy.cluster.vq.vq(ar,
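A likely fix, sketched under the assumption that the TypeError comes from passing uint8 pixel data (scipy.cluster.vq.kmeans expects float observations); np.asarray stands in for scipy.misc.fromimage, which has been removed from recent SciPy versions.

```python
# Sketch of the likely fix: cast the uint8 pixel array to float before kmeans.
# np.asarray stands in for scipy.misc.fromimage (removed from recent SciPy versions).
import numpy as np
from PIL import Image
from scipy.cluster.vq import kmeans, vq

NUM_CLUSTERS = 5
im = Image.open('d:/temp/test.jpg')                   # same file path as in the question
ar = np.asarray(im)
ar = ar.reshape(-1, ar.shape[2]).astype(np.float64)   # float observations avoid the TypeError

codes, dist = kmeans(ar, NUM_CLUSTERS)
labels, dists = vq(ar, codes)
print(codes)
```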