k-means

Clustering: K-Means

时光怂恿深爱的人放手 submitted on 2019-12-06 11:40:47
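1. What is K-Means?

K-means clustering. Keywords: K seed points, means.

Clustering is a form of unsupervised learning: the classes are not known in advance, and similar objects are automatically grouped into the same cluster.

K-Means is a cluster analysis algorithm. It groups data by repeatedly assigning each point to its nearest seed point and moving each seed point to the mean of the points assigned to it. The idea is simple: given a sample set, partition it into K clusters according to the distances between samples, so that the points within a cluster are as close together as possible and the clusters are as far apart as possible.

2. How K-Means works

Distances are computed as Euclidean distances.

Summary of the steps:
(1) Select k objects from the data as the initial cluster centers.
(2) Compute the distance from each object to every cluster center and assign the object to the nearest center.
(3) Recompute each cluster center.
(4) Repeat steps 2 and 3 in a loop until the maximum number of iterations is reached.
(5) Take the resulting centers as the final cluster centers.

Main advantages:
- The principle is simple, the algorithm is easy to implement, and it converges quickly.
- Clustering quality is good.
- The results are fairly interpretable.
- The only major parameter to tune is the number of clusters k.

Main drawbacks:
- K must be given in advance, and a good value is hard to estimate. Often it is not known beforehand how many classes a data set should be split into. (The ISODATA algorithm arrives at a more reasonable number of clusters K by automatically merging and splitting clusters.)
- K-Means starts from random initial seed points, and that choice matters a great deal: different seed points can produce completely different results. (The K-Means++ algorithm can be used to address this problem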
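
The steps summarized above can be sketched in plain Python as follows. This is a minimal illustration, not the original author's code: the function name `kmeans`, the `seed` parameter, and the early stop when the centers no longer change are assumptions added here, and an empty cluster simply keeps its previous center.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep old center
        # Step 4: stop early once the centers are stable (added assumption).
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Because the initial centers are sampled at random, a different `seed` can give a different result, which is the sensitivity to initialization listed among the drawbacks.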

How to set the initial centers of K-means in OpenCV C++

早过忘川 submitted on 2019-12-06 11:27:22
I am trying to segment an image using OpenCV and k-means. The code that I have just implemented is the following:

    #include "opencv2/objdetect/objdetect.hpp"
    #include "opencv2/highgui/highgui.hpp"
    #include "opencv2/imgproc/imgproc.hpp"
    #include <iostream>
    #include <stdio.h>

    using namespace std;
    using namespace cv;

    int main(int, char** argv)
    {
        Mat src, Imagen2, Imagris, labels, centers, imgfondo;
        src = imread("C:/Users/Sebastian/Documents/Visual Studio 2015/Projects/ClusteringImage/data/leon.jpg");
        imgfondo = imread("C:/Users/Sebastian/Documents/Visual Studio 2015/Projects

11 K-Means: Principles and Examples

核能气质少年 submitted on 2019-12-06 11:05:13
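11 K-Means: Principles and Examples

Unsupervised learning

In unsupervised learning there are only feature values and no target values.

Clustering: the main method is k-means (K is the number of clusters to form).

K-Means steps
(1) Randomly pick K points in the feature space as the initial cluster centers (red, green, blue); here k=3 is given.
(2) For every other point, compute the distance to each of the K centers and label the point with the nearest center, forming 3 groups.
(3) Compute the mean of each of the 3 groups and compare the three means with the three previous centers. If they are the same, clustering ends; if not, take the three means as the new centers and repeat step 2.

K-means performance metric

The silhouette coefficient of a sample i is sc_i = (b_i - a_i) / max(a_i, b_i).

Note: for each sample i in the clustered data, b_i is the mean distance from i to the samples of the nearest other cluster (the smallest such mean over all other clusters), and a_i is the mean distance from i to the other samples in its own cluster. The final score is the mean silhouette coefficient over all samples.

Values of sc_i:
- When b_i >> a_i, the outer distance is much larger than the inner distance and sc_i approaches 1, the ideal case.
- When b_i << a_i, the inner distance is much larger than the outer distance and sc_i approaches -1, the worst case.

The value therefore lies in [-1, 1]. In practice, a score above 0 or 0.1 is already a decent result.

K-Means API

sklearn.cluster.KMeans
- n_clusters=8 (the number of cluster centers to start with)
- labels_: the cluster labels assigned by default (categories, not numeric values); they can be compared with the true values.

sklearn.metrics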
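
To make the a_i and b_i definitions concrete, here is a small pure-Python sketch of the mean silhouette coefficient. The function name `mean_silhouette` is hypothetical, b_i is taken over the nearest other cluster, singleton clusters are scored 0 by convention, and the sketch assumes at least two clusters and no degenerate case where every distance is zero.

```python
import math

def mean_silhouette(points, labels):
    """Mean of sc_i = (b_i - a_i) / max(a_i, b_i) over all samples."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with one sample
            continue
        # a_i: mean distance to the other samples in the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the samples of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(mean_silhouette(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(mean_silhouette(points, [0, 1, 0, 1]))  # clusters that cut across the gap
```

The first labeling scores close to 1 and the second scores below 0, matching the two extreme cases described above.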

Data Mining: K-means

北城余情 submitted on 2019-12-06 08:28:24
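The K-Means method was proposed by MacQueen in 1967. Given a data set X and an integer K (≤ n), K-Means partitions X into K clusters so that, within each cluster, the sum of the distances from all values to the cluster center is minimized.

K-Means clustering consists of the following steps:
[1] Choose initial center points for the K clusters; these are the K means.
[2] Compute the distance between every object and every center point.
[3] Assign each object to the cluster of its nearest center point.
[4] Recompute the center point of each cluster.
[5] Repeat steps 2, 3 and 4 until the algorithm converges.

(The original post illustrates these steps with a sequence of figures and then walks through a concrete example.)

Advantages and disadvantages of K-means:

Advantages:
(1) It is scalable and efficient on large data sets. The complexity is O(tkn), where n is the number of objects, k is the number of clusters and t is the number of iterations; usually k, t << n.
(2) It converges to a local optimum; to search for the global optimum, simulated annealing or a genetic algorithm can be used.

Disadvantages:
(1) The number of clusters must be fixed in advance, and in some applications it is not known beforehand.
(2) The K center points must be specified in advance, and for some categorical (string-valued) attributes it is hard to define a center point.
(3) It cannot handle noisy data.
(4) It cannot handle certain data distributions (for example, concave shapes).

Variants of K-Means:
(1) K-Modes: handles categorical attributes
(2) K-Prototypes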
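
The objective described in the first paragraph can be written out explicitly. The notation here is an assumption added for illustration: C_k is the k-th cluster and \mu_k its center. The text speaks of a sum of distances; the usual formulation, and the one for which the mean is the optimal center, uses squared Euclidean distances.

```latex
J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
```

Step [3] lowers J for fixed centers and step [4] lowers J for fixed assignments, which is why the loop converges, though only to a local optimum.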

I have 2,000,000 points in a 100-dimensional space. How can I cluster them into K (e.g., 1000) clusters?

旧街凉风 submitted on 2019-12-06 08:15:59
Question: The problem is as follows. I have M images and extract N features from each image, and each feature has dimensionality L. Thus, I have M*N features (2,000,000 in my case), each of dimensionality L (100 in my case). I need to cluster these M*N features into K clusters. How can I do it? Thanks.

Answer 1: Do you want 1000 clusters of images, or of features, or of (image, feature) pairs? In any case, it sounds as though you'll have to reduce the data and use simpler methods. One
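
The answer is cut off, so its actual recommendation is not known. One common way to keep k-means affordable at this scale is the mini-batch variant, which updates the centers from small random samples instead of full passes over the data. The sketch below is an assumption-laden illustration in pure Python (the names `minibatch_kmeans`, `nearest_center`, `batch_size` and `n_batches` are made up here), run on a small synthetic data set rather than 2,000,000 real features.

```python
import random

def nearest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])))

def minibatch_kmeans(points, k, batch_size=64, n_batches=200, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many samples each center has absorbed so far
    for _ in range(n_batches):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assigned = [nearest_center(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            step = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - step) * c + step * x
                          for c, x in zip(centers[j], p)]
    return centers

# Synthetic stand-in data: 2000 points in 5 dimensions.
rng = random.Random(1)
points = [tuple(rng.uniform(0, 1) for _ in range(5)) for _ in range(2000)]
centers = minibatch_kmeans(points, k=10)
print(len(centers), len(centers[0]))
```

Each update moves a center toward one sample with a step of 1/count, so every center stays a running average of the samples assigned to it. For real data of this size a vectorized implementation would be needed; the loop above only shows the update rule.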

Show rows on clustered kmeans data

烂漫一生 submitted on 2019-12-06 07:14:59
Hi, I was wondering: when you plot clustered data in a figure window, is there a way to show which row each data point belongs to when you move the cursor over it? From the picture above, I was hoping there would be a way to select or hover over a point and tell which row it belongs to. Here is the code:

    %% dimensionality reduction
    columns = 6
    [U,S,V]=svds(fulldata,columns);

    %% randomly select dataset
    rows = 1000;
    columns = 6;

    %# pick random rows
    indX = randperm( size(fulldata,1) );
    indX = indX(1:rows);

    %# pick random columns
    indY = randperm( size(fulldata,2) );
    indY = indY(1:columns)

How to vectorize json data for KMeans?

[亡魂溺海] submitted on 2019-12-06 05:40:01
I have a number of questions with choices that users are going to answer. They have a format like this:

    question_id, text, choices

For each user I store the answered questions and the selected choices as JSON in MongoDB:

    {user_id: "", "question_answers" : [{"question_id": "choice_id", ..}] }

Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their answers, but I need to convert my user data to numeric vectors like the example in Spark's docs.

kmeans data sample and my desired output:

    0.0 0.0 0.0
    0.1 0.1 0.1
    0.2 0.2 0
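
One possible way to turn such answer documents into numeric vectors is one-hot encoding: each (question, choice) pair gets its own vector position. The sketch below is an illustration under assumptions, not an answer from the original thread: the helper names `build_vocabulary` and `vectorize` and the sample IDs (`u1`, `q1`, `a`, and so on) are invented, and the document layout follows the JSON shape quoted in the question.

```python
import json

def build_vocabulary(users):
    """Map every (question_id, choice_id) pair seen in the data to an index."""
    pairs = set()
    for user in users:
        for answers in user["question_answers"]:
            pairs.update(answers.items())
    return {pair: i for i, pair in enumerate(sorted(pairs))}

def vectorize(user, vocab):
    """One-hot vector: 1.0 where the user picked that choice, else 0.0."""
    vec = [0.0] * len(vocab)
    for answers in user["question_answers"]:
        for pair in answers.items():
            if pair in vocab:
                vec[vocab[pair]] = 1.0
    return vec

# Hypothetical sample documents in the shape described in the question.
raw = """[
  {"user_id": "u1", "question_answers": [{"q1": "a", "q2": "c"}]},
  {"user_id": "u2", "question_answers": [{"q1": "b", "q2": "c"}]}
]"""
users = json.loads(raw)
vocab = build_vocabulary(users)
vectors = [vectorize(u, vocab) for u in users]
for vec in vectors:
    # Space-separated floats, one user per line, like the sample format above.
    print(" ".join(str(x) for x in vec))
```

With this encoding, two users who gave the same answers have identical vectors, and each differing answer adds a fixed amount to the Euclidean distance between them.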

R - cluster analysis on binary weblog data

风流意气都作罢 submitted on 2019-12-06 05:09:16
I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I wanted to do some clustering of this data. My main goal is to find similar users based on their online behaviour. What is a good clustering algorithm for this? I have tried k-means, which does not work well with binary data. I have also tried spherical k-means, skmeans(). I wanted to do a sum of squared error scree plot, but I could not figure out how to get the SSE from skmeans.

    User  link1  link2  link3  link4
    abc1  0      1      1      1
    abc2  1      0      1      0
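
When a clustering routine returns only cluster assignments, the within-cluster sum of squared errors can be computed directly from the data and the labels. The sketch below shows that calculation in Python on the two sample rows; `within_cluster_sse` is a hypothetical helper, not part of any package. Spherical k-means optimizes a cosine-based criterion rather than Euclidean SSE, so this value is only a generic diagnostic for its output.

```python
def within_cluster_sse(rows, labels):
    """Sum of squared Euclidean distances from each row to its cluster mean."""
    clusters = {}
    for row, lab in zip(rows, labels):
        clusters.setdefault(lab, []).append(row)
    total = 0.0
    for members in clusters.values():
        # Column-wise mean of the rows in this cluster.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2
                     for row in members
                     for x, c in zip(row, centroid))
    return total

rows = [
    [0, 1, 1, 1],  # abc1
    [1, 0, 1, 0],  # abc2
]
print(within_cluster_sse(rows, [1, 1]))  # both users in one cluster
print(within_cluster_sse(rows, [1, 2]))  # each user in its own cluster
```

Repeating this for the assignments obtained at each candidate number of clusters gives the values needed for a scree plot.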

unstable result from scipy.cluster.kmeans

╄→尐↘猪︶ㄣ submitted on 2019-12-06 03:15:39
The following code gives different results on every run while clustering the data into 3 parts using the k-means method:

    from numpy import array
    from scipy.cluster.vq import kmeans,vq
    data = array([1,1,1,1,1,1,3,3,3,3,3,3,7,7,7,7,7,7])
    centroids = kmeans(data,3,100) #with 100 iterations
    print (centroids)

Three possible results obtained were:

    (array([1, 3, 7]), 0.0)
    (array([3, 7, 1]), 0.0)
    (array([7, 3, 1]), 0.0)

In fact, only the order of the computed k means differs. But doesn't that make it unstable to decide which data point belongs to which cluster? Any ideas?

That's because if you pass
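
The reply is cut off, so its explanation is not reproduced here. Independently of it, one simple way to get stable cluster numbers from results that differ only in order is to sort the centroids before assigning labels. The sketch below uses plain Python on the 1-D data from the question; `canonical_labels` is a hypothetical helper and handles one-dimensional values only.

```python
def canonical_labels(data, centroids):
    """Sort 1-D centroids, then label each value by its nearest centroid."""
    ordered = sorted(centroids)
    labels = [min(range(len(ordered)), key=lambda j: abs(x - ordered[j]))
              for x in data]
    return ordered, labels

data = [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7]

# The three centroid orders reported in the question.
for centroids in ([1, 3, 7], [3, 7, 1], [7, 3, 1]):
    ordered, labels = canonical_labels(data, centroids)
    print(ordered, labels)
```

All three orders produce the same sorted centroids and the same labels, so the grouping itself is stable even though the cluster indices returned by each run are not.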

Implementing the Elbow Method for finding the optimum number of clusters for K-Means Clustering in R [closed]

ε祈祈猫儿з submitted on 2019-12-06 01:38:58
Question: I want to use K-Means clustering for my dataset. I am using the kmeans() function in R for this:

    k<-kmeans(data,centers=3)
    plotcluster(m,k$cluster)

However, I am not sure what the correct value of K is for this function. I want to try using the Elbow Method for this. Are there any packages in R which
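
The elbow method itself needs no special package: run k-means for a range of K values, record the within-cluster sum of squared errors (SSE) for each, and look for the K after which the SSE stops dropping sharply. The sketch below illustrates the idea in Python rather than R, on made-up 1-D data with three obvious groups. The helper names are hypothetical, and the deterministic farthest-first initialization is an assumption chosen so that the run is reproducible; it is not what R's kmeans() does.

```python
def farthest_first(values, k):
    """Deterministic init: start at the first value, then repeatedly add the
    value farthest from all centers chosen so far."""
    centers = [values[0]]
    while len(centers) < k:
        centers.append(max(values,
                           key=lambda v: min(abs(v - c) for c in centers)))
    return centers

def kmeans_sse(values, k, max_iter=100):
    """Run 1-D k-means and return the within-cluster sum of squared errors."""
    centers = farthest_first(values, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        updated = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sum(min((v - c) ** 2 for c in centers) for v in values)

# Made-up illustration data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
sse_by_k = {k: kmeans_sse(values, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this data the SSE falls steeply up to K=3 and barely changes afterwards, which is the "elbow". Plotting `sse_by_k` against K gives the usual scree plot.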