k-means

How to use silhouette score in k-means clustering from sklearn library?

落花浮王杯 submitted on 2019-12-04 19:50:58
Question: I'd like to use the silhouette score in my script to automatically determine the number of clusters for k-means clustering with sklearn.
    import numpy as np
    import pandas as pd
    import csv
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    filename = "CSV_BIG.csv"
    # Read the CSV file with the Pandas lib.
    path_dir = ".\\"
    dataframe = pd.read_csv(path_dir + filename, encoding = "utf-8", sep = ';' ) # "ISO-8859-1")
    df = dataframe.copy(deep=True)
    #Use silhouette score
    range_n
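One common way to approach this is to fit `KMeans` for several candidate values of k and keep the one with the highest mean silhouette. The sketch below assumes scikit-learn and NumPy are installed; `best_k_by_silhouette` is a hypothetical helper name, and the synthetic blobs stand in for the numeric columns of the poster's CSV, which are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values, random_state=0):
    """Return (best_k, scores), where scores maps each k to its mean silhouette."""
    scores = {}
    for k in k_values:
        # n_init is set explicitly because its default differs across sklearn versions.
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data (an assumption): three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

# The silhouette is only defined for at least 2 clusters, so start at k=2.
best_k, scores = best_k_by_silhouette(X, range(2, 7))
print(best_k, scores)
```

The silhouette is a heuristic, so on real data it is worth inspecting the whole `scores` dictionary rather than trusting the maximum alone.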

Mahout Clustering in Practice: Canopy + K-means

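旧巷老猫 submitted on 2019-12-04 18:57:09
Mahout is a top-level Apache open-source project. It grew out of Lucene and is built on Hadoop, and it provides efficient implementations of classic machine learning algorithms for large-scale data. For the classic clustering algorithms it offers both single-machine implementations and Hadoop-based distributed implementations, all of which make very good study material.
Cluster analysis: Clustering can be understood simply as dividing data objects into multiple clusters, where all the data objects in a cluster share some similarity, so that a cluster can be treated as a single unit. This can improve the quality of a computation or reduce the amount of computation. There are quite a few classic measures of similarity between data objects, but the data structure they require is essentially the same: the vector. Common ones include Euclidean distance, cosine distance, and the Pearson correlation coefficient. Mahout provides implementations of all of these, and when you implement your own clustering you can switch between distance measures through an interface.
Data model: During cluster analysis in Mahout, data objects are converted into vectors (Vector) for computation. The interface in Mahout is org.apache.mahout.math.Vector, in which each field is represented by a floating-point number (double). You can implement your own vector model by extending a Mahout base class such as AbstractVector, or you can use one of the existing implementations directly, as follows: 1. DenseVector, which is implemented as an array of floating-point numbers and stores every field of the vector; it is suited to storing dense vectors. 2.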
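To make the three similarity measures named above concrete, here is a minimal sketch in plain Python. It is not Mahout's Java API; the function names are hypothetical, and the vectors are ordinary lists of floats standing in for dense vectors.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation coefficient; both vectors must be non-constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length
print(euclidean_distance(u, v))
print(cosine_distance(u, v))
print(pearson_correlation(u, v))
```

The two vectors point in the same direction, so the Euclidean distance is clearly non-zero while the cosine distance is approximately zero, which is why the choice of measure can change which points end up in the same cluster.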

Accurately detect color regions in an image using K-means clustering

与世无争的帅哥 submitted on 2019-12-04 18:35:30
I'm using K-means clustering for color-based image segmentation. I have a 2D image with three colors: black, white, and green. Here is the image. I want K-means to produce 3 clusters: one for the green region, one for the white region, and one for the black region. Here is the code I used:
    %Clustering color regions in an image.
    %Step 1: read the image using imread, and show it using imshow.
    img = (imread('img.jpg'));
    figure, imshow(img), title('X axis rock cut'); %figure is for creating a figure window.
    text(size(img,2),size(img,1)+15,...
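As an illustration of the idea (in Python rather than MATLAB, and not the poster's code), the sketch below runs plain Lloyd's k-means on a handful of RGB pixels. Seeding the centers with the nominal black, white, and green values is an assumption made here so that cluster indices map predictably to colors; the pixel values are made up.

```python
def kmeans(points, centers, iterations=20):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
            for point in points
        ]
        # Update step: mean of the points assigned to each center.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(ch) / len(members) for ch in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # converged: centers stopped moving
        centers = new_centers
    return centers, labels


# Made-up pixels: three noisy samples each of black, white, and green.
pixels = [
    (5, 8, 3), (12, 10, 9), (0, 4, 6),
    (250, 248, 252), (240, 245, 238), (255, 255, 250),
    (20, 200, 30), (35, 180, 25), (10, 220, 40),
]
initial = [(0, 0, 0), (255, 255, 255), (0, 255, 0)]  # black, white, green
centers, labels = kmeans(pixels, initial)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

For a real image the pixel list would come from reshaping the image array to one row per pixel, and JPEG compression noise may blur the boundaries between the three colors.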

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

烈酒焚心 submitted on 2019-12-04 17:24:37
Question: I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute k <- kmeans(norm,center=3) I receive the following error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) Can you help me? Thank you!
Answer 1: kmeans cannot handle data that has NA values. The mean and variance are then no longer well defined, and you no longer know which center is closest.
Answer 2: Error in do_one(nmeth) : NA/NaN/Inf
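The check that Answer 1 implies can be sketched as follows. This is a Python analogue rather than R code; `drop_non_finite_rows` is a hypothetical helper, and whether dropping rows (as opposed to imputing values) is appropriate depends on the data.

```python
import math


def drop_non_finite_rows(rows):
    """Split rows into (clean, dropped_indices).

    A row is dropped if any value is missing, non-numeric, NaN, or +/-Inf.
    """
    clean, dropped = [], []
    for i, row in enumerate(rows):
        if all(isinstance(v, (int, float)) and math.isfinite(v) for v in row):
            clean.append(row)
        else:
            dropped.append(i)
    return clean, dropped


# Made-up table standing in for "norm".
norm = [
    [0.1, 0.5],
    [float("nan"), 0.2],
    [0.3, float("inf")],
    [0.9, 0.4],
    [0.7, None],
]
clean, dropped = drop_non_finite_rows(norm)
print(clean)    # [[0.1, 0.5], [0.9, 0.4]]
print(dropped)  # [1, 2, 4]
```

Reporting the dropped indices makes it easier to see where the bad values come from, for example a division by zero during normalization.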

Building More Accurate Clusters with Gaussian Mixture Models

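笑着哭i submitted on 2019-12-04 16:19:30
Introduction
I really enjoy working on unsupervised learning problems. They pose a completely different challenge from supervised learning problems, and they leave much more room to experiment with the data I have. There is no doubt that most of the developments and breakthroughs in machine learning have happened in unsupervised learning.
One of the most popular techniques in unsupervised learning is clustering. It is a concept we usually learn early on in machine learning, and it is easy to understand. I'm sure you have come across, or even worked on, projects such as customer segmentation and market basket analysis.
But clustering has many facets. It is not limited to the basic algorithms we learned earlier. It is a powerful unsupervised learning technique that we can apply accurately in the real world.
> Gaussian mixture models are the clustering algorithm I want to discuss in this article.
Want to forecast the sales of your favorite product? Or perhaps you want to understand customer churn through the lens of different customer groups. Whatever the use case, you will find Gaussian mixture models very useful.
In this article we take a bottom-up approach. We first look at the basics of clustering, including a quick review of the k-means algorithm. Then we dive into the concepts behind Gaussian mixture models and implement them in Python.
Table of contents: Introduction to clustering; Introduction to k-means clustering; Drawbacks of k-means clustering; Introduction to Gaussian mixture models; The Gaussian distribution; The expectation-maximization (EM) algorithm; Expectation-maximization for Gaussian mixture models; Implementing Gaussian mixture models for clustering in Python
Introduction to clustering
Before we get into the substance of Gaussian mixture models, let's quickly refresh some basic concepts. Note: if you are already familiar with the idea behind clustering and how the k-means clustering algorithm works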

Color quantization of an image using K-means clustering (using RGB features)

此生再无相见时 submitted on 2019-12-04 13:58:04
Question: Is it possible to cluster on RGB + spatial features of images with MATLAB? NOTE: I want to use kmeans for clustering. Basically I want to do one thing: I want to get this image from this
Answer 1: I think you are looking for color quantization.
    [imgQ,map]= rgb2ind(img,4,'nodither'); %change this 4 to the number of desired colors
    %in quantized image
    imshow(imgQ,map);
Result: Using kmeans :
    %img is the original image
    imgVec=[reshape(img(:,:,1),[],1) reshape(img(:,:,2),[],1) reshape(img(:,:
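For the "RGB + spatial features" part of the question, one possible way to build the feature vectors is sketched below in Python (not MATLAB, and not taken from the answer). Each pixel becomes a 5-tuple of color plus weighted position; `rgb_xy_features` and the `spatial_weight` parameter are hypothetical names, and the weight value is an assumption that would need tuning.

```python
def rgb_xy_features(image, spatial_weight=1.0):
    """Turn a row-major image of (r, g, b) tuples into (r, g, b, w*x, w*y) features."""
    features = []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            features.append((r, g, b, spatial_weight * x, spatial_weight * y))
    return features


# Made-up 2x2 image: red, green on the first row; blue, white on the second.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
features = rgb_xy_features(image, spatial_weight=0.5)
print(features)
```

The resulting list can be passed to any k-means implementation. A larger `spatial_weight` favors compact regions, while a value of 0 reduces the problem to pure color quantization.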

Interpreting output from mahout clusterdumper

最后都变了- submitted on 2019-12-04 12:35:52
Question: I ran a clustering test on crawled pages (more than 25K docs; a personal data set). I then ran clusterdump:
    $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt
The cluster dumper output shows 25 elements of the form "VL-xxxxx {}":
    VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]} ..

I have 2,000,000 points in 100 dimensionality space. How can I cluster them to K (e.g., 1000) clusters?

旧巷老猫 submitted on 2019-12-04 12:22:02
The problem is as follows. I have M images and extract N features from each image, and each feature has dimensionality L. Thus, I have M*N features (2,000,000 in my case), each of dimensionality L (100 in my case). I need to cluster these M*N features into K clusters. How can I do it? Thanks.
Do you want 1000 clusters of images, of features, or of (image, feature) pairs? In any case, it sounds as though you'll have to reduce the data and use simpler methods. One possibility is two-pass K-cluster: a) split the 2 million data points into 32 clusters, b) split each of

Kmeans matlab “Empty cluster created at iteration 1” error

拜拜、爱过 submitted on 2019-12-04 10:49:26
Question: I'm using this script to cluster a set of 3D points with the MATLAB kmeans function, but I always get the error "Empty cluster created at iteration 1". The script I'm using:
    [G,C] = kmeans(XX, K, 'distance','sqEuclidean', 'start','sample');
XX can be found at this link (XX value), and K is set to 3. Could anyone advise me why this is happening?
Answer 1: It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is
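The situation Answer 1 describes can be reproduced in a few lines. The Python sketch below is not MATLAB's implementation: it shows one way a cluster can end up empty (two start centers sampled from duplicate points, which is an assumption and not necessarily the poster's cause) and one common remedy, re-seeding the empty cluster with the point farthest from its own center. MATLAB's kmeans documents an 'EmptyAction' option with a 'singleton' value for a similar strategy; availability and defaults may vary by version.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return the index of the nearest center for each point (ties go to the lowest index)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]


def reseed_empty_clusters(points, centers, labels):
    """Move each empty cluster's center onto the point farthest from its own center."""
    centers = list(centers)
    labels = list(labels)
    for j in range(len(centers)):
        if j in labels:
            continue
        # Only take points from clusters that would keep at least one other member.
        counts = {lab: labels.count(lab) for lab in set(labels)}
        candidates = [i for i, lab in enumerate(labels) if counts[lab] > 1]
        far = max(candidates, key=lambda i: sq_dist(points[i], centers[labels[i]]))
        centers[j] = points[far]
        labels[far] = j
    return centers, labels


# Made-up 3D points; the first two are duplicates.
points = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (9.0, 9.0, 9.0)]
# Start centers sampled from the data happen to include both duplicates.
centers = [points[0], points[1], points[4]]

labels = assign(points, centers)
print(labels)  # [0, 0, 0, 0, 2]: cluster 1 received no points
centers, labels = reseed_empty_clusters(points, centers, labels)
print(labels)  # [0, 0, 0, 1, 2]
```

Running the clustering several times with different random starts is another common way to avoid an unlucky initialization.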

ELKI Kmeans clustering Task failed error for high dimensional data

只愿长相守 submitted on 2019-12-04 05:51:43
Question: I have 60000 documents, which I processed in gensim to get a 60000*300 matrix. I exported this as a CSV file. When I import it into ELKI and run k-means clustering, I get the error below.
    Task failed
    de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
    Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation
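One reading of the "Available types" line is that the rows were parsed with between 266 and 300 numeric values each, so the data does not form a fixed-dimensional vector field. Under that assumption, a quick check of the exported CSV could look like the Python sketch below; `column_count_histogram` is a hypothetical helper, and the inline sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def column_count_histogram(lines, delimiter=","):
    """Count how many rows have each number of non-empty fields."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        counts[sum(1 for field in row if field.strip())] += 1
    return counts


# Tiny stand-in for the exported matrix: the second row is missing a value.
sample = io.StringIO("0.1,0.2,0.3\n0.4,0.5\n0.6,0.7,0.8\n")
histogram = column_count_histogram(sample)
print(dict(histogram))  # {3: 2, 2: 1}
```

For a real file, an open file handle can be passed in place of the `StringIO` object. More than one key in the histogram means the rows are not all the same length, which would be worth fixing before importing the file.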