k-means

Bag of Visual Words in OpenCV

I am using BOW in OpenCV for clustering features of variable size. However, one thing is not clear from the OpenCV documentation, and I am unable to find the reason behind it. Assume a dictionary size of 100. I use SURF to compute the features, and each image has a variable-size descriptor matrix, e.g. 128 x 34, 128 x 63, etc. Now in BOW each of them is clustered and I get a fixed descriptor size of 128 x 100 for an image. I know 100 is the number of cluster centers created using k-means clustering. But I am confused: if an image has 128 x 63 descriptors, then how come it clusters into
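
The question concerns OpenCV's BOWTrainer/BOWImgDescriptorExtractor, but the quantization step can be sketched independently in Python with sklearn (a minimal sketch of the technique, not OpenCV's actual implementation): each of an image's descriptors is assigned to its nearest vocabulary center, and the histogram of those assignments is the fixed-length BOW vector, no matter how many descriptors the image had.

```python
# Minimal BOW-quantization sketch with sklearn (a stand-in for OpenCV's
# BOWTrainer/BOWImgDescriptorExtractor, not the real OpenCV API).
import numpy as np
from sklearn.cluster import KMeans

dictionary_size = 100  # vocabulary size from the question

def build_vocabulary(descriptors_per_image):
    # descriptors_per_image: list of (n_i, 128) arrays, n_i varies per image
    all_desc = np.vstack(descriptors_per_image)  # stack every descriptor
    return KMeans(n_clusters=dictionary_size, n_init=10).fit(all_desc)

def bow_vector(kmeans, descriptors):
    words = kmeans.predict(descriptors)  # nearest center per descriptor
    hist, _ = np.histogram(words, bins=np.arange(dictionary_size + 1))
    return hist / max(hist.sum(), 1)     # fixed length = dictionary_size

# Toy usage: three "images" with 34, 63 and 50 descriptors each
rng = np.random.default_rng(0)
images = [rng.normal(size=(n, 128)).astype(np.float32) for n in (34, 63, 50)]
km = build_vocabulary(images)
print(bow_vector(km, images[1]).shape)   # (100,) even though the image had 63 descriptors
```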

K-means Plotting for 3 Dimensional Data

I'm working with k-means in MATLAB. I am trying to create a plot/graph, but my data is a three-dimensional array. Here is my k-means code:

```matlab
clc
clear all
close all
load cobat.txt;                 % read the file
k = input('Enter a number: ');  % determine the number of clusters
isRand = 0;                     % 0 -> sequential initialization
                                % 1 -> random initialization
[maxRow, maxCol] = size(cobat);
if maxRow <= k,
    y = [m, 1:maxRow];
elseif k > 7
    h = msgbox('cant more than 7');
else
    % initial value of centroid
    if isRand,
        p = randperm(size(cobat, 1));  % random initialization
        for i = 1:k
            c(i,:) = cobat(p(i),:);
        end
    else
        for i = 1:k
            c(i,:) = cobat(i,:);
```
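
The excerpt is cut off before the plotting part the asker is stuck on. A minimal sketch of a 3-D cluster scatter in Python/matplotlib (my own illustration of the technique, not the MATLAB answer; the data and k are placeholders for cobat.txt):

```python
# Minimal 3-D k-means scatter: points colored by cluster, centers marked "x".
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(150, 3)                    # stand-in for the cobat.txt data
km = KMeans(n_clusters=3, n_init=10).fit(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=km.labels_)           # color by cluster
ax.scatter(*km.cluster_centers_.T, marker="x", s=100, c="k")  # cluster centers
plt.show()
```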

Unsupervised learning: k-means

K-means, often called Lloyd's algorithm, is the most classic model in data clustering and a relatively easy one to understand. The algorithm proceeds in four stages:

1. Randomly choose K points in the feature space as the initial cluster centers.
2. For every other point, compute its distance to each of the K centers; an unlabeled point takes the nearest center's class as its label.
3. Once the points are labeled, recompute each cluster's center as the mean of its points.
4. If the new centers are the same as the old centers, stop; otherwise repeat from step 2.

sklearn.cluster.KMeans

```python
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10,
                             max_iter=300, tol=0.0001, precompute_distances='auto',
                             verbose=0, random_state=None, copy_x=True,
                             n_jobs=1, algorithm='auto')
"""
:param n_clusters: the number of clusters to form and the number of centroids to generate
:param init: initialization method; the default 'k-means++' selects the initial
             cluster centers in a smart way to speed up convergence; 'random'
             chooses k observations (rows) at random from the data as initial centroids
:param n_init: int, default 10; the number of times the k-means algorithm is run
               with different centroid seeds
```
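
A quick usage example matching the estimator described above (my own minimal sketch, not from the original post; the toy data is a placeholder):

```python
# Cluster toy 2-D data with the parameters documented above.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster label per sample
print(km.cluster_centers_)  # the two centroids (the "means")
```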

Apache Spark MLlib - Running KMeans with TF-IDF vectors - Java heap space

I'm trying to run KMeans from MLlib on a (large) collection of text documents (TF-IDF vectors). The documents are sent through a Lucene English analyzer, and sparse vectors are created from the HashingTF.transform() function. Whatever degree of parallelism I use (through the coalesce function), KMeans.train always returns the OutOfMemory exception below. Any thoughts on how to tackle this issue?

```
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
    at scala.reflect.ManifestFactory$$anon$12.newArray
```
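
The post uses the Java API; a hedged sketch in PySpark (to stay consistent with the other examples on this page) of two common mitigations: MLlib's KMeans keeps dense cluster centers, so shrinking HashingTF's feature dimension (the default is 2^20) and persisting the vectors to disk-backed storage both reduce heap pressure. The input path and all parameter values are placeholders, not from the original post.

```python
# Sketch: TF-IDF + KMeans with a smaller hashing dimension and disk-backed caching.
from pyspark import SparkContext, StorageLevel
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="tfidf-kmeans")
docs = sc.textFile("docs.txt").map(lambda line: line.split())  # hypothetical input

hashingTF = HashingTF(numFeatures=1 << 16)  # far smaller than the 2**20 default
tf = hashingTF.transform(docs)
tf.persist(StorageLevel.MEMORY_AND_DISK)    # spill to disk instead of blowing the heap

idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)
tfidf.persist(StorageLevel.MEMORY_AND_DISK)

model = KMeans.train(tfidf, k=20, maxIterations=10)
```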

What is the time complexity of k-means?

Question: I was going through the k-means Wikipedia page. Based on the algorithm, I think the complexity is O(n*k*i) (n = total elements, k = number of clusters, i = number of iterations). So can someone explain this statement from Wikipedia to me, and how is this NP-hard? "If k and d (the dimension) are fixed, the problem can be exactly solved in time O(n^(dk+1) · log n), where n is the number of entities to be clustered." Answer 1: It depends on what you call k-means. The problem of finding the global optimum of the k-means
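
To separate the two claims: the O(n·k·i) figure describes Lloyd's heuristic (each iteration also costs a factor of d for the distance computations, i.e. O(n·k·d) per iteration), while the NP-hardness statement is about finding the exact global optimum. A minimal sketch of one Lloyd iteration (my own illustration, not from the answer) makes the per-iteration cost visible:

```python
# One Lloyd iteration: the assignment step alone does n*k*d work computing
# squared distances from n points to k centers in d dimensions.
import numpy as np

def lloyd_iteration(X, centers):
    # X: (n, d) data; centers: (k, d)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k) distances
    labels = d2.argmin(axis=1)                                     # nearest center
    # Update step: mean of each cluster; keep the old center if a cluster empties.
    new_centers = np.array([
        X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers
```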

Python Clustering Algorithms

Question: I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this no a priori linking lengths are known (similar to this question). I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or
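
One common answer to this kind of question (not from the truncated post itself) is mean shift, which estimates both the number of clusters and a length scale from the data. A minimal sklearn sketch with toy data as a placeholder:

```python
# MeanShift infers the cluster count; estimate_bandwidth infers a data-driven scale.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # stand-in for the particles
bandwidth = estimate_bandwidth(X, quantile=0.2)  # characteristic scale from the data itself
ms = MeanShift(bandwidth=bandwidth).fit(X)
print("estimated clusters:", len(ms.cluster_centers_))
```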

Colouring ggplot's plotmatrix by k-means clusters?

I am trying to create a pairs plot of 6 data variables using ggplot2 and colour the points according to the k-means cluster they belong to. I read the documentation of the highly impressive 'GGally' package as well as an informal fix by Adam Laiacano [http://adamlaiacano.tumblr.com/post/13501402316/colored-plotmatrix-in-ggplot2]. Unfortunately, I could not find any way to get the desired output in either. Here is some sample code:

```r
# The Swiss fertility dataset has been used here
data_ <- read.csv("/home/tejaskale/Ubuntu\ One/IUCAA/Datasets/swiss.csv", header=TRUE)
data_ <- na.omit(data_)
u <- c(2
```
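
The question is about R's GGally/ggplot2; purely for contrast, the same idea in Python (the language used in the other examples on this page): a pairs plot with points colored by their k-means labels, using the built-in iris data as a stand-in for the Swiss fertility CSV.

```python
# Pairs plot colored by k-means cluster (Python analogue, not the R answer).
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)
scatter_matrix(df, c=labels, figsize=(8, 8), diagonal="hist")  # color = cluster id
```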

OpenCV running kmeans algorithm on an image

I am trying to run kmeans on a 3-channel color image, but every time I try to run the function it seems to crash with the following error:

```
OpenCV Error: Assertion failed (data.dims <= 2 && type == CV_32F && K > 0) in
unknown function, file ..\..\..\OpenCV-2.3.0\modules\core\src\matrix.cpp, line 2271
```

I've included the code below with some comments to help specify what is being passed in. Any help is greatly appreciated.

```cpp
// Load in an image
// Depth: 8, Channels: 3
IplImage* iplImage = cvLoadImage("C:/TestImages/rainbox_box.jpg");

// Create a matrix from the image
cv::Mat mImage = cv::Mat(iplImage
```
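
The assertion itself spells out the fix: kmeans wants a 2-D CV_32F sample matrix, so the 8-bit H x W x 3 image has to be reshaped to N x 3 and converted to float before clustering. A minimal sketch via Python's cv2 bindings rather than the asker's C++ (K and the criteria values are my own placeholders):

```python
# Reshape/convert the image so it satisfies (data.dims <= 2 && type == CV_32F).
import cv2
import numpy as np

img = cv2.imread("C:/TestImages/rainbox_box.jpg")    # path from the question
samples = img.reshape(-1, 3).astype(np.float32)      # (H*W, 3), CV_32F

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
K = 8
_, labels, centers = cv2.kmeans(samples, K, None, criteria, 3,
                                cv2.KMEANS_RANDOM_CENTERS)

# Optional: paint each pixel with its cluster center (color quantization).
quantized = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)
```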

How to use silhouette score in k-means clustering from sklearn library?

I'd like to use the silhouette score in my script to automatically compute the number of clusters for k-means clustering with sklearn.

```python
import numpy as np
import pandas as pd
import csv
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

filename = "CSV_BIG.csv"
# Read the CSV file with the Pandas lib.
path_dir = ".\\"
dataframe = pd.read_csv(path_dir + filename, encoding="utf-8", sep=';')  # "ISO-8859-1")
df = dataframe.copy(deep=True)

# Use silhouette score
range_n_clusters = list(range(2, 10))
print("Number of clusters from 2 to 9: \n", range_n_clusters)
for n
```
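
The loop is truncated; a hedged, self-contained completion of what it appears to be driving at (toy data stands in for the CSV): fit KMeans for each candidate k and keep the k with the best silhouette score.

```python
# Pick k by maximizing the silhouette score over a candidate range.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # stand-in for the CSV data

best_k, best_score = None, -1.0
for n_clusters in range(2, 10):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)   # in [-1, 1]; higher is better
    print(n_clusters, round(score, 3))
    if score > best_score:
        best_k, best_score = n_clusters, score
print("best k:", best_k)
```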

How to add k-means predicted clusters as a column to a dataframe in Python

Question: I have a question about kmeans clustering in python. So I did the analysis this way:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=12, random_state=1)
new = data._get_numeric_data().dropna(axis=1)
km.fit(new)  # the original post had `kmeans.fit(new)`, a NameError; `km` is meant
predict = km.predict(new)
```

How can I add the column with the cluster results to my first dataframe "data" as an additional column? Thanks! Answer 1: Assuming the column length is the same as each column in your dataframe df, all you need to do is this: df['NEW_COLUMN'] = Series
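
A self-contained sketch of what the (truncated) answer is driving at: since the predictions come back in row order, they can be assigned directly as a new column. The toy dataframe and column names below are placeholders, not from the original post.

```python
# Attach k-means labels to the source dataframe as a new column.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({"x": [1.0, 1.1, 8.0, 8.2], "y": [0.5, 0.4, 9.0, 9.1]})
km = KMeans(n_clusters=2, n_init=10, random_state=1)
data["cluster"] = km.fit_predict(data[["x", "y"]])  # one label per row, in row order
print(data)
```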