k-means

Getting an IOException when running sample code from “Mahout in Action” on mahout-0.6

隐身守侯 submitted on 2019-12-05 05:04:20
I'm learning Mahout and reading "Mahout in Action". When I tried to run the chapter 7 sample code SimpleKMeansClustering.java, an exception popped up: Exception in thread "main" java.io.IOException: wrong value class: 0.0: null is not class org.apache.mahout.clustering.WeightedPropertyVectorWritable at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1874) at SimpleKMeansClustering.main(SimpleKMeansClustering.java:95) This code succeeded on mahout-0.5, but on mahout-0.6 I see this exception. Even after changing the directory name from clusters-0 to clusters-0-final, I'm still facing
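A minimal sketch of reading the clustered points back, assuming the Mahout 0.6 layout where k-means writes its point assignments to output/clusteredPoints/part-m-00000 (the path is illustrative, taken from the book's example). In 0.6 the value class stored there is WeightedPropertyVectorWritable, so declaring any other value type in reader.next() produces exactly this kind of "wrong value class" IOException:

```java
// Sketch only: assumes Mahout 0.6, a Hadoop 0.20/1.x-style SequenceFile API,
// and the book's "output/clusteredPoints/part-m-00000" output location.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedPropertyVectorWritable;

public class ReadClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("output/clusteredPoints/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();                                        // cluster id
    WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable(); // point + weight
    while (reader.next(key, value)) {
      System.out.println(value + " belongs to cluster " + key);
    }
    reader.close();
  }
}
```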

OpenCV K-Means (kmeans2)

我怕爱的太早我们不能终老 submitted on 2019-12-05 04:58:17
I'm using OpenCV's k-means implementation to cluster a large set of 8-dimensional vectors. They cluster fine, but I can't find any way to see the prototypes created by the clustering process. Is this even possible? OpenCV only seems to give access to the cluster indexes (or labels). If not, I guess it'll be time to make my own implementation! I can't say I have used OpenCV's implementation of k-means, but if you have access to the labels given to each instance, you can simply get the centroids by calculating the average vector of the instances belonging to each of the clusters. As of (at least) OpenCV 2.0,
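A minimal sketch of the averaging suggestion above, in plain Java rather than the OpenCV API: given the label array that kmeans returns, each prototype (centroid) is just the mean of the samples assigned to that cluster.

```java
// Plain-Java illustration of recovering centroids from kmeans labels.
public class ClusterCentroids {
  // samples: the 8-d vectors handed to kmeans; labels: cluster index per sample; k: cluster count.
  static double[][] centroids(double[][] samples, int[] labels, int k) {
    int dim = samples[0].length;
    double[][] centers = new double[k][dim];
    int[] counts = new int[k];
    for (int i = 0; i < samples.length; i++) {
      counts[labels[i]]++;
      for (int d = 0; d < dim; d++) {
        centers[labels[i]][d] += samples[i][d];       // accumulate per-cluster sums
      }
    }
    for (int c = 0; c < k; c++) {
      for (int d = 0; d < dim; d++) {
        centers[c][d] /= Math.max(counts[c], 1);      // mean; guard against empty clusters
      }
    }
    return centers;
  }
}
```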

K-means Plotting for 3 Dimensional Data

◇◆丶佛笑我妖孽 submitted on 2019-12-05 03:55:19
Question: I'm working with k-means in MATLAB. I am trying to create the plot/graph, but my data is a three-dimensional array. Here is my k-means code: clc clear all close all load cobat.txt; % read the file k=input('Enter a number: '); % determine the number of clusters isRand=0; % 0 -> sequential initialization % 1 -> random initialization [maxRow, maxCol]=size(cobat); if maxRow<=k, y=[m, 1:maxRow]; elseif k>7 h=msgbox('cant more than 7'); else % initial value of centroid if isRand, p = randperm(size

Apache Spark MLLib - Running KMeans with TF-IDF vectors - Java heap space

随声附和 submitted on 2019-12-05 03:24:14
Question: I'm trying to run KMeans on MLlib from a (large) collection of text documents (TF-IDF vectors). Documents are sent through a Lucene English analyzer, and sparse vectors are created from the HashingTF.transform() function. Whatever degree of parallelism I use (through the coalesce function), KMeans.train always returns the OutOfMemoryError below. Any thoughts on how to tackle this issue? Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at scala.reflect
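For what it's worth, a common cause of this particular heap error is HashingTF's default dimension of 2^20: MLlib's k-means keeps its k centers as dense vectors, so the centers alone can amount to k × 1,048,576 doubles. Below is a hedged Java sketch of the same pipeline with a much smaller feature space and cached RDDs; names, sizes, and the toy documents are illustrative, not the asker's code.

```java
// Sketch of the TF-IDF -> KMeans pipeline with a reduced HashingTF dimension.
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.linalg.Vector;

public class TfIdfKMeansSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "tfidf-kmeans");
    // Tokenized documents (in the question these come from a Lucene English analyzer).
    JavaRDD<List<String>> docs = sc.parallelize(Arrays.asList(
        Arrays.asList("spark", "mllib", "kmeans"),
        Arrays.asList("lucene", "analyzer", "tokens")));
    HashingTF tf = new HashingTF(1 << 16);            // 65,536 features instead of the 2^20 default
    JavaRDD<Vector> tfVectors = tf.transform(docs);   // sparse term-frequency vectors
    tfVectors.cache();
    JavaRDD<Vector> tfidf = new IDF().fit(tfVectors).transform(tfVectors);
    tfidf.cache();
    KMeansModel model = KMeans.train(tfidf.rdd(), 2, 20);
    System.out.println("number of centers: " + model.clusterCenters().length);
    sc.stop();
  }
}
```

If the dimensionality cannot be reduced, raising the driver heap (spark.driver.memory) is the other obvious lever, since the dense centers live on the driver.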

What is the “seed” in Weka's SimpleKMeans clusterer?

断了今生、忘了曾经 submitted on 2019-12-05 03:13:52
Question: I'm using Weka's SimpleKMeans clusterer on a set of data, but I'm unsure what the seed value is, what it does, or how it affects the data. That is, how does a higher or lower seed value affect the result as opposed to the default value of 10? Answer 1: The seed is just a random-number seed. Once the seed is fixed, even a randomized algorithm behaves deterministically. KMeans is not deterministic, so if you want repeatable results, you fix a seed. However, there is no relation at all between the exact value of
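A minimal Java sketch of fixing the seed so that repeated runs of SimpleKMeans produce identical clusterings; the ARFF file name and the values of k and the seed are illustrative.

```java
// Fixing the seed makes the randomized initialization reproducible.
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedDemo {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("iris.arff").getDataSet(); // any ARFF file without a class attribute set
    SimpleKMeans km = new SimpleKMeans();
    km.setNumClusters(3);
    km.setSeed(42);   // any fixed value gives repeatable results; 10 is simply Weka's default, not special
    km.buildClusterer(data);
    System.out.println(km);
  }
}
```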

Bag of Visual Words in OpenCV

試著忘記壹切 submitted on 2019-12-05 02:36:45
Question: I am using BOW in OpenCV for clustering features of variable size. However, one thing is not clear from the OpenCV documentation, and I am also unable to find the reason behind this question: assume the dictionary size is 100. I use SURF to compute the features, and each image has variable-size descriptors, e.g. 128 x 34, 128 x 63, etc. Now in BOW each of them is clustered and I get a fixed descriptor size of 128 x 100 for an image. I know 100 is the cluster center created using kmeans
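A plain-Java concept sketch (not the OpenCV BOW classes) of what happens after the 100-word dictionary is built: each of an image's variable number of 128-d SURF descriptors votes for its nearest cluster center, so the per-image BOW descriptor is a histogram whose length equals the dictionary size, while the 100 x 128 matrix is the vocabulary itself.

```java
// Concept illustration of building a bag-of-visual-words histogram from a learned vocabulary.
public class BowHistogramSketch {
  // imageDescriptors: this image's SURF descriptors (each 128-d); centers: the kmeans vocabulary.
  static double[] bowDescriptor(double[][] imageDescriptors, double[][] centers) {
    double[] hist = new double[centers.length];
    for (double[] d : imageDescriptors) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.length; c++) {
        double dist = 0;
        for (int j = 0; j < d.length; j++) {
          double diff = d[j] - centers[c][j];
          dist += diff * diff;                 // squared Euclidean distance to this visual word
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      hist[best]++;                            // vote for the nearest visual word
    }
    return hist;                               // length == dictionary size (e.g. 100)
  }
}
```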

Colouring ggplot's plotmatrix by k-means clusters?

…衆ロ難τιáo~ submitted on 2019-12-05 01:06:45
Question: I am trying to create a pairs plot of 6 data variables using ggplot2 and colour the points according to the k-means cluster they belong to. I read the documentation of the highly impressive 'GGally' package as well as an informal fix by Adam Laiacano [http://adamlaiacano.tumblr.com/post/13501402316/colored-plotmatrix-in-ggplot2]. Unfortunately, I could not find any way to get the desired output in either. Here is a sample piece of code: #The Swiss fertility dataset has been used here data_ <- read

How to detect multiple objects with OpenCV in C++?

久未见 submitted on 2019-12-05 00:55:06
I got inspiration from this answer here, which is a Python implementation, but I need C++. That answer works very well. The idea is: use detectAndCompute to get keypoints, use kmeans to segment them into clusters, then for each cluster do matcher->knnMatch with that cluster's descriptors, then do the rest like the common single-object detection method. The main problem is: how do I provide descriptors for each cluster's matcher->knnMatch call? I thought we should set the values of the other keypoints' corresponding descriptors to 0 (useless), am I right? And I hit some problems in my attempt: how
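One hedged alternative to zeroing out the other clusters' descriptors, sketched in plain Java with arrays standing in for OpenCV Mats: partition the scene descriptors into one sub-matrix per k-means label and hand each sub-matrix to its own knnMatch call, keeping the original keypoint indices so matches can be mapped back.

```java
// Group descriptor rows by their kmeans label instead of masking them with zeros.
import java.util.ArrayList;
import java.util.List;

public class PerClusterDescriptors {
  // descriptors: one row per keypoint; labels: kmeans label per keypoint; k: number of clusters.
  static List<List<float[]>> splitByCluster(float[][] descriptors, int[] labels, int k) {
    List<List<float[]>> buckets = new ArrayList<>();
    for (int c = 0; c < k; c++) buckets.add(new ArrayList<>());
    for (int i = 0; i < descriptors.length; i++) {
      buckets.get(labels[i]).add(descriptors[i]);   // keep index i alongside if keypoints are needed later
    }
    return buckets;                                  // each bucket becomes the train/query set for one knnMatch
  }
}
```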

OpenCV running kmeans algorithm on an image

こ雲淡風輕ζ submitted on 2019-12-04 23:04:39
Question: I am trying to run kmeans on a 3-channel color image, but every time I try to run the function it seems to crash with the following error: OpenCV Error: Assertion failed (data.dims <= 2 && type == CV_32F && K > 0) in unknown function, file ..\..\..\OpenCV-2.3.0\modules\core\src\matrix.cpp, line 2271 I've included the code below with some comments to help specify what is being passed in. Any help is greatly appreciated. // Load in an image // Depth: 8, Channels: 3 IplImage* iplImage =
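A hedged sketch using the OpenCV Java bindings (OpenCV 3/4 class names; the question itself uses the legacy IplImage C API): the assertion demands a 2-D CV_32F sample matrix, so the 8-bit, 3-channel image has to be reshaped to one pixel per row and converted to float before kmeans is called.

```java
// Reshape an 8UC3 image to (rows*cols) x 3 CV_32F samples, then cluster the pixel colors.
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.core.TermCriteria;
import org.opencv.imgcodecs.Imgcodecs;

public class ImageKMeansSketch {
  public static void main(String[] args) {
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    Mat img = Imgcodecs.imread("input.jpg");                   // illustrative file name; loads as 8UC3
    Mat samples = img.reshape(1, img.rows() * img.cols());     // one pixel per row, 3 columns (B, G, R)
    samples.convertTo(samples, CvType.CV_32F);                 // kmeans asserts CV_32F input
    Mat labels = new Mat();
    Mat centers = new Mat();
    TermCriteria criteria = new TermCriteria(TermCriteria.EPS + TermCriteria.MAX_ITER, 10, 1.0);
    Core.kmeans(samples, 4, labels, criteria, 3, Core.KMEANS_PP_CENTERS, centers);
    System.out.println("cluster centers (one BGR color per row):\n" + centers.dump());
  }
}
```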

Document Clustering Basics

[亡魂溺海] submitted on 2019-12-04 21:43:31
So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild... My questions stem from the process of clustering documents. Let's say I start off with a collection of documents containing only interesting words. What is the first step here? Parse the words from each document and create a giant 'bag-of-words' type model? Do I then proceed to create vectors of word counts for each document? How do I compare these documents using something like k-means clustering? Try tf-idf for starters. If
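A toy, library-free Java sketch of the first steps described above: bag-of-words counts turned into tf-idf weights. Document vectors like these are what typically get fed to k-means, usually compared with cosine similarity rather than raw Euclidean distance; the corpus here is made up purely for illustration.

```java
// Toy tf-idf: term frequency per document weighted by log(N / document frequency).
import java.util.*;

public class TfIdfSketch {
  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("apple", "banana", "apple"),
        Arrays.asList("banana", "cherry"),
        Arrays.asList("apple", "cherry", "cherry"));

    // Document frequency: how many documents contain each term.
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      for (String term : new HashSet<>(doc)) df.merge(term, 1, Integer::sum);
    }

    // Per-document tf-idf vector (printed as a sparse map of term -> weight).
    int n = docs.size();
    for (List<String> doc : docs) {
      Map<String, Double> vec = new HashMap<>();
      for (String term : doc) vec.merge(term, 1.0, Double::sum);                      // raw term counts
      vec.replaceAll((term, tf) -> tf * Math.log((double) n / df.get(term)));         // weight by idf
      System.out.println(vec);
    }
  }
}
```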