k-means

Getting an IOException when running sample code from “Mahout in Action” on mahout-0.6

隐身守侯 submitted on 2019-12-05 05:04:20
I'm learning Mahout and reading "Mahout in Action". When I tried to run the chapter 7 sample code SimpleKMeansClustering.java, an exception popped up: Exception in thread "main" java.io.IOException: wrong value class: 0.0: null is not class org.apache.mahout.clustering.WeightedPropertyVectorWritable at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1874) at SimpleKMeansClustering.main(SimpleKMeansClustering.java:95) This code succeeded on mahout-0.5, but on mahout-0.6 I see this exception. Even after changing the directory name from clusters-0 to clusters-0-final, I'm still facing
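A minimal sketch of reading the clustered points back, assuming the Mahout 0.6 layout where k-means writes its point assignments to output/clusteredPoints/part-m-00000 (the path is illustrative, taken from the book's example). In 0.6 the value class stored there is WeightedPropertyVectorWritable, so declaring any other value type in reader.next() produces exactly this kind of "wrong value class" IOException:

```java
// Sketch only: assumes Mahout 0.6, a Hadoop 0.20/1.x-style SequenceFile API,
// and the book's "output/clusteredPoints/part-m-00000" output location.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedPropertyVectorWritable;

public class ReadClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("output/clusteredPoints/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();                                        // cluster id
    WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable(); // point + weight
    while (reader.next(key, value)) {
      System.out.println(value + " belongs to cluster " + key);
    }
    reader.close();
  }
}
```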

OpenCV K-Means (kmeans2)

我怕爱的太早我们不能终老 submitted on 2019-12-05 04:58:17
I'm using OpenCV's k-means implementation to cluster a large set of 8-dimensional vectors. They cluster fine, but I can't find any way to see the prototypes created by the clustering process. Is this even possible? OpenCV only seems to give access to the cluster indexes (or labels). If not, I guess it'll be time to make my own implementation! I can't say I have used OpenCV's implementation of k-means, but if you have access to the labels given to each instance, you can simply get the centroids by calculating the average vector of the instances belonging to each of the clusters. As of (at least) OpenCV 2.0,
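A minimal sketch of the averaging suggestion above, in plain Java rather than the OpenCV API: given the label array that kmeans returns, each prototype (centroid) is just the mean of the samples assigned to that cluster.

```java
// Plain-Java illustration of recovering centroids from kmeans labels.
public class ClusterCentroids {
  // samples: the 8-d vectors handed to kmeans; labels: cluster index per sample; k: cluster count.
  static double[][] centroids(double[][] samples, int[] labels, int k) {
    int dim = samples[0].length;
    double[][] centers = new double[k][dim];
    int[] counts = new int[k];
    for (int i = 0; i < samples.length; i++) {
      counts[labels[i]]++;
      for (int d = 0; d < dim; d++) {
        centers[labels[i]][d] += samples[i][d];       // accumulate per-cluster sums
      }
    }
    for (int c = 0; c < k; c++) {
      for (int d = 0; d < dim; d++) {
        centers[c][d] /= Math.max(counts[c], 1);      // mean; guard against empty clusters
      }
    }
    return centers;
  }
}
```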

K-means Plotting for 3 Dimensional Data

◇◆丶佛笑我妖孽 submitted on 2019-12-05 03:55:19
Question: I'm working with k-means in MATLAB. I am trying to create the plot/graph, but my data is a three-dimensional array. Here is my k-means code: clc clear all close all load cobat.txt; % read the file k=input('Enter a number: '); % determine the number of clusters isRand=0; % 0 -> sequential initialization % 1 -> random initialization [maxRow, maxCol]=size(cobat); if maxRow<=k, y=[m, 1:maxRow]; elseif k>7 h=msgbox('cant more than 7'); else % initial value of centroid if isRand, p = randperm(size

Apache Spark MLLib - Running KMeans with TF-IDF vectors - Java heap space

随声附和 submitted on 2019-12-05 03:24:14
Question: I'm trying to run KMeans on MLlib from a (large) collection of text documents (TF-IDF vectors). Documents are sent through a Lucene English analyzer, and sparse vectors are created from the HashingTF.transform() function. Whatever degree of parallelism I use (through the coalesce function), KMeans.train always returns the OutOfMemoryError below. Any thoughts on how to tackle this issue? Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at scala.reflect
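For what it's worth, a common cause of this particular heap error is HashingTF's default dimension of 2^20: MLlib's k-means keeps its k centers as dense vectors, so the centers alone can amount to k × 1,048,576 doubles. Below is a hedged Java sketch of the same pipeline with a much smaller feature space and cached RDDs; names, sizes, and the toy documents are illustrative, not the asker's code.

```java
// Sketch of the TF-IDF -> KMeans pipeline with a reduced HashingTF dimension.
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.linalg.Vector;

public class TfIdfKMeansSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "tfidf-kmeans");
    // Tokenized documents (in the question these come from a Lucene English analyzer).
    JavaRDD<List<String>> docs = sc.parallelize(Arrays.asList(
        Arrays.asList("spark", "mllib", "kmeans"),
        Arrays.asList("lucene", "analyzer", "tokens")));
    HashingTF tf = new HashingTF(1 << 16);            // 65,536 features instead of the 2^20 default
    JavaRDD<Vector> tfVectors = tf.transform(docs);   // sparse term-frequency vectors
    tfVectors.cache();
    JavaRDD<Vector> tfidf = new IDF().fit(tfVectors).transform(tfVectors);
    tfidf.cache();
    KMeansModel model = KMeans.train(tfidf.rdd(), 2, 20);
    System.out.println("number of centers: " + model.clusterCenters().length);
    sc.stop();
  }
}
```

If the dimensionality cannot be reduced, raising the driver heap (spark.driver.memory) is the other obvious lever, since the dense centers live on the driver.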

What is the “seed” in Weka's SimpleKMeans clusterer?

断了今生、忘了曾经 submitted on 2019-12-05 03:13:52
Question: I'm using Weka's SimpleKMeans clusterer on a set of data, but I'm unsure what the seed value is, what it does, or how it affects the data. That is, how does a higher or lower seed value affect the result as opposed to the default value of 10? Answer 1: The seed is just a random-number seed. Once the seed is fixed, even a randomized algorithm behaves deterministically. KMeans is not deterministic, so if you want repeatable results, you fix a seed. However, there is no relation at all between the exact value of
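A minimal Java sketch of fixing the seed so that repeated runs of SimpleKMeans produce identical clusterings; the ARFF file name and the values of k and the seed are illustrative.

```java
// Fixing the seed makes the randomized initialization reproducible.
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedDemo {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("iris.arff").getDataSet(); // any ARFF file without a class attribute set
    SimpleKMeans km = new SimpleKMeans();
    km.setNumClusters(3);
    km.setSeed(42);   // any fixed value gives repeatable results; 10 is simply Weka's default, not special
    km.buildClusterer(data);
    System.out.println(km);
  }
}
```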

Bag of Visual Words in OpenCV

試著忘記壹切 submitted on 2019-12-05 02:36:45
Question: I am using BOW in OpenCV for clustering features of variable size. However, one thing is not clear from the OpenCV documentation, and I am also unable to find the reason behind this question: assume the dictionary size is 100. I use SURF to compute the features, and each image has variable-size descriptors, e.g. 128 x 34, 128 x 63, etc. Now in BOW each of them is clustered and I get a fixed descriptor size of 128 x 100 for an image. I know 100 is the cluster center created using kmeans
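A plain-Java concept sketch (not the OpenCV BOW classes) of what happens after the 100-word dictionary is built: each of an image's variable number of 128-d SURF descriptors votes for its nearest cluster center, so the per-image BOW descriptor is a histogram whose length equals the dictionary size, while the 100 x 128 matrix is the vocabulary itself.

```java
// Concept illustration of building a bag-of-visual-words histogram from a learned vocabulary.
public class BowHistogramSketch {
  // imageDescriptors: this image's SURF descriptors (each 128-d); centers: the kmeans vocabulary.
  static double[] bowDescriptor(double[][] imageDescriptors, double[][] centers) {
    double[] hist = new double[centers.length];
    for (double[] d : imageDescriptors) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.length; c++) {
        double dist = 0;
        for (int j = 0; j < d.length; j++) {
          double diff = d[j] - centers[c][j];
          dist += diff * diff;                 // squared Euclidean distance to this visual word
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      hist[best]++;                            // vote for the nearest visual word
    }
    return hist;                               // length == dictionary size (e.g. 100)
  }
}
```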

Colouring ggplot's plotmatrix by k-means clusters?

…衆ロ難τιáo~ submitted on 2019-12-05 01:06:45
Question: I am trying to create a pairs plot of 6 data variables using ggplot2 and colour the points according to the k-means cluster they belong to. I read the documentation of the highly impressive 'GGally' package as well as an informal fix by Adam Laiacano [http://adamlaiacano.tumblr.com/post/13501402316/colored-plotmatrix-in-ggplot2]. Unfortunately, I could not find any way to get the desired output in either. Here is a sample piece of code: #The Swiss fertility dataset has been used here data_ <- read

How to detect multiple objects with OpenCV in C++?

久未见 submitted on 2019-12-05 00:55:06
I got inspiration from this answer here, which is a Python implementation, but I need C++. That answer works very well. The idea is: use detectAndCompute to get keypoints, use kmeans to segment them into clusters, then for each cluster do matcher->knnMatch with that cluster's descriptors, then do the rest like the common single-object detection method. The main problem is: how do I provide descriptors for each cluster's matcher->knnMatch call? I thought we should set the values of the other keypoints' corresponding descriptors to 0 (useless), am I right? And I hit some problems in my attempt: how
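One hedged alternative to zeroing out the other clusters' descriptors, sketched in plain Java with arrays standing in for OpenCV Mats: partition the scene descriptors into one sub-matrix per k-means label and hand each sub-matrix to its own knnMatch call, keeping the original keypoint indices so matches can be mapped back.

```java
// Group descriptor rows by their kmeans label instead of masking them with zeros.
import java.util.ArrayList;
import java.util.List;

public class PerClusterDescriptors {
  // descriptors: one row per keypoint; labels: kmeans label per keypoint; k: number of clusters.
  static List<List<float[]>> splitByCluster(float[][] descriptors, int[] labels, int k) {
    List<List<float[]>> buckets = new ArrayList<>();
    for (int c = 0; c < k; c++) buckets.add(new ArrayList<>());
    for (int i = 0; i < descriptors.length; i++) {
      buckets.get(labels[i]).add(descriptors[i]);   // keep index i alongside if keypoints are needed later
    }
    return buckets;                                  // each bucket becomes the train/query set for one knnMatch
  }
}
```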

OpenCV running kmeans algorithm on an image

こ雲淡風輕ζ submitted on 2019-12-04 23:04:39
Question: I am trying to run kmeans on a 3-channel color image, but every time I try to run the function it seems to crash with the following error: OpenCV Error: Assertion failed (data.dims <= 2 && type == CV_32F && K > 0) in unknown function, file ..\..\..\OpenCV-2.3.0\modules\core\src\matrix.cpp, line 2271 I've included the code below with some comments to help specify what is being passed in. Any help is greatly appreciated. // Load in an image // Depth: 8, Channels: 3 IplImage* iplImage =
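A hedged sketch using the OpenCV Java bindings (OpenCV 3/4 class names; the question itself uses the legacy IplImage C API): the assertion demands a 2-D CV_32F sample matrix, so the 8-bit, 3-channel image has to be reshaped to one pixel per row and converted to float before kmeans is called.

```java
// Reshape an 8UC3 image to (rows*cols) x 3 CV_32F samples, then cluster the pixel colors.
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.core.TermCriteria;
import org.opencv.imgcodecs.Imgcodecs;

public class ImageKMeansSketch {
  public static void main(String[] args) {
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    Mat img = Imgcodecs.imread("input.jpg");                   // illustrative file name; loads as 8UC3
    Mat samples = img.reshape(1, img.rows() * img.cols());     // one pixel per row, 3 columns (B, G, R)
    samples.convertTo(samples, CvType.CV_32F);                 // kmeans asserts CV_32F input
    Mat labels = new Mat();
    Mat centers = new Mat();
    TermCriteria criteria = new TermCriteria(TermCriteria.EPS + TermCriteria.MAX_ITER, 10, 1.0);
    Core.kmeans(samples, 4, labels, criteria, 3, Core.KMEANS_PP_CENTERS, centers);
    System.out.println("cluster centers (one BGR color per row):\n" + centers.dump());
  }
}
```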

Document Clustering Basics

[亡魂溺海] submitted on 2019-12-04 21:43:31
So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild... My questions stem from the process of clustering documents. Let's say I start off with a collection of documents containing only interesting words. What is the first step here? Parse the words from each document and create a giant 'bag-of-words' type model? Do I then proceed to create vectors of word counts for each document? How do I compare these documents using something like k-means clustering? Try tf-idf for starters. If
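A toy, library-free Java sketch of the first steps described above: bag-of-words counts turned into tf-idf weights. Document vectors like these are what typically get fed to k-means, usually compared with cosine similarity rather than raw Euclidean distance; the corpus here is made up purely for illustration.

```java
// Toy tf-idf: term frequency per document weighted by log(N / document frequency).
import java.util.*;

public class TfIdfSketch {
  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("apple", "banana", "apple"),
        Arrays.asList("banana", "cherry"),
        Arrays.asList("apple", "cherry", "cherry"));

    // Document frequency: how many documents contain each term.
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      for (String term : new HashSet<>(doc)) df.merge(term, 1, Integer::sum);
    }

    // Per-document tf-idf vector (printed as a sparse map of term -> weight).
    int n = docs.size();
    for (List<String> doc : docs) {
      Map<String, Double> vec = new HashMap<>();
      for (String term : doc) vec.merge(term, 1.0, Double::sum);                      // raw term counts
      vec.replaceAll((term, tf) -> tf * Math.log((double) n / df.get(term)));         // weight by idf
      System.out.println(vec);
    }
  }
}
```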