k-means

Cluster one-dimensional data optimally? [closed]

旧时模样 submitted on 2019-12-17 05:03:12
Question [closed]: Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the optimal way to do k-means clustering in one dimension? Answer 1: Univariate k-means clustering can be solved in O(kn) time (on already sorted input) based on theoretical results on Monge matrices, but the approach was
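For intuition, the core of Ckmeans.1d.dp is a dynamic program over the sorted data: the optimal c-cluster partition of a prefix extends an optimal (c-1)-cluster partition of a shorter prefix. A minimal, unoptimized O(kn²) sketch of that recurrence (the real algorithm adds speedups to reach the cited bounds; this code is illustrative, not the package's implementation):

```python
# DP sketch of optimal 1-D k-means on sorted data (assumes k <= n).
def kmeans_1d(xs, k):
    xs = sorted(xs)
    n = len(xs)
    # prefix sums give O(1) within-segment sum of squared errors
    ps = [0.0] * (n + 1)
    ps2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def sse(i, j):  # cost of putting xs[i:j] into one cluster
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # D[c][j] = min cost of splitting xs[:j] into c clusters
    D = [[INF] * (n + 1) for _ in range(k + 1)]
    B = [[0] * (n + 1) for _ in range(k + 1)]  # split points for backtracking
    D[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cost = D[c - 1][i] + sse(i, j)
                if cost < D[c][j]:
                    D[c][j], B[c][j] = cost, i
    # recover the cluster boundaries
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = B[c][j]
        bounds.append((i, j))
        j = i
    clusters = [xs[i:j] for i, j in reversed(bounds)]
    return clusters, D[k][n]
```

For example, `kmeans_1d([1, 2, 3, 10, 11, 12], 2)` recovers the two obvious groups with total within-cluster SSE 4.0.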

Why am I not getting points around clusters in this k-means implementation?

会有一股神秘感。 submitted on 2019-12-14 04:13:07
Question: In the k-means analysis below I assign a 1 or 0 to indicate whether a word is associated with a user:

cells  = c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
rnames = c("a1","a2","a3","a4","a5","a6","a7","a8","a9")
cnames = c("google","so","test")
x <- matrix(cells, nrow=9, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
# run K-Means
km <- kmeans(x, 3, 15)
# print components of km
print(km)
# plot clusters
plot(x, col = km$cluster)
# plot centers
points(km$centers, col = 1:2, pch = 8)
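One thing worth noticing about the R code above: the data and the centers are 3-dimensional, while the scatter plot shows only two of those dimensions, so centers can look displaced from their clusters. A quick sketch of the same fit in Python (scikit-learn standing in for R's kmeans, purely for inspection):

```python
# Same 9x3 binary matrix; inspect the full 3-D centers directly.
import numpy as np
from sklearn.cluster import KMeans

cells = [1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1]
x = np.array(cells).reshape(9, 3)          # rows = users, cols = words
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
print(km.cluster_centers_)                 # each center has 3 coordinates
print(km.labels_)
```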

How can I vectorize tweets using Spark's MLlib?

流过昼夜 submitted on 2019-12-14 04:10:50
Question: I'd like to turn tweets into vectors for machine learning, so that I can categorize them based on content using Spark's k-means clustering. For example, all tweets relating to Amazon get put into one category. I have tried splitting the tweet into words and creating a vector using HashingTF, which wasn't very successful. Are there any other ways to vectorize tweets? Answer 1: You can try this pipeline: First, tokenize the input tweet (located in the column text). Basically, it creates a new column
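Outside of Spark, the same idea (tokenize, then weight terms) is often done with TF-IDF rather than raw hashing, since IDF downweights ubiquitous tweet terms. A small scikit-learn sketch of that alternative (the tweets and parameters here are made-up illustrations):

```python
# TF-IDF vectorization + k-means on toy tweets (non-Spark sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = [
    "amazon prime delivery was fast",
    "great deal on amazon today",
    "the weather is lovely outside",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(tweets)          # sparse (n_tweets, vocab) matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                      # the two Amazon tweets group together
```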

R - 'princomp' can only be used with more units than variables

爷,独闯天下 submitted on 2019-12-14 00:22:20
Question: I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I get the following error when trying k-means clustering and plotting the result: "'princomp' can only be used with more units than variables". I then created a test set of 10 rows and 10 columns which plots fine, but when I add an extra column I get the error again. Why is this? I need to be able to plot my clusters. When I view my data set after performing kmeans on
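The error comes from R's princomp(), which works via the covariance matrix and therefore requires more observations than variables (200 rows vs 800 columns fails). SVD-based PCA, such as R's prcomp() or scikit-learn's PCA, has no such restriction. A sketch with the question's shape (random data standing in for the real set):

```python
# SVD-based PCA handles n_rows < n_cols; project to 2-D for plotting.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 800))        # 200 units, 800 variables
pts = PCA(n_components=2).fit_transform(X)
print(pts.shape)                       # 200 points, ready to color by cluster
```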

How to cluster with K-means, when number of clusters and their sizes are known [closed]

大城市里の小女人 submitted on 2019-12-14 00:05:01
Question [closed]: I'm clustering some data using scikit. I have the easiest possible task: I know the number of clusters, and I know the size of each cluster. Is it possible to specify this information and pass it to the k-means function? Answer 1: It won't be k-means anymore. K-means is
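As the answer says, fixing cluster sizes changes the problem. One way to see this (a sketch of my own, not the answer's code): with centers in hand, size-constrained assignment becomes a min-cost matching, where each center is duplicated once per available slot:

```python
# Enforce known cluster sizes via min-cost assignment (toy 1-D data).
import numpy as np
from scipy.optimize import linear_sum_assignment

points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
centers = np.array([[0.1], [5.05]])            # assumed known/estimated
sizes = [3, 2]                                 # known cluster sizes
slots = np.repeat(centers, sizes, axis=0)      # one row per capacity slot
cost = ((points[:, None, :] - slots[None, :, :]) ** 2).sum(-1)
_, slot_of_point = linear_sum_assignment(cost) # optimal point -> slot match
slot_cluster = np.repeat(np.arange(len(sizes)), sizes)
labels = slot_cluster[slot_of_point]
print(labels)
```

In a full algorithm you would alternate this assignment step with re-estimating the centers, Lloyd-style.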

K-means clustering in Mahout

最后都变了- submitted on 2019-12-13 10:29:54
Question: I am trying to cluster a sample dataset in CSV format, but when I run the command below:

user@ubuntu:/usr/local/mahout/trunk$ bin/mahout kmeans -i /root/Mahout/temp/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c /root/Mahout/temp/parsedtext-kmeans-clusters -o /root/Mahout/reuters21578/root/Mahout/temp/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 2 -k 1 -ow --clustering -cl

I get the following error, saying there is no input clusters

Python K-means clustering on documents [closed]

纵然是瞬间 submitted on 2019-12-13 09:46:32
Question [closed]: Python code:

subject1 = ['data mining','web mining','electronic engineering','cloud computing','Smart Biomaterials','Mathematical modeling']
subject2 = ['Computer Science','Engineering','Biology']
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop
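The snippet above is cut off mid-argument (presumably stop_words). A runnable sketch of the same pipeline, with parameter choices that are my assumptions rather than the asker's (note that on only six short documents, min_df=0.2 would discard almost every term, so it is dropped here):

```python
# TF-IDF + k-means over the question's short subject strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

subject1 = ['data mining', 'web mining', 'electronic engineering',
            'cloud computing', 'Smart Biomaterials', 'Mathematical modeling']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(subject1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)      # one cluster label per subject string
```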

Understanding the quality of the KMeans algorithm

断了今生、忘了曾经 submitted on 2019-12-13 07:01:25
Question: After reading "Unbalanced factor of KMeans", I am trying to understand how this works. From my examples, I can see that the lower the value of the factor, the better the quality of the k-means clustering, i.e. the more balanced its clusters are. But what is the plain mathematical interpretation of this factor? Is this a known quantity or something? Here are my examples:

C1 = 10
C2 = 100
pdd = [(C1, 10), (C2, 100)]
n = 2 <-- #clusters
total = 110 <-- #points
uf = 10 * 10 + 100 * 100
uf = 10100
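Reproducing the example's arithmetic: the factor is built from the sum of squared cluster sizes, which is minimized when all clusters are equal and grows as sizes skew. The normalization by n and total² below is an assumption inferred from the example's variables, not a quoted definition:

```python
# Sum of squared cluster sizes, plus an assumed normalization.
sizes = [10, 100]                    # the example's C1, C2 point counts
n = len(sizes)                       # 2 clusters
total = sum(sizes)                   # 110 points
sum_sq = sum(c * c for c in sizes)   # 10*10 + 100*100 = 10100
uf = n * sum_sq / total ** 2         # = 1 for perfectly balanced clusters
print(sum_sq, uf)
```

With balanced sizes [55, 55], the same formula gives exactly 1, which matches the intuition that lower values mean more balanced clusterings.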

Similar images: Bag of Features / Visual Word or matching descriptors?

不羁的心 submitted on 2019-12-13 05:26:16
Question: I have an application where, given a reasonably large set of images (let's say 20K) and a query image, I want to find the most similar one. A reasonable approximation is acceptable. To represent each image precisely, I'm using SIFT (a parallel version, to achieve fast computation as well). Now, given the set of n SIFT descriptors (where usually 500 < n < 1000, depending on image size), which can be represented as an n x 128 matrix, from what I've seen in the literature there are two
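The Bag of Features option mentioned in the title works by clustering all descriptors into a visual vocabulary with k-means, then describing each image as a histogram over that vocabulary. A minimal sketch under stated assumptions (random arrays stand in for real SIFT output; the vocabulary size k is arbitrary here, real systems use thousands of words):

```python
# Bag-of-visual-words sketch: k-means vocabulary + per-image histograms.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# stand-in for SIFT: each image contributes ~500 descriptors of dim 128
descriptors_per_image = [rng.normal(size=(500, 128)) for _ in range(4)]

k = 16                                         # toy vocabulary size
vocab = KMeans(n_clusters=k, n_init=3, random_state=0)
vocab.fit(np.vstack(descriptors_per_image))    # cluster ALL descriptors

def bof_histogram(desc):
    words = vocab.predict(desc)                # nearest visual word per descriptor
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()                   # normalize for comparability

hists = [bof_histogram(d) for d in descriptors_per_image]
print(hists[0].shape)
```

Query time then reduces to comparing fixed-length k-vectors instead of matching raw n x 128 descriptor sets, which is what makes this approach scale to 20K images.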

Cannot change number of clusters in KMeansClustering Tensorflow

我们两清 submitted on 2019-12-13 04:46:49
Question: I found this code and it works perfectly. The idea: split my data and train KMeansClustering on it, so I create an InitHook and an iterator and use them for training.

class _IteratorInitHook(tf.train.SessionRunHook):
    """Hook to initialize data iterator after session is created."""

    def __init__(self):
        super(_IteratorInitHook, self).__init__()
        self.iterator_initializer_fn = None

    def after_create_session(self, session, coord):
        """Initialize the iterator after the session has been created."""
        del coord