cluster-analysis

Cluster one-dimensional data optimally? [closed]

旧时模样 submitted on 2019-12-17 05:03:12

Question (closed as off-topic for Stack Overflow; closed 3 years ago): Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the optimal way to do k-means clustering in one dimension?

Answer 1: Univariate k-means clustering can be solved in O(kn) time (on already sorted input) based on theoretical results on Monge matrices, but the approach was …
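The answer's O(kn) Monge-matrix result aside, the core idea behind Ckmeans.1d.dp is that on sorted 1-D data every optimal cluster is a contiguous segment, which turns optimal k-means into a dynamic program. A plain-Python sketch of that idea (this is the straightforward O(k·n²) version for illustration, not the paper's faster algorithm):

```python
def kmeans_1d(xs, k):
    """Optimal 1-D k-means by dynamic programming over contiguous segments."""
    xs = sorted(xs)
    n = len(xs)
    # prefix sums give the SSE of any segment xs[i:j] in O(1)
    p = [0.0] * (n + 1)
    p2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        p[i + 1] = p[i] + x
        p2[i + 1] = p2[i] + x * x

    def cost(i, j):  # sum of squared deviations of xs[i:j] from its mean
        s, s2, m = p[j] - p[i], p2[j] - p2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # D[c][j] = best SSE splitting the first j points into c clusters
    D = [[INF] * (n + 1) for _ in range(k + 1)]
    D[0][0] = 0.0
    back = [[0] * (n + 1) for _ in range(k + 1)]
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                v = D[c - 1][i] + cost(i, j)
                if v < D[c][j]:
                    D[c][j], back[c][j] = v, i

    # recover the segment boundaries by walking the backpointers
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return [xs[i:j] for i, j in bounds], D[k][n]

# demo: three well-separated groups
clusters, sse = kmeans_1d([12, 11, 30, 2, 1, 10], 3)
```

Unlike Lloyd's algorithm, this is exact: no initialization, no local optima.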

Finding circles in scatter

痞子三分冷 submitted on 2019-12-14 03:32:46

Question: I am working with a sensor and have collected data represented in two lists: filtered_x = [64, 90, 83, 78, 57, 58....] filtered_y = [26, 17, 63, 21, 62, 86....] These give the following scatter plot: [scatter plot not shown] This is part of a calibration process. To complete it, I have to find the circles in the plot and come up with the coordinates of the centers of the circles so the sensor can be calibrated. Which libraries should I use, and how do I go about doing this? I have come across nearest K …
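One common two-step approach (a sketch, not from the thread): first separate the points belonging to each circle, for example with a density-based clusterer such as sklearn's DBSCAN, then fit a circle to each group. The algebraic Kåsa least-squares fit reduces circle fitting to a single linear solve; the numpy sketch below fits one group of points:

```python
import numpy as np

def fit_circle(x, y):
    """Kasa algebraic least-squares circle fit.

    (x-cx)^2 + (y-cy)^2 = r^2 rewritten as the linear system
    x^2 + y^2 = 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2).
    """
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x ** 2 + y ** 2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    return cx, cy, r

# demo on noiseless points from a known circle: center (3, 4), radius 5
t = np.linspace(0, 2 * np.pi, 20, endpoint=False)
cx, cy, r = fit_circle(3 + 5 * np.cos(t), 4 + 5 * np.sin(t))
```

The same fit applied per DBSCAN cluster yields one center per circle; the Kåsa fit is biased toward smaller radii under heavy noise, where a geometric fit would be more accurate.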

Cluster similar curves considering “belongingness”?

我的未来我决定 submitted on 2019-12-14 03:14:04

Question: Currently, I have 6 curves shown in 6 different colors as below. The 6 curves were in fact generated by 6 trials of the same experiment. Ideally they should be the same curve, but due to noise and different trial participants they look similar rather than identical. I wish to create an algorithm that can identify that the 6 curves are essentially the same and cluster them together into one cluster. What similarity metrics should I use? Note: The x-axis does …
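One workable metric (an assumption on my part, not a conclusion from the thread): resample every curve onto a common x-grid, use one minus the Pearson correlation as the pairwise distance (insensitive to vertical offset and scale), and feed it to hierarchical clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
# stand-in data: 6 noisy trials of one experiment plus 2 of a different one
curves = np.vstack(
    [np.sin(x) + 0.1 * rng.normal(size=100) for _ in range(6)]
    + [np.cos(x) + 0.1 * rng.normal(size=100) for _ in range(2)]
)

# 1 - Pearson correlation as distance, then average-linkage clustering;
# curves closer than 0.5 in correlation distance end up in one cluster
d = pdist(curves, metric="correlation")
labels = fcluster(linkage(d, method="average"), t=0.5, criterion="distance")
```

If the trials can also be shifted or stretched along x, a warping-aware distance such as dynamic time warping would be a better fit than plain correlation.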

Louvain community detection in R using igraph - assigns alternating group membership

无人久伴 submitted on 2019-12-14 02:16:11

Question: I have been running Louvain community detection in R using igraph, with thanks to this answer for my previous query. However, I found that the cluster_louvain method seemed to do something strange when assigning group membership, which I think was due to an error in how I imported my data. While I think I have resolved this, I would like to understand what the problem was. I ran Louvain clustering on a 400x400 correlation matrix (i.e., correlation scores for 400 individuals). When I initially …
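For reference, a minimal way to run Louvain on a correlation matrix is to build the weighted graph explicitly, since a mis-built adjacency is a classic source of odd membership patterns. A Python sketch (the thread itself uses R/igraph; this assumes networkx 2.8 or later, which provides louvain_communities):

```python
import numpy as np
from networkx import Graph
from networkx.algorithms.community import louvain_communities

# toy block-structured "correlation matrix": two groups of 10 individuals
C = np.full((20, 20), 0.05)
C[:10, :10] = 0.8
C[10:, 10:] = 0.8
np.fill_diagonal(C, 1.0)

# keep only correlations above a threshold as weighted edges
G = Graph()
G.add_nodes_from(range(20))
for i in range(20):
    for j in range(i + 1, 20):
        if C[i, j] > 0.3:
            G.add_edge(i, j, weight=C[i, j])

communities = louvain_communities(G, weight="weight", seed=42)
```

The key step is the explicit threshold-and-weight construction; if the matrix is imported with shifted rows or treated as an edge list, the resulting graph (and hence the membership vector) comes out scrambled.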

R - 'princomp' can only be used with more units than variables

爷,独闯天下 submitted on 2019-12-14 00:22:20

Question: I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I get the following error when trying k-means clustering and plotting the result: "'princomp' can only be used with more units than variables". I then created a test dataset of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again. Why is this? I need to be able to plot my clusters. When I view my data set after performing kmeans on …
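The error arises because princomp eigendecomposes the covariance matrix, which requires more rows (units) than columns (variables); an SVD-based PCA, prcomp in R, has no such restriction. A small numpy sketch of SVD-based PCA on a wider-than-tall matrix, standing in for the 200x800 case:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 80))   # more variables than units, like 200x800
Xc = X - X.mean(axis=0)         # center each column

# SVD-based PCA works even when columns outnumber rows
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                  # principal-component scores, one column per PC

# with n rows there are at most n (here 20) non-degenerate components;
# the first two columns of `scores` are what a 2-D cluster plot would use
```

In R, the equivalent fix is to plot on prcomp scores instead of princomp, since prcomp also uses the SVD.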

How to cluster with K-means, when number of clusters and their sizes are known [closed]

大城市里の小女人 submitted on 2019-12-14 00:05:01

Question (closed as needing more focus; closed 5 years ago): I'm clustering some data using scikit. I have the easiest possible task: I know the number of clusters, and I know the size of each cluster. Is it possible to specify this information and relay it to the k-means function?

Answer 1: It won't be k-means anymore. K-means is …
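The answer's point, that fixing cluster sizes turns the problem into an assignment problem rather than k-means, can be sketched with scipy: replicate each center once per required member and solve a minimum-cost assignment. (Center positions are assumed known here; in practice you might alternate this step with re-estimating the centers.)

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sized_assignment(X, centers, sizes):
    """Assign points so center i receives exactly sizes[i] points,
    minimizing total squared distance (assignment problem, not k-means)."""
    assert sum(sizes) == len(X)
    # one "slot" per required cluster member, e.g. sizes=[3,2] -> [0,0,0,1,1]
    slots = np.repeat(np.arange(len(centers)), sizes)
    # squared distance from every point to every slot's center
    cost = ((X[:, None, :] - centers[slots][None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    labels = np.empty(len(X), dtype=int)
    labels[rows] = slots[cols]
    return labels

# demo: 3 points near (0,0) and 2 near (10,10), sizes fixed at 3 and 2
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
centers = np.array([[0, 0], [10, 10]], dtype=float)
labels = sized_assignment(X, centers, [3, 2])
```

The Hungarian solve is O(n³) in the worst case, so for large n a greedy or min-cost-flow formulation would be the scalable variant of the same idea.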

short text clustering with large dataset - user profiling

我与影子孤独终老i submitted on 2019-12-13 18:09:56

Question: Let me explain what I want to do. Input: a csv file with millions of rows, each containing the id of a user and a string with the list of keywords used by that user, separated by spaces. The format of the second field, the string, is not so important; I can change it based on my needs, for example by adding the counts of those keywords. The data comes from the Twitter database: users are Twitter users, and keywords are "meaningful" words taken from their tweets (how is not …
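A common scalable recipe for this shape of data (a sketch assuming scikit-learn; the thread does not settle on a library): hash the keyword strings into a fixed-size sparse matrix, so millions of rows never require an in-memory vocabulary, then cluster with MiniBatchKMeans, which processes the rows in small batches:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.cluster import MiniBatchKMeans

# stand-in for the per-user keyword strings from the csv
docs = [
    "cats dogs pets",
    "dogs pets animals",
    "python code bug",
    "code python debug",
]

# stateless hashing: no vocabulary to hold in memory, streams over any size
X = HashingVectorizer(n_features=2 ** 18, alternate_sign=False).transform(docs)
X = TfidfTransformer().fit_transform(X)  # reweight hashed counts by idf

km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

For a true out-of-core run, read the csv in chunks and call `km.partial_fit` on each hashed chunk instead of a single `fit`.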

R combine rows with similar values

廉价感情. submitted on 2019-12-13 16:41:32

Question: I have a dataframe whose row values are ordered from smallest to largest. I compute the differences between adjacent rows, combine rows with similar differences (e.g., smaller than 1), and return the averaged values of the combined rows. I could check each row difference with a for loop, but that seems very inefficient. Any better ideas? Thanks. library(dplyr) DF <- data.frame(ID=letters[1:12], Values=c(1, 2.2, 3, 5, 6.2, 6.8, 7, 8.5, 10, 12.2, 13, 14)) DF <- DF %>% mutate(Diff=c(0, …
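For reference, the same grouping can be done without an explicit loop (a numpy sketch rather than the dplyr pipeline the question starts): a new group begins wherever the gap to the previous value reaches 1, which a cumulative sum over the thresholded diffs encodes directly:

```python
import numpy as np

values = np.array([1, 2.2, 3, 5, 6.2, 6.8, 7, 8.5, 10, 12.2, 13, 14])

# a new group starts wherever the gap to the previous value is >= 1;
# cumsum of those breakpoints yields a group id per row
group = np.concatenate([[0], np.cumsum(np.diff(values) >= 1)])
means = [values[group == g].mean() for g in np.unique(group)]
```

On this data the rows collapse to 8 groups; for example 6.2, 6.8, and 7 merge (adjacent gaps 0.6 and 0.2) and average to 20/3. The identical cumsum-over-breaks trick works in dplyr via `group_by(cumsum(Diff >= 1))`.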

K means clustering mahout

最后都变了- submitted on 2019-12-13 10:29:54

Question: I am trying to cluster a sample dataset in csv format. But when I run the command below, user@ubuntu:/usr/local/mahout/trunk$ bin/mahout kmeans -i /root/Mahout/temp/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c /root/Mahout/temp/parsedtext-kmeans-clusters -o /root/Mahout/reuters21578/root/Mahout/temp/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 2 -k 1 -ow --clustering -cl I get the following error saying there are no input clusters …

Python K-means clustering on document [closed]

纵然是瞬间 submitted on 2019-12-13 09:46:32

Question (closed as needing more focus; closed 3 years ago): Python code: subject1=['data mining','web mining','electronic engineering','cloud computing','Smart Biomaterials','Mathematical modeling'] subject2=['Computer Science','Engineering','Biology'] tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop …
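For reference, a complete runnable version of this kind of TF-IDF plus k-means pipeline (a sketch: the question's min_df=0.2/max_df=0.8 filters would discard nearly every term on nine tiny documents, so they are dropped here, and the cluster count of 3 is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# the subject strings from the question, treated as one document each
docs = ['data mining', 'web mining', 'electronic engineering',
        'cloud computing', 'Smart Biomaterials', 'Mathematical modeling',
        'Computer Science', 'Engineering', 'Biology']

# vectorize, then cluster the TF-IDF rows with plain k-means
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

With documents this short, each row has only one or two non-zero terms, so the clustering mostly groups exact word overlaps (e.g., the two "mining" subjects); on a real corpus the min_df/max_df filters become useful again.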