cluster-analysis

Cluster one-dimensional data optimally? [closed]

旧时模样 submitted on 2019-12-17 05:03:12

Question (closed as off-topic for Stack Overflow; closed 3 years ago): Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the optimal way to do k-means clustering in one dimension?

Answer 1: Univariate k-means clustering can be solved in O(kn) time (on already sorted input) based on theoretical results on Monge matrices, but the approach was …
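The answer's O(kn) Monge-matrix result aside, the core idea behind Ckmeans.1d.dp is that on sorted 1-D data every optimal cluster is a contiguous segment, which turns optimal k-means into a dynamic program. A plain-Python sketch of that idea (this is the straightforward O(k·n²) version for illustration, not the paper's faster algorithm):

```python
def kmeans_1d(xs, k):
    """Optimal 1-D k-means by dynamic programming over contiguous segments."""
    xs = sorted(xs)
    n = len(xs)
    # prefix sums give the SSE of any segment xs[i:j] in O(1)
    p = [0.0] * (n + 1)
    p2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        p[i + 1] = p[i] + x
        p2[i + 1] = p2[i] + x * x

    def cost(i, j):  # sum of squared deviations of xs[i:j] from its mean
        s, s2, m = p[j] - p[i], p2[j] - p2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # D[c][j] = best SSE splitting the first j points into c clusters
    D = [[INF] * (n + 1) for _ in range(k + 1)]
    D[0][0] = 0.0
    back = [[0] * (n + 1) for _ in range(k + 1)]
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                v = D[c - 1][i] + cost(i, j)
                if v < D[c][j]:
                    D[c][j], back[c][j] = v, i

    # recover the segment boundaries by walking the backpointers
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return [xs[i:j] for i, j in bounds], D[k][n]

# demo: three well-separated groups
clusters, sse = kmeans_1d([12, 11, 30, 2, 1, 10], 3)
```

Unlike Lloyd's algorithm, this is exact: no initialization, no local optima.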

Finding circles in scatter

痞子三分冷 submitted on 2019-12-14 03:32:46

Question: I am working with a sensor and have collected data represented in two lists: filtered_x = [64, 90, 83, 78, 57, 58....] filtered_y = [26, 17, 63, 21, 62, 86....] These give the following scatter plot: [scatter plot not shown] This is part of a calibration process. To complete it, I have to find the circles in the plot and come up with the coordinates of the centers of the circles so the sensor can be calibrated. Which libraries should I use, and how do I go about doing this? I have come across nearest K …
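One common two-step approach (a sketch, not from the thread): first separate the points belonging to each circle, for example with a density-based clusterer such as sklearn's DBSCAN, then fit a circle to each group. The algebraic Kåsa least-squares fit reduces circle fitting to a single linear solve; the numpy sketch below fits one group of points:

```python
import numpy as np

def fit_circle(x, y):
    """Kasa algebraic least-squares circle fit.

    (x-cx)^2 + (y-cy)^2 = r^2 rewritten as the linear system
    x^2 + y^2 = 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2).
    """
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x ** 2 + y ** 2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    return cx, cy, r

# demo on noiseless points from a known circle: center (3, 4), radius 5
t = np.linspace(0, 2 * np.pi, 20, endpoint=False)
cx, cy, r = fit_circle(3 + 5 * np.cos(t), 4 + 5 * np.sin(t))
```

The same fit applied per DBSCAN cluster yields one center per circle; the Kåsa fit is biased toward smaller radii under heavy noise, where a geometric fit would be more accurate.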

Cluster similar curves considering “belongingness”?

我的未来我决定 submitted on 2019-12-14 03:14:04

Question: Currently, I have 6 curves shown in 6 different colors as below. The 6 curves were in fact generated by 6 trials of the same experiment. Ideally they should be the same curve, but due to noise and different trial participants they look similar rather than identical. I wish to create an algorithm that can identify that the 6 curves are essentially the same and cluster them together into one cluster. What similarity metrics should I use? Note: The x-axis does …
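One workable metric (an assumption on my part, not a conclusion from the thread): resample every curve onto a common x-grid, use one minus the Pearson correlation as the pairwise distance (insensitive to vertical offset and scale), and feed it to hierarchical clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
# stand-in data: 6 noisy trials of one experiment plus 2 of a different one
curves = np.vstack(
    [np.sin(x) + 0.1 * rng.normal(size=100) for _ in range(6)]
    + [np.cos(x) + 0.1 * rng.normal(size=100) for _ in range(2)]
)

# 1 - Pearson correlation as distance, then average-linkage clustering;
# curves closer than 0.5 in correlation distance end up in one cluster
d = pdist(curves, metric="correlation")
labels = fcluster(linkage(d, method="average"), t=0.5, criterion="distance")
```

If the trials can also be shifted or stretched along x, a warping-aware distance such as dynamic time warping would be a better fit than plain correlation.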

Louvain community detection in R using igraph - assigns alternating group membership

无人久伴 submitted on 2019-12-14 02:16:11

Question: I have been running Louvain community detection in R using igraph, with thanks to this answer for my previous query. However, I found that the cluster_louvain method seemed to do something strange when assigning group membership, which I think was due to an error in how I imported my data. While I think I have resolved this, I would like to understand what the problem was. I ran Louvain clustering on a 400x400 correlation matrix (i.e., correlation scores for 400 individuals). When I initially …
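For reference, a minimal way to run Louvain on a correlation matrix is to build the weighted graph explicitly, since a mis-built adjacency is a classic source of odd membership patterns. A Python sketch (the thread itself uses R/igraph; this assumes networkx 2.8 or later, which provides louvain_communities):

```python
import numpy as np
from networkx import Graph
from networkx.algorithms.community import louvain_communities

# toy block-structured "correlation matrix": two groups of 10 individuals
C = np.full((20, 20), 0.05)
C[:10, :10] = 0.8
C[10:, 10:] = 0.8
np.fill_diagonal(C, 1.0)

# keep only correlations above a threshold as weighted edges
G = Graph()
G.add_nodes_from(range(20))
for i in range(20):
    for j in range(i + 1, 20):
        if C[i, j] > 0.3:
            G.add_edge(i, j, weight=C[i, j])

communities = louvain_communities(G, weight="weight", seed=42)
```

The key step is the explicit threshold-and-weight construction; if the matrix is imported with shifted rows or treated as an edge list, the resulting graph (and hence the membership vector) comes out scrambled.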

R - 'princomp' can only be used with more units than variables

爷,独闯天下 submitted on 2019-12-14 00:22:20

Question: I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I get the following error when trying k-means clustering and plotting the result: "'princomp' can only be used with more units than variables". I then created a test dataset of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again. Why is this? I need to be able to plot my clusters. When I view my data set after performing kmeans on …
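The error arises because princomp eigendecomposes the covariance matrix, which requires more rows (units) than columns (variables); an SVD-based PCA, prcomp in R, has no such restriction. A small numpy sketch of SVD-based PCA on a wider-than-tall matrix, standing in for the 200x800 case:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 80))   # more variables than units, like 200x800
Xc = X - X.mean(axis=0)         # center each column

# SVD-based PCA works even when columns outnumber rows
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                  # principal-component scores, one column per PC

# with n rows there are at most n (here 20) non-degenerate components;
# the first two columns of `scores` are what a 2-D cluster plot would use
```

In R, the equivalent fix is to plot on prcomp scores instead of princomp, since prcomp also uses the SVD.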

How to cluster with K-means, when number of clusters and their sizes are known [closed]

大城市里の小女人 submitted on 2019-12-14 00:05:01

Question (closed as needing more focus; closed 5 years ago): I'm clustering some data using scikit. I have the easiest possible task: I know the number of clusters, and I know the size of each cluster. Is it possible to specify this information and relay it to the k-means function?

Answer 1: It won't be k-means anymore. K-means is …
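The answer's point, that fixing cluster sizes turns the problem into an assignment problem rather than k-means, can be sketched with scipy: replicate each center once per required member and solve a minimum-cost assignment. (Center positions are assumed known here; in practice you might alternate this step with re-estimating the centers.)

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sized_assignment(X, centers, sizes):
    """Assign points so center i receives exactly sizes[i] points,
    minimizing total squared distance (assignment problem, not k-means)."""
    assert sum(sizes) == len(X)
    # one "slot" per required cluster member, e.g. sizes=[3,2] -> [0,0,0,1,1]
    slots = np.repeat(np.arange(len(centers)), sizes)
    # squared distance from every point to every slot's center
    cost = ((X[:, None, :] - centers[slots][None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    labels = np.empty(len(X), dtype=int)
    labels[rows] = slots[cols]
    return labels

# demo: 3 points near (0,0) and 2 near (10,10), sizes fixed at 3 and 2
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
centers = np.array([[0, 0], [10, 10]], dtype=float)
labels = sized_assignment(X, centers, [3, 2])
```

The Hungarian solve is O(n³) in the worst case, so for large n a greedy or min-cost-flow formulation would be the scalable variant of the same idea.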

short text clustering with large dataset - user profiling

我与影子孤独终老i submitted on 2019-12-13 18:09:56

Question: Let me explain what I want to do. Input: a csv file with millions of rows, each containing the id of a user and a string with the list of keywords used by that user, separated by spaces. The format of the second field, the string, is not so important; I can change it based on my needs, for example by adding the counts of those keywords. The data comes from the Twitter database: users are Twitter users, and keywords are "meaningful" words taken from their tweets (how is not …
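A common scalable recipe for this shape of data (a sketch assuming scikit-learn; the thread does not settle on a library): hash the keyword strings into a fixed-size sparse matrix, so millions of rows never require an in-memory vocabulary, then cluster with MiniBatchKMeans, which processes the rows in small batches:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.cluster import MiniBatchKMeans

# stand-in for the per-user keyword strings from the csv
docs = [
    "cats dogs pets",
    "dogs pets animals",
    "python code bug",
    "code python debug",
]

# stateless hashing: no vocabulary to hold in memory, streams over any size
X = HashingVectorizer(n_features=2 ** 18, alternate_sign=False).transform(docs)
X = TfidfTransformer().fit_transform(X)  # reweight hashed counts by idf

km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

For a true out-of-core run, read the csv in chunks and call `km.partial_fit` on each hashed chunk instead of a single `fit`.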

R combine rows with similar values

廉价感情. submitted on 2019-12-13 16:41:32

Question: I have a dataframe whose row values are ordered from smallest to largest. I compute the differences between adjacent rows, combine rows with similar differences (e.g., smaller than 1), and return the averaged values of the combined rows. I could check each row difference with a for loop, but that seems very inefficient. Any better ideas? Thanks. library(dplyr) DF <- data.frame(ID=letters[1:12], Values=c(1, 2.2, 3, 5, 6.2, 6.8, 7, 8.5, 10, 12.2, 13, 14)) DF <- DF %>% mutate(Diff=c(0, …
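For reference, the same grouping can be done without an explicit loop (a numpy sketch rather than the dplyr pipeline the question starts): a new group begins wherever the gap to the previous value reaches 1, which a cumulative sum over the thresholded diffs encodes directly:

```python
import numpy as np

values = np.array([1, 2.2, 3, 5, 6.2, 6.8, 7, 8.5, 10, 12.2, 13, 14])

# a new group starts wherever the gap to the previous value is >= 1;
# cumsum of those breakpoints yields a group id per row
group = np.concatenate([[0], np.cumsum(np.diff(values) >= 1)])
means = [values[group == g].mean() for g in np.unique(group)]
```

On this data the rows collapse to 8 groups; for example 6.2, 6.8, and 7 merge (adjacent gaps 0.6 and 0.2) and average to 20/3. The identical cumsum-over-breaks trick works in dplyr via `group_by(cumsum(Diff >= 1))`.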

K means clustering mahout

最后都变了- submitted on 2019-12-13 10:29:54

Question: I am trying to cluster a sample dataset in csv format. But when I run the command below, user@ubuntu:/usr/local/mahout/trunk$ bin/mahout kmeans -i /root/Mahout/temp/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c /root/Mahout/temp/parsedtext-kmeans-clusters -o /root/Mahout/reuters21578/root/Mahout/temp/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 2 -k 1 -ow --clustering -cl I get the following error saying there are no input clusters …

Python K-means clustering on document [closed]

纵然是瞬间 submitted on 2019-12-13 09:46:32

Question (closed as needing more focus; closed 3 years ago): Python code: subject1=['data mining','web mining','electronic engineering','cloud computing','Smart Biomaterials','Mathematical modeling'] subject2=['Computer Science','Engineering','Biology'] tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop …
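For reference, a complete runnable version of this kind of TF-IDF plus k-means pipeline (a sketch: the question's min_df=0.2/max_df=0.8 filters would discard nearly every term on nine tiny documents, so they are dropped here, and the cluster count of 3 is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# the subject strings from the question, treated as one document each
docs = ['data mining', 'web mining', 'electronic engineering',
        'cloud computing', 'Smart Biomaterials', 'Mathematical modeling',
        'Computer Science', 'Engineering', 'Biology']

# vectorize, then cluster the TF-IDF rows with plain k-means
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

With documents this short, each row has only one or two non-zero terms, so the clustering mostly groups exact word overlaps (e.g., the two "mining" subjects); on a real corpus the min_df/max_df filters become useful again.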