cluster-analysis

Clustering: how to find the nearest cluster

喜夏-厌秋 submitted on 2019-12-25 18:44:36
Question: Hints I got on a different question puzzled me quite a bit. I was given an exercise, actually part of a larger exercise: (1) cluster some data using hclust (done); (2) given a totally new vector, find out which of the clusters from step 1 it is nearest to. According to the exercise, this should take quite a short time. However, after weeks I am puzzled whether this can be done at all, as apparently all I really get from hclust is a tree, and not, as I assumed, a number of clusters. As I suppose …
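
One common resolution, sketched below in Python with scipy's hierarchical-clustering tools (a rough analogue of R's hclust and cutree; the data, the linkage method, and k = 3 are made-up stand-ins): cut the tree into k flat clusters, then assign the new vector to the cluster whose centroid is nearest.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 4))          # stand-in for the asker's data

# Hierarchical clustering (scipy's analogue of R's hclust), then cut the
# tree into k flat clusters (the analogue of R's cutree).
Z = linkage(data, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

# Assign a brand-new vector to the nearest cluster by centroid distance.
centroids = np.array([data[labels == k].mean(axis=0)
                      for k in np.unique(labels)])
new_vec = rng.normal(size=4)
nearest = np.argmin(np.linalg.norm(centroids - new_vec, axis=1)) + 1
print(nearest)   # 1-based cluster label, matching fcluster's labels
```

The centroid step is the key point: hclust only returns a dendrogram, so "the clusters" exist only after you cut the tree, and any nearest-cluster query needs a per-cluster summary (here the mean) to compare against.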

Using scipy kmeans for cluster analysis

拟墨画扇 submitted on 2019-12-25 17:45:18
Question: I want to understand scipy.cluster.vq.kmeans. Given a number of points distributed in 2D space, the problem is to group them into clusters. This problem came to my attention while reading this question, and I was thinking that scipy.cluster.vq.kmeans would be the way to go. This is the data: Using the following code, the aim would be to get the center point of each of the 25 clusters.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import vq, kmeans, whiten
pos = np.arange(0,20 …
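
The question's data is truncated, but the general pattern with scipy.cluster.vq can be sketched as follows (the 25 synthetic blobs below are a made-up stand-in for the asker's grid of points): whiten the features, run kmeans, then undo the whitening to recover centers in the original coordinates.

```python
import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten

rng = np.random.default_rng(1)
# Synthetic stand-in: 25 tight blobs centered on a 5x5 grid.
centers_true = np.array([[i, j] for i in range(0, 20, 4)
                                for j in range(0, 20, 4)], dtype=float)
pts = np.concatenate([c + rng.normal(scale=0.3, size=(40, 2))
                      for c in centers_true])

# whiten() rescales each feature to unit variance, which scipy's
# kmeans expects; the codebook it returns lives in whitened space.
w = whiten(pts)
codebook, distortion = kmeans(w, 25)

# Map each point to its nearest center, then undo the whitening to get
# the centers back in the original coordinates.
labels, _ = vq(w, codebook)
centers = codebook * pts.std(axis=0)
print(centers.shape)
```

Note that scipy's kmeans can return fewer than k centroids if a cluster empties out during iteration, so the codebook size is worth checking rather than assuming.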

Select the most dissimilar individual using cluster analysis

北慕城南 submitted on 2019-12-25 17:38:06
Question: I want to cluster my data into, say, 5 clusters, and then select the 50 individuals with the most dissimilar relationship from all the data. That means if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth. Data example:

mydata <- matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))

What I have done so far is cluster the …
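
The described scheme is proportional stratified sampling over clusters. A minimal Python sketch (the original example is R; here kmeans2 stands in for the asker's unspecified clustering step, and "most dissimilar" is interpreted as farthest from the cluster centroid, which is an assumption):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
mydata = rng.normal(size=(100, 10))      # mirrors the R example's 100x10 matrix

k, n_select = 5, 50
centroids, labels = kmeans2(mydata, k, minit="++", seed=3)

# From each cluster take a share proportional to its size; within a
# cluster, prefer the points farthest from the centroid ("most dissimilar").
chosen = []
for c in range(k):
    idx = np.flatnonzero(labels == c)
    take = round(n_select * len(idx) / len(mydata))
    dist = np.linalg.norm(mydata[idx] - centroids[c], axis=1)
    chosen.extend(idx[np.argsort(dist)[::-1][:take]])
print(len(chosen))   # close to 50; per-cluster rounding can shift it slightly
```

Because the per-cluster quotas are rounded independently, the total can land a point or two off 50; a final top-up or trim step would fix that exactly.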

How to index with ELKI - OPTICS clustering

跟風遠走 submitted on 2019-12-25 14:24:10
Question: I'm an ELKI beginner, and I've been using it to cluster around 10K lat-lon points from a .csv file. Once I get my settings correct, I'd like to scale up to 1MM points. I'm using the OPTICSXi algorithm with LngLatDistanceFunction. I keep reading about "enabling an R*-tree index with STR bulk loading" in order to see vast improvements in performance. The tutorials haven't helped me much. Any tips on how I can implement this feature? Answer 1: The suggested parameters for using a spatial R* index on 2 …

Why are the clusters' word frequencies so small in a big dataset?

柔情痞子 submitted on 2019-12-25 12:56:12
Question: Referring to the question answered by @holzben, "Clustering: how to extract most distinguishing features?": using the SK-Means package, I managed to get the clusters. I couldn't figure out why the word frequencies in all clusters are so small. It didn't make sense to me, as I have about 10,000 tweets in my dataset. What am I doing wrong? My dataset is available at https://docs.google.com/a/siswa.um.edu.my/file/d/0B3-xuXnLwF0yTHAzbE5KbTlQWWM/edit

> class(myCorpus)
[1] "VCorpus" "Corpus" "list"
> dtm< …
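
A useful sanity check for this kind of problem is that per-cluster term frequencies are just column sums of the document-term matrix over that cluster's rows, so the cluster totals must add up to the corpus total. A Python sketch of that check (the original is R with the tm package; the toy corpus below is invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy stand-in for the tweet corpus in the question.
tweets = ["rain today cold", "sunny warm beach", "cold rain wind",
          "beach sunny surf", "wind cold storm", "surf beach warm"] * 50

X = CountVectorizer().fit_transform(tweets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Per-cluster term frequencies: column sums over that cluster's rows.
# If these come out tiny relative to the corpus size, the document-term
# matrix was probably built on the wrong object or heavily sparsified,
# not a property of the clustering itself.
for c in range(2):
    freq = np.asarray(X[km.labels_ == c].sum(axis=0)).ravel()
    print(c, int(freq.sum()))
```

If the per-cluster sums here did not add up to the total token count, the frequencies would have been computed on something other than the full matrix, which is the most likely cause of the symptom described.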

Carrot2 workbench not able to process large data

空扰寡人 submitted on 2019-12-25 04:25:02
Question: I wanted to cluster my data set using the Carrot2 Workbench. I have an input XML file with 65,536 documents, and I am using the Lingo clustering algorithm. But when I start the process, the workbench returns a result within a few seconds, with all the documents in the "Other Topics" cluster. I have checked the clustering with smaller data sets, and I do get results. Answer 1: The Carrot2 Lingo algorithm was designed for small data sets, up to a thousand or so documents. For larger data sets, you may …

clustering 3D array in R

别说谁变了你拦得住时间么 submitted on 2019-12-25 02:55:13
Question: I'm trying to cluster 3D data that I have in an array. It's actually information from a 3D image, so this array represents a single image with x, y, z values. I would like to know which voxels tend to cluster together. The array looks like this:

> dim(x)
[1] 34 34 34 1

How can I go about this? I tried just plotting with scatterplot3d, but it did not work. Answer 1: So this is an attempt at clustering. You really should provide data if you want a better answer.

library(reshape2) # for melt(...)
library …
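
The quoted answer's melt(...) step turns the 3D array into one row per voxel, which is what any standard clustering routine needs. A Python sketch of the same reshaping (the random array stands in for the asker's image, and k = 4 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.normal(size=(34, 34, 34))        # stand-in for the 34x34x34 image

# "Melt" the array into one row per voxel: (i, j, k, intensity) --
# the same reshaping R's melt(...) performs in the quoted answer.
idx = np.indices(x.shape).reshape(3, -1).T
voxels = np.column_stack([idx, x.ravel()])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(voxels)
labels = km.labels_.reshape(x.shape)     # one cluster id per voxel
print(labels.shape)
```

Whether to include the spatial coordinates alongside the intensity (as here) or cluster on intensity alone depends on whether spatially contiguous clusters are wanted; scaling the columns relative to each other controls that trade-off.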

Is sklearn.cluster.KMeans sensitive to data point order?

若如初见. submitted on 2019-12-24 19:30:03
Question: As noted in the answer to this post about feature scaling, some (all?) implementations of KMeans are sensitive to the order of the data points. Based on the sklearn.cluster.KMeans documentation, n_init only changes the initial positions of the centroids. This would mean that one must loop over a few shuffles of the data points to test whether this is a problem. My questions are as follows: Is the scikit-learn implementation sensitive to ordering, as the post suggests? Does n_init take …
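
The shuffle-and-compare test the asker describes can be run directly; the sketch below (with invented well-separated blobs) fits on the original order and on a shuffled copy, then compares the two partitions with the adjusted Rand index, which is 1.0 when the clusterings agree regardless of label permutation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Three well-separated blobs as illustrative data.
X = np.concatenate([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

# Fit once on the original row order, once on a shuffled copy.
# random_state is held fixed so row order is the only varying factor.
km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
perm = rng.permutation(len(X))
km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[perm])

# Reindex the first labelling to the shuffled order before comparing.
score = adjusted_rand_score(km1.labels_[perm], km2.labels_)
print(score)
```

On cleanly separated data both fits converge to the same partition; order sensitivity, when it appears, shows up on ambiguous data where different initializations reach different local optima, which is exactly what a higher n_init is meant to mitigate.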

Hadoop Mahout Clustering

╄→гoц情女王★ submitted on 2019-12-24 17:25:35
Question: I am trying to apply canopy clustering in Mahout. I have already converted a text file into a sequence file, but I cannot view the sequence file. Anyway, I tried applying canopy clustering with the following command:

hduser@ubuntu:/usr/local/mahout/trunk$ mahout canopy -i /user/Hadoop/mahout_seq/seqdata -o /user/Hadoop/clustered_data -t1 5 -t2 3

I got the following error:

16/05/10 17:02:03 INFO mapreduce.Job: Task Id : attempt_1462850486830_0008_m_000000_1, Status : FAILED
Error: java.lang …