cluster-analysis

Clustering: how to find the nearest cluster

喜夏-厌秋 submitted on 2019-12-25 18:44:36
Question: Hints I got on a different question puzzled me quite a bit. I was given an exercise, actually part of a larger exercise: (1) cluster some data using hclust (done); (2) given a totally new vector, find out which of the clusters from step 1 it is nearest to. According to the exercise, this should take quite a short time. However, after weeks I am puzzled whether this can be done at all, as apparently all I really get from hclust is a tree, and not, as I assumed, a number of clusters. As I suppose …
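
One common resolution, sketched below in Python with scipy's hierarchical-clustering tools (a rough analogue of R's hclust and cutree; the data, the linkage method, and k = 3 are made-up stand-ins): cut the tree into k flat clusters, then assign the new vector to the cluster whose centroid is nearest.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 4))          # stand-in for the asker's data

# Hierarchical clustering (scipy's analogue of R's hclust), then cut the
# tree into k flat clusters (the analogue of R's cutree).
Z = linkage(data, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

# Assign a brand-new vector to the nearest cluster by centroid distance.
centroids = np.array([data[labels == k].mean(axis=0)
                      for k in np.unique(labels)])
new_vec = rng.normal(size=4)
nearest = np.argmin(np.linalg.norm(centroids - new_vec, axis=1)) + 1
print(nearest)   # 1-based cluster label, matching fcluster's labels
```

The centroid step is the key point: hclust only returns a dendrogram, so "the clusters" exist only after you cut the tree, and any nearest-cluster query needs a per-cluster summary (here the mean) to compare against.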

Using scipy kmeans for cluster analysis

拟墨画扇 submitted on 2019-12-25 17:45:18
Question: I want to understand scipy.cluster.vq.kmeans. Given a number of points distributed in 2D space, the problem is to group them into clusters. This problem came to my attention while reading this question, and I was thinking that scipy.cluster.vq.kmeans would be the way to go. This is the data: Using the following code, the aim would be to get the center point of each of the 25 clusters.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import vq, kmeans, whiten
pos = np.arange(0,20 …
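
The question's data is truncated, but the general pattern with scipy.cluster.vq can be sketched as follows (the 25 synthetic blobs below are a made-up stand-in for the asker's grid of points): whiten the features, run kmeans, then undo the whitening to recover centers in the original coordinates.

```python
import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten

rng = np.random.default_rng(1)
# Synthetic stand-in: 25 tight blobs centered on a 5x5 grid.
centers_true = np.array([[i, j] for i in range(0, 20, 4)
                                for j in range(0, 20, 4)], dtype=float)
pts = np.concatenate([c + rng.normal(scale=0.3, size=(40, 2))
                      for c in centers_true])

# whiten() rescales each feature to unit variance, which scipy's
# kmeans expects; the codebook it returns lives in whitened space.
w = whiten(pts)
codebook, distortion = kmeans(w, 25)

# Map each point to its nearest center, then undo the whitening to get
# the centers back in the original coordinates.
labels, _ = vq(w, codebook)
centers = codebook * pts.std(axis=0)
print(centers.shape)
```

Note that scipy's kmeans can return fewer than k centroids if a cluster empties out during iteration, so the codebook size is worth checking rather than assuming.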

Select the most dissimilar individual using cluster analysis

北慕城南 submitted on 2019-12-25 17:38:06
Question: I want to cluster my data into, say, 5 clusters, and then select the 50 individuals with the most dissimilar relationship from all the data. That means if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth. Data example:

mydata <- matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))

What I have done so far is cluster the …
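
The described scheme is proportional stratified sampling over clusters. A minimal Python sketch (the original example is R; here kmeans2 stands in for the asker's unspecified clustering step, and "most dissimilar" is interpreted as farthest from the cluster centroid, which is an assumption):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
mydata = rng.normal(size=(100, 10))      # mirrors the R example's 100x10 matrix

k, n_select = 5, 50
centroids, labels = kmeans2(mydata, k, minit="++", seed=3)

# From each cluster take a share proportional to its size; within a
# cluster, prefer the points farthest from the centroid ("most dissimilar").
chosen = []
for c in range(k):
    idx = np.flatnonzero(labels == c)
    take = round(n_select * len(idx) / len(mydata))
    dist = np.linalg.norm(mydata[idx] - centroids[c], axis=1)
    chosen.extend(idx[np.argsort(dist)[::-1][:take]])
print(len(chosen))   # close to 50; per-cluster rounding can shift it slightly
```

Because the per-cluster quotas are rounded independently, the total can land a point or two off 50; a final top-up or trim step would fix that exactly.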

How to index with ELKI - OPTICS clustering

跟風遠走 submitted on 2019-12-25 14:24:10
Question: I'm an ELKI beginner, and I've been using it to cluster around 10K lat-lon points from a .csv file. Once I get my settings correct, I'd like to scale up to 1MM points. I'm using the OPTICSXi algorithm with LngLatDistanceFunction. I keep reading about "enabling an R*-tree index with STR bulk loading" in order to see vast improvements in performance. The tutorials haven't helped me much. Any tips on how I can implement this feature? Answer 1: The suggested parameters for using a spatial R* index on 2 …

Why are the clusters' word frequencies so small in a big dataset?

柔情痞子 submitted on 2019-12-25 12:56:12
Question: Referring to the question answered by @holzben, "Clustering: how to extract most distinguishing features?": using the SK-Means package, I managed to get the clusters. I couldn't figure out why the word frequencies in all clusters are so small. It didn't make sense to me, as I have about 10,000 tweets in my dataset. What am I doing wrong? My dataset is available at https://docs.google.com/a/siswa.um.edu.my/file/d/0B3-xuXnLwF0yTHAzbE5KbTlQWWM/edit

> class(myCorpus)
[1] "VCorpus" "Corpus" "list"
> dtm< …
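
A useful sanity check for this kind of problem is that per-cluster term frequencies are just column sums of the document-term matrix over that cluster's rows, so the cluster totals must add up to the corpus total. A Python sketch of that check (the original is R with the tm package; the toy corpus below is invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy stand-in for the tweet corpus in the question.
tweets = ["rain today cold", "sunny warm beach", "cold rain wind",
          "beach sunny surf", "wind cold storm", "surf beach warm"] * 50

X = CountVectorizer().fit_transform(tweets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Per-cluster term frequencies: column sums over that cluster's rows.
# If these come out tiny relative to the corpus size, the document-term
# matrix was probably built on the wrong object or heavily sparsified,
# not a property of the clustering itself.
for c in range(2):
    freq = np.asarray(X[km.labels_ == c].sum(axis=0)).ravel()
    print(c, int(freq.sum()))
```

If the per-cluster sums here did not add up to the total token count, the frequencies would have been computed on something other than the full matrix, which is the most likely cause of the symptom described.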

Carrot2 workbench not able to process large data

空扰寡人 submitted on 2019-12-25 04:25:02
Question: I wanted to cluster my data set using the Carrot2 Workbench. I have an input XML file with 65,536 documents, and I am using the Lingo clustering algorithm. But when I start the process, the workbench returns a result within a few seconds, with all the documents in the "Other Topics" cluster. I have checked the clustering with smaller data sets, and I do get results. Answer 1: The Carrot2 Lingo algorithm was designed for small data sets, up to a thousand or so documents. For larger data sets, you may …

clustering 3D array in R

别说谁变了你拦得住时间么 submitted on 2019-12-25 02:55:13
Question: I'm trying to cluster 3D data that I have in an array. It's actually information from a 3D image, so this array represents a single image with x, y, z values. I would like to know which voxels tend to cluster together. The array looks like this:

> dim(x)
[1] 34 34 34 1

How can I go about this? I tried just plotting with scatterplot3d, but it did not work. Answer 1: So this is an attempt at clustering. You really should provide data if you want a better answer.

library(reshape2) # for melt(...)
library …
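
The quoted answer's melt(...) step turns the 3D array into one row per voxel, which is what any standard clustering routine needs. A Python sketch of the same reshaping (the random array stands in for the asker's image, and k = 4 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.normal(size=(34, 34, 34))        # stand-in for the 34x34x34 image

# "Melt" the array into one row per voxel: (i, j, k, intensity) --
# the same reshaping R's melt(...) performs in the quoted answer.
idx = np.indices(x.shape).reshape(3, -1).T
voxels = np.column_stack([idx, x.ravel()])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(voxels)
labels = km.labels_.reshape(x.shape)     # one cluster id per voxel
print(labels.shape)
```

Whether to include the spatial coordinates alongside the intensity (as here) or cluster on intensity alone depends on whether spatially contiguous clusters are wanted; scaling the columns relative to each other controls that trade-off.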

Is sklearn.cluster.KMeans sensitive to data point order?

若如初见. submitted on 2019-12-24 19:30:03
Question: As noted in the answer to this post about feature scaling, some (all?) implementations of KMeans are sensitive to the order of the data points. Based on the sklearn.cluster.KMeans documentation, n_init only changes the initial positions of the centroids. This would mean that one must loop over a few shuffles of the data points to test whether this is a problem. My questions are as follows: Is the scikit-learn implementation sensitive to ordering, as the post suggests? Does n_init take …
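
The shuffle-and-compare test the asker describes can be run directly; the sketch below (with invented well-separated blobs) fits on the original order and on a shuffled copy, then compares the two partitions with the adjusted Rand index, which is 1.0 when the clusterings agree regardless of label permutation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Three well-separated blobs as illustrative data.
X = np.concatenate([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

# Fit once on the original row order, once on a shuffled copy.
# random_state is held fixed so row order is the only varying factor.
km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
perm = rng.permutation(len(X))
km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[perm])

# Reindex the first labelling to the shuffled order before comparing.
score = adjusted_rand_score(km1.labels_[perm], km2.labels_)
print(score)
```

On cleanly separated data both fits converge to the same partition; order sensitivity, when it appears, shows up on ambiguous data where different initializations reach different local optima, which is exactly what a higher n_init is meant to mitigate.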

Hadoop Mahout Clustering

╄→гoц情女王★ submitted on 2019-12-24 17:25:35
Question: I am trying to apply canopy clustering in Mahout. I have already converted a text file into a sequence file, but I cannot view the sequence file. Anyway, I tried applying canopy clustering with the following command:

hduser@ubuntu:/usr/local/mahout/trunk$ mahout canopy -i /user/Hadoop/mahout_seq/seqdata -o /user/Hadoop/clustered_data -t1 5 -t2 3

I got the following error:

16/05/10 17:02:03 INFO mapreduce.Job: Task Id : attempt_1462850486830_0008_m_000000_1, Status : FAILED
Error: java.lang …