cluster-analysis

Optimal way to cluster set of strings with hamming distance [duplicate]

隐身守侯 submitted on 2019-12-06 16:46:29
Question: This question already has answers here: Fast computation of pairs with least hamming distance (1 answer); Finding Minimum hamming distance of a set of strings in python (4 answers). Closed 4 years ago. I have a database with n strings (n > 1 million); each string has 100 chars, and each char is either a, b, c, or d. I would like to find the closest strings for each one, where closest is defined as having the smallest Hamming distance. I would like to find the k nearest strings for each one (k < 5).
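A brute-force sketch of the core computation in Python (function names are illustrative): exact all-pairs search is O(n²) in the number of strings, so for n > 1 million an index such as multi-index hashing would be needed, but the per-pair Hamming distance itself is just a vectorized comparison:

    import numpy as np

    def encode(strings):
        # Map each 100-char string over {a, b, c, d} to a row of byte codes.
        joined = ''.join(strings).encode('ascii')
        return np.frombuffer(joined, dtype=np.uint8).reshape(len(strings), -1)

    def k_nearest(codes, i, k):
        # Hamming distance = number of positions where the characters differ.
        dists = (codes != codes[i]).sum(axis=1)
        dists[i] = dists.max() + 1      # exclude the query string itself
        return np.argpartition(dists, k)[:k]

    strings = ["abcd" * 25, "abca" * 25, "dcba" * 25]
    print(k_nearest(encode(strings), 0, k=2))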

Should one use distances (dissimilarities) or similarities in R for clustering?

爱⌒轻易说出口 submitted on 2019-12-06 14:55:31
I'm working on a clustering problem, and the proxy package in R provides both dist and simil functions. For my purpose I need a distance matrix, so I initially used dist; here's the code:

    distanceMatrix <- dist(dfm[,-1], method='Pearson')
    clusters <- hclust(distanceMatrix)
    clusters$labels <- dfm[,1] # colnames(dfm)[-1]
    plot(clusters, labels=clusters$labels)

But after I plotted the image I found that the clustering result is not the way I expected it to be, since I know what it should look like. So I tried simil instead, and the code is like:

    distanceMatrix <- simil(dfm[,-1], method='Pearson')
    clusters <-
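For what it's worth, the usual convention is that hierarchical clustering expects dissimilarities, and a Pearson similarity s converts to a distance via d = 1 - s. A small Python sketch of that relationship (scipy's 'correlation' metric is exactly 1 minus the Pearson correlation):

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    X = np.random.rand(5, 10)
    d = pdist(X, metric='correlation')   # 1 - Pearson correlation, in [0, 2]
    s = 1.0 - d                          # the corresponding similarities
    Z = linkage(d, method='average')     # linkage expects distances, like hclust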

Output from 'choice' in R's kml

冷暖自知 submitted on 2019-12-06 14:49:46
I'm having trouble getting 'choice' to produce output. When the graphical interface launches, I select a partition with the space bar. This creates a black circle around the partition, indicating it has been selected. When I press 'return', nothing happens. I checked my working directory for the output files, but they are not there. I used getwd() to confirm that my working directory is set correctly. No dice. There was a similar question posted: Exporting result from kml package in R; however, the answer does not work for me. Any suggestions? I am using R 3.1.0 GUI Mavericks build (6734) and

Hierarchical Agglomerative clustering in Spark

佐手、 submitted on 2019-12-06 14:16:55
I am working on a clustering problem and it has to scale to a lot of data. I would like to try hierarchical clustering in Spark and compare my results with other methods. I have done some research on the web about using hierarchical clustering with Spark but haven't found any promising information. If anyone has some insight about it, I would be very grateful. Thank you. Gabe Church: The bisecting k-means approach seems to do a decent job, and runs quite fast in terms of performance. Here is sample code I wrote for using the Bisecting K-Means algorithm in Spark (Scala) to get cluster
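The answer's code is in Scala; the same algorithm is also exposed in PySpark, so here is a minimal Python sketch (the app name and toy data are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import BisectingKMeans
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("bisecting-kmeans-demo").getOrCreate()
    data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
            (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
    df = spark.createDataFrame(data, ["features"])

    model = BisectingKMeans(k=2, seed=1).fit(df)   # splits clusters top-down
    print(model.clusterCenters())
    model.transform(df).show()   # adds a 'prediction' column with cluster ids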

Document Clustering Basics

丶灬走出姿态 submitted on 2019-12-06 14:02:28
Question: So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild... My questions stem from the process of clustering documents. Let's say I start off with a collection of documents containing only interesting words. What is the first step here? Parse the words from each document and create a giant 'bag-of-words' model? Do I then proceed to create vectors of word counts for each document? How
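That is indeed the usual pipeline: tokenize, build a term-document count (or tf-idf) matrix, then cluster the resulting document vectors. A minimal sketch with scikit-learn (an assumption; the question names no library):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["the cat sat on the mat",
            "dogs and cats are pets",
            "stock markets fell sharply",
            "investors sold shares today"]
    X = TfidfVectorizer().fit_transform(docs)   # documents x vocabulary matrix
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)                               # one cluster id per document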

Plotting Clusters using clusplot with coordinates centered around 0

…衆ロ難τιáo~ submitted on 2019-12-06 13:35:37
I am trying to plot GIS coordinates, specifically UK National Grid coordinates, whose eastings and northings resemble: 194630000 562220000. I can plot these using clusplot in the cluster library:

    clusplot(df2, k.means.fit$cluster, main=i, color=TRUE, shade=FALSE, labels=0, lines=0, bty="7")

where df2 is my data frame and k.means.fit is the result of the k-means analysis on df2. Note that the coordinates of the centers after the k-means analysis have not been normalised:

    k.means.fit$centers
    # Grid.Ref.Northing Grid.Ref.Easting
    #1 206228234 581240726

But when I plot the clusters, all the points are
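One thing to check: clusplot draws a bivariate representation of the data (principal components when there are more than two variables), and component scores are mean-centered, which would explain coordinates clustered around 0. If the goal is simply to see the clusters on the original easting/northing scale, a plain scatter plot avoids the projection entirely; a Python sketch of that idea (values adapted from the question):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    coords = np.array([[194630000, 562220000], [194640000, 562230000],
                       [206228234, 581240726], [206230000, 581250000]], float)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
    plt.scatter(coords[:, 0], coords[:, 1], c=km.labels_)   # points by cluster
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker='x')
    plt.show()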

Dumping clustering results with vector names

回眸只為那壹抹淺笑 submitted on 2019-12-06 11:14:08
I have created my Vectors as described in this question and have run mahout kmeans on the data. Since I'm using Mahout 0.7, the clusterdump command didn't work as described in Mahout in Action, but I got it to work like this:

    export HADOOP_CLASSPATH=/path/to/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar:/path/to/mahout-distribution-0.7/integration/target/mahout-integration-0.7.jar
    hadoop jar core/target/mahout-core-0.7-job.jar org.apache.mahout.utils.clustering.ClusterDumper \
        -i /clustering/out/clusters-20-final -o textout -of TEXT

and I am getting lines like this one: VL-1383471

Number clustering/partitioning algorithm

让人想犯罪 __ submitted on 2019-12-06 10:57:05
I have an ordered 1-D array of numbers. Both the array length and the values of the numbers in the array are arbitrary. I want to partition the array into k partitions according to the number values, e.g. let's say I want 4 partitions, distributed as 30% / 30% / 20% / 20%, i.e. the top 30% of values first, the next 30% afterwards, etc. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it should not be contained in two different partitions. This means that the distribution percentages above are not strict, but rather
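A straightforward way to honour both constraints is to place each boundary at the nearest percentile index and then push it forward past any run of equal values, which makes the percentages approximate exactly as the question allows. A Python sketch (assuming an ascending array and percentages summing to 100; reverse the input for top-values-first):

    def partition_by_percent(sorted_vals, percents):
        # sorted_vals: ascending list; percents: target share of each partition.
        n = len(sorted_vals)
        parts, start, cum = [], 0, 0.0
        for p in percents[:-1]:
            cum += p
            cut = round(n * cum / 100.0)
            # Advance the cut past any run of equal values so duplicates
            # never straddle two partitions.
            while 0 < cut < n and sorted_vals[cut] == sorted_vals[cut - 1]:
                cut += 1
            cut = max(cut, start)
            parts.append(sorted_vals[start:cut])
            start = cut
        parts.append(sorted_vals[start:])
        return parts

    print(partition_by_percent([1, 2, 2, 2, 3, 5, 8, 9, 9, 10],
                               [30, 30, 20, 20]))
    # -> [[1, 2, 2, 2], [3, 5], [8, 9, 9], [10]]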

How to evaluate the best K for LDA using Mallet?

给你一囗甜甜゛ submitted on 2019-12-06 09:55:04
Question: I am using the Mallet API to extract topics from Twitter data, and the topics I have extracted so far seem good. But I am having trouble estimating K. For example, I have varied K from 10 to 100, so I have taken different numbers of topics from the data, but now I would like to estimate which K is best. Some criteria I know of are: perplexity, empirical likelihood, marginal likelihood (harmonic mean method), and silhouette. I found a method model.estimate() which may be used to
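To illustrate just the perplexity criterion from that list — this is not Mallet's API, but the same idea sketched with scikit-learn — fit a model for each candidate K and prefer the one with the lowest perplexity, ideally measured on held-out documents:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["apple banana fruit", "banana orange fruit",
            "cpu gpu hardware", "gpu memory hardware"]
    X = CountVectorizer().fit_transform(docs)
    for k in (2, 3, 4):
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        print(k, lda.perplexity(X))   # lower is better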

Python: computing pairwise distances causes memory error

▼魔方 西西 submitted on 2019-12-06 09:17:34
I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances:

    from scipy.spatial.distance import pdist
    pairwise_distances = pdist(X, 'cosine')
    # pdist is supposed to return a numpy array with shape (57832*57831/2,)

However, this causes a memory error:

    Traceback (most recent call last):
      File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
        result_clustering = clf_clustering.getCVResult(shuffle)
      File "/home/munichong/git/DomainClassification/NameSuggestion
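The condensed matrix holds 57832 · 57831 / 2 ≈ 1.7 billion float64 values, over 13 GB, which explains the MemoryError. One common fix (a sketch assuming scikit-learn is available) is to stream the distance matrix in chunks and keep only what is actually needed, e.g. each row's nearest neighbours:

    import numpy as np
    from sklearn.metrics import pairwise_distances_chunked

    X = np.random.rand(5000, 200)   # stand-in for the real 57832 x 200 matrix

    def reduce_func(D_chunk, start):
        # Keep only the 5 nearest neighbours of each row; after sorting,
        # position 0 is the row itself (distance 0), so skip it.
        return np.argsort(D_chunk, axis=1)[:, 1:6]

    neighbors = np.vstack(list(pairwise_distances_chunked(
        X, metric='cosine', reduce_func=reduce_func, working_memory=512)))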