cluster-analysis

Cosine distance as vector distance function for k-means

我只是一个虾纸丫 submitted on 2019-12-03 11:52:16
I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each with N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. E.g. for the graph, the vector v1 = {100, 50, 0, 30, 0} would mean that we spent 100 secs at vertex 1, 50 secs at vertex 2, and 30 secs at vertex 4 (vertices 3 & 5 were not visited, hence the 0s). I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine similarity is cos(a, b) = (a . b) / (||a|| ||b||).
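Since standard k-means minimizes squared Euclidean distance rather than arbitrary metrics, a common workaround is "spherical" k-means: L2-normalize each vector and run ordinary k-means, because on unit vectors squared Euclidean distance equals twice the cosine distance. A minimal Python sketch, assuming scikit-learn and vectors like the v1 above:

import numpy as np
from sklearn.cluster import KMeans

# Example visit-duration vectors (seconds), one row per user.
X = np.array([[100, 50,  0, 30,  0],
              [ 90, 60,  0, 20,  0],
              [  0,  0, 40,  0, 70]], dtype=float)

def cosine_distance(a, b):
    # cosine_distance = 1 - cosine_similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Spherical k-means: on unit vectors, ||a - b||^2 = 2 * (1 - cos(a, b)),
# so Euclidean k-means on the normalized rows approximates cosine k-means.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
print(labels)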

In R, how can I plot a similarity matrix (like a block graph) after clustering data?

自古美人都是妖i submitted on 2019-12-03 11:17:16
Question: I want to produce a graph that shows the correlation between clustered data and a similarity matrix. How can I do this in R? Is there any function in R that creates a graph like the picture at this link? http://bp0.blogger.com/_VCI4AaOLs-A/SG5H_jm-f8I/AAAAAAAAAJQ/TeLzUEWbb08/s400/Similarity.gif (just googled and got a link that shows the kind of graph I want to produce) Thanks in advance.

Answer 1: The general solutions suggested in the comments by @Chase and @bill_080 need a little bit of enhancement
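One common recipe, as a minimal R sketch (the matrix sim here is built from random data purely for illustration): cluster on the distance induced by the similarities, then draw the matrix with image(), reordering rows and columns by the dendrogram so the blocks become visible:

set.seed(1)
x   <- matrix(rnorm(100), nrow = 10)
sim <- cor(t(x))                    # 10 x 10 similarity matrix
hc  <- hclust(as.dist(1 - sim))     # cluster on the induced distance
ord <- hc$order                     # leaf order from the dendrogram
image(sim[ord, ord], col = heat.colors(12), axes = FALSE,
      main = "Similarity matrix after clustering")

The base-R heatmap() function (or the pheatmap package) does the reordering and dendrogram drawing in a single call.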

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

岁酱吖の submitted on 2019-12-03 11:14:50
I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute

k <- kmeans(norm, centers = 3)

I receive the following error:

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

Can you help me? Thank you!

kmeans cannot handle data that has NA values. The mean and variance are then no longer well defined, and you no longer know which center is closest.

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

This error also occurs when non-numeric values are present in the table. all of
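A minimal R sketch of the usual diagnosis, assuming norm is the asker's table: coerce everything to numeric and drop any rows containing NA, NaN, or Inf before calling kmeans:

norm_num <- data.matrix(norm)                  # coerce all columns to numeric
rows_ok  <- apply(is.finite(norm_num), 1, all) # FALSE where NA/NaN/Inf occur
k <- kmeans(norm_num[rows_ok, , drop = FALSE], centers = 3)

Note that data.matrix() silently turns factor and character columns into numeric codes, so it is worth inspecting which columns were non-numeric (e.g., with str(norm)) before clustering.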

Which programming structure for clustering algorithm

与世无争的帅哥 submitted on 2019-12-03 09:19:17
Question: I am trying to implement the following (divisive) clustering algorithm (below is a short form of the algorithm; the full description is available here): Start with a sample x_i, i = 1, ..., n, regarded as a single cluster of n data points, and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster. First determine the distance between all pairs of data points and choose the pair with the largest distance (Dmax) between them
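A minimal Python sketch of the splitting step just described (the function name and the seed-based reassignment are illustrative, not taken from the full paper):

import numpy as np

def split_cluster(D, members, T):
    # D: full n x n dissimilarity matrix; members: indices in this cluster.
    sub = D[np.ix_(members, members)]
    i, j = np.unravel_index(np.argmax(sub), sub.shape)
    if sub[i, j] <= T:                      # Dmax below threshold: no split
        return [members]
    a, b = members[i], members[j]           # the two most distant points
    left  = [m for m in members if D[m, a] <= D[m, b]]
    right = [m for m in members if D[m, a] >  D[m, b]]
    return [left, right]

Applying split_cluster recursively to each returned part until no cluster splits any further yields the divisive hierarchy, so a natural programming structure for the whole algorithm is a work list (or plain recursion) over lists of point indices.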

Algorithm to decide cut-off for collapsing this tree?

家住魔仙堡 submitted on 2019-12-03 09:06:52
Question: I have a Newick tree that is built by comparing the similarity (Euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of putative DNA regulatory motifs, which are 4-9 bp long DNA sequences. An interactive version of the tree is up on iTOL (here), which you can freely play with; just press "update tree" after setting your parameters. My specific goal: to collapse the motifs (tips/terminal nodes/leaves) together if their average distance to the nearest parent clade is < X (ETE2 Python
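A minimal sketch with the ete3 package (the successor of ETE2, with a near-identical API), collapsing every internal node whose mean distance to its leaves falls below an illustrative threshold X; the tiny Newick string is a stand-in for the motif tree:

from ete3 import Tree

X = 0.5                                         # illustrative cut-off
t = Tree("((A:0.1,B:0.2):0.6,(C:0.3,D:0.2):0.4);")

for node in list(t.traverse("postorder")):      # snapshot before editing
    if node.is_leaf() or node.is_root():
        continue
    leaves = node.get_leaves()
    mean_dist = sum(node.get_distance(l) for l in leaves) / len(leaves)
    if mean_dist < X:
        node.name = "+".join(l.name for l in leaves)  # merged motif label
        for child in node.get_children():
            child.detach()                      # node becomes a leaf

print(t.write(format=1))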

Plotting the boundaries of cluster zone in Python with scikit package

杀马特。学长 韩版系。学妹 submitted on 2019-12-03 08:42:26
Here is my simple example of dealing with data clustering on 3 attributes (x, y, value). Each sample represents its location (x, y) and its associated value. My code is posted here:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

x = np.arange(100, 200, 1)
y = np.arange(100, 200, 1)
value = np.random.random(100*100)
xx, yy = np.meshgrid(x, y)
xx = xx.reshape(100*100)
yy = yy.reshape(100*100)
j = np.dstack((xx, yy, value))[0, :, :]

fig = plt.figure(figsize=(12, 4))
ax1 = plt.subplot(121)
xi, yi = np.meshgrid(x, y)
va = value.reshape(100, 100)
pc = plt.pcolormesh(xi, yi, va, cmap=plt.cm.Spectral)
plt.colorbar(pc)
ax2 = plt.subplot(122)
y_pred = KMeans(n_clusters
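One way to finish the example, as a sketch: fit KMeans on the stacked (x, y, value) features, reshape the labels back onto the grid, and overlay a contour plot, which draws lines exactly where the label changes, i.e. the zone boundaries (n_clusters=4 is an arbitrary choice here):

y_pred = KMeans(n_clusters=4, n_init=10).fit_predict(j)
label_grid = y_pred.reshape(100, 100)
plt.pcolormesh(xi, yi, label_grid, cmap=plt.cm.Set3)       # the zones
plt.contour(xi, yi, label_grid, colors="k", linewidths=1)  # their borders
plt.show()

Because the raw coordinates (100-200) and the values (0-1) live on very different scales, the location terms will dominate the clustering unless the columns are rescaled first.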

Is Triangle inequality necessary for kmeans?

扶醉桌前 submitted on 2019-12-03 08:33:25
I wonder whether the triangle inequality is necessary for the distance measure used in k-means.

k-means is designed for Euclidean distance, which happens to satisfy the triangle inequality. Using other distance functions is risky, as the algorithm may stop converging. The reason, however, is not the triangle inequality, but that the mean might not minimize the distance function. (The arithmetic mean minimizes the sum of squared distances, not arbitrary distances!) There are faster methods for k-means that exploit the triangle inequality to avoid recomputations. But if you stick to classic MacQueen or Lloyd k-means, then you do not
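A small numeric check of the point about the mean, as a Python sketch: the arithmetic mean minimizes the sum of squared distances, while for Manhattan (L1) distance the minimizer is the median, which is why L1 variants (k-medians) change the update step rather than just the distance function:

import numpy as np

pts = np.array([0.0, 1.0, 10.0])
cand = np.linspace(0, 10, 10001)
sq = [((pts - c) ** 2).sum() for c in cand]   # sum of squared distances
l1 = [np.abs(pts - c).sum() for c in cand]    # sum of absolute distances
print(cand[np.argmin(sq)])  # ~3.667, the mean of pts
print(cand[np.argmin(l1)])  # 1.0, the median of pts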

hierarchical clustering on correlations in Python scipy/numpy?

三世轮回 submitted on 2019-12-03 08:26:41
Question: How can I run hierarchical clustering on a correlation matrix in scipy/numpy? I have a matrix of 100 rows by 9 columns, and I'd like to hierarchically cluster the entries by their correlations across the 9 conditions. I'd like to use 1 - Pearson correlation as the distance for clustering. Assuming I have a numpy array X that contains the 100 x 9 matrix, how can I do this? I tried using hcluster, based on this example:

Y = pdist(X, 'seuclidean')
Z = linkage(Y, 'single')
dendrogram(Z, color_threshold
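scipy's pdist already ships a 'correlation' metric that computes exactly 1 - Pearson correlation between rows, so the attempt above only needs the metric swapped out; a minimal sketch:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(100, 9)          # stand-in for the 100 x 9 matrix
Y = pdist(X, metric='correlation')  # 1 - Pearson r for every row pair
Z = linkage(Y, method='average')    # 'single' works too, as in the attempt
dendrogram(Z, color_threshold=0.7)  # 0.7 is an arbitrary example threshold
plt.show()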

How to use 'hclust' as function call in R

久未见 submitted on 2019-12-03 07:58:48
Question: I tried to construct the clustering method as a function in the following way:

mydata <- mtcars

# Here I construct hclust as a function
hclustfunc <- function(x) hclust(as.matrix(x), method="complete")

# Define distance metric
distfunc <- function(x) as.dist((1-cor(t(x)))/2)

# Obtain distance
d <- distfunc(mydata)

# Call that hclust function
fit <- hclustfunc(d)

# Later I'd do
# plot(fit)

But why does it give the following error?

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed
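The likely cause, shown as a minimal sketch: hclust() expects a "dist" object, but as.matrix() converts the dist back into a plain matrix, so hclust cannot recover the number of observations and its size check fails. Passing the dist through unchanged avoids the error:

mydata <- mtcars
distfunc   <- function(x) as.dist((1 - cor(t(x))) / 2)
hclustfunc <- function(d) hclust(d, method = "complete")
fit <- hclustfunc(distfunc(mydata))
plot(fit)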

Effective clustering of a similarity matrix

喜欢而已 submitted on 2019-12-03 07:53:25
My topic is similarity and clustering of (a bunch of) texts. In a nutshell: I want to cluster collected texts together, and they should appear in meaningful clusters at the end. To do this, my approach so far is as follows; my problem is in the clustering. The current software is written in PHP.

1) Similarity: I treat every document as a "bag of words" and convert words into vectors. I use
- filtering (only "real" words)
- tokenization (split sentences into words)
- stemming (reduce words to their base form; Porter's stemmer)
- pruning (cut off words with too high and too low frequency)
as methods for
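A minimal sketch of that preprocessing pipeline (in Python rather than the asker's PHP, with a crude suffix-stripper standing in for a real Porter stemmer) to make the vector construction concrete:

import re
from collections import Counter

docs = ["the cats sat quietly", "a cat and a dog", "the dogs barked"]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())   # filtering + tokenization

def stem(word):
    # crude stand-in for Porter stemming: strip a few common suffixes
    for suf in ("ing", "ed", "ly", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

tokens = [[stem(w) for w in tokenize(d)] for d in docs]
df = Counter(w for doc in tokens for w in set(doc))       # document frequency
vocab = sorted(w for w, c in df.items() if c < len(docs)) # prune ubiquitous words
vectors = [[doc.count(w) for w in vocab] for doc in tokens]
print(vocab)
print(vectors)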