cluster-analysis | 易学教程

How to cluster search engine keywords?

Question: From Google Analytics I have a (long) list of keywords that people used in search engines to find my website. I want to find the 'core keywords'. Hypothetical example: java online training learning java scala training training for java online training java learn scala programming. The ideal result would be: 'java', 'online training', 'training', 'scala' and 'learn'. The difficulty seems to be detecting complete phrases, ignoring common words ('for') and handling variations ('learn'/'learning'). Is
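The three difficulties named in this question (phrases, common words, word variations) can be illustrated with a minimal sketch that counts stemmed unigrams and bigrams. Everything here is an assumption made for illustration: the way the example keywords are split into six queries, the tiny `STOP_WORDS` set, and the `crude_stem` rule, which a real project would likely replace with a proper stemmer. Note that the sketch reports stems (`train`, not `training`).

```python
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"for", "the", "a", "of", "to", "in"}


def crude_stem(word):
    """Strip a trailing 'ing' so that 'learning' and 'learn' match (illustrative only)."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


def count_terms(queries, max_n=2):
    """Count stemmed n-grams (n = 1..max_n) across all queries."""
    counts = Counter()
    for query in queries:
        # Drop stop words first, then stem what is left.
        tokens = [crude_stem(t) for t in query.lower().split() if t not in STOP_WORDS]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


# One plausible reading of the example keyword list (an assumption).
queries = [
    "java online training",
    "learning java",
    "scala training",
    "training for java",
    "online training java",
    "learn scala programming",
]

counts = count_terms(queries)
for term, freq in counts.most_common(8):
    print(freq, term)
```

Terms that occur in several queries rise to the top, and a bigram such as `online train` can be preferred over its parts when it is nearly as frequent as they are.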

Dendrogram or Other Plot from Distance Matrix

Question: I have three matrices to compare. Each of them is 5x6. I originally wanted to use hierarchical clustering to cluster the matrices, so that the most similar matrices are grouped, given a similarity threshold. I could not find any such function in Python, so I implemented the distance measure by hand (p-norm with p=2). Now I have a 3x3 distance matrix (which I believe is also a similarity matrix in this case). I am now trying to produce a dendrogram. This is my code, and this is what is
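The workflow described in this question can be sketched with SciPy, under the assumption that the hand-written distance is the entrywise 2-norm of the difference between two matrices. The three matrices below are made-up placeholders. One detail that often matters: `linkage` expects a condensed distance vector, so a square distance matrix is converted with `squareform` first; passing the square matrix directly would treat its rows as observations.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Three hypothetical 5x6 matrices: b is close to a, c is far from both.
a = rng.random((5, 6))
b = a + 0.01
c = a + 5.0
matrices = [a, b, c]

# Pairwise distances: entrywise 2-norm (Frobenius norm) of the difference.
n = len(matrices)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = np.linalg.norm(matrices[i] - matrices[j])

# Convert the square matrix to condensed form before clustering.
Z = linkage(squareform(dist), method="average")

# Cut the tree at a distance threshold (1.0 is an arbitrary choice here).
labels = fcluster(Z, t=1.0, criterion="distance")

# no_plot=True returns the tree layout without needing matplotlib;
# dropping it draws the dendrogram on the current matplotlib axes.
tree = dendrogram(Z, no_plot=True, labels=["A", "B", "C"])
print(labels, tree["ivl"])
```

Since a distance matrix measures dissimilarity, a similarity matrix would first have to be converted (for example by subtracting from its maximum) before being used this way.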

Scipy's sparse eigsh() for small eigenvalues

Question: I'm trying to write a spectral clustering algorithm using NumPy/SciPy for larger (but still tractable) systems, making use of SciPy's sparse linear algebra library. Unfortunately, I'm running into stability issues with eigsh(). Here's my code: import numpy as np import scipy.sparse import scipy.sparse.linalg as SLA import sklearn.utils.graph as graph W = self._sparse_rbf_kernel(self.X_, self.datashape) D = scipy.sparse.csc_matrix(np.diag(np.array(W.sum(axis = 0))[0])) L = graph.graph
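A commonly suggested way to get the smallest eigenvalues out of `eigsh()` reliably is shift-invert mode, sketched below. This is not the asker's code: the affinity matrix `W` is a made-up path graph, and the shift `sigma=-1e-3` is an assumption chosen slightly below zero because an unnormalized graph Laplacian is singular, so a shift of exactly zero would require factorizing a singular matrix.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Hypothetical affinity matrix: a path graph on n nodes.
n = 50
off_diag = np.ones(n - 1)
W = sp.diags([off_diag, off_diag], offsets=[-1, 1], format="csc")

# Build the degree matrix sparsely instead of going through a dense np.diag.
degrees = np.asarray(W.sum(axis=0)).ravel()
D = sp.diags(degrees, format="csc")
L = (D - W).tocsc()  # unnormalized graph Laplacian

# Shift-invert: with sigma set, which="LM" returns the eigenvalues
# closest to sigma, i.e. the smallest ones for this positive
# semi-definite matrix.
vals, vecs = eigsh(L, k=3, sigma=-1e-3, which="LM")
vals = np.sort(vals)
print(vals)
```

Asking for `which="SM"` without a shift tends to converge slowly for this kind of matrix, which is one reason shift-invert is usually tried first.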

How to cluster an instance with Weka's DBSCAN?

Question: I've been trying to use the DBSCAN clusterer from Weka to cluster instances. From what I understand I should be using the clusterInstance() method for this, but to my surprise, when taking a look at the code of that method, it looks like the implementation ignores the parameter: /** * Classifies a given instance. * * @param instance The instance to be assigned to a cluster * @return int The number of the assigned cluster as an integer * @throws java.lang.Exception If instance could not be

k means clustering algorithm

Question: I want to perform a k-means clustering analysis on a set of 10 data points, each of which has an array of 4 numeric values associated with it. I'm using the Pearson correlation coefficient as the distance metric. I did the first two steps of the k-means clustering algorithm, which were: 1) Select a set of initial centres of k clusters. [I selected two initial centres at random] 2) Assign each object to the cluster with the closest centre. [I used the Pearson correlation coefficient as the
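The two steps listed in this question, plus the centre update that usually follows them, can be sketched in plain Python. The assumptions are stated loudly: the distance is taken as `1 - r`, the six data points are invented (the question has ten), and the first and third points stand in for the randomly chosen initial centres.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined for constant input)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


def assign(points, centres):
    """Step 2: give each point the index of the centre with the smallest 1 - r."""
    return [
        min(range(len(centres)), key=lambda k: 1 - pearson(p, centres[k]))
        for p in points
    ]


def update(points, labels, centres):
    """Next step: move each centre to the mean of its members."""
    new_centres = []
    for k, old in enumerate(centres):
        members = [p for p, label in zip(points, labels) if label == k]
        # Keep the old centre if a cluster ends up empty.
        new_centres.append([mean(col) for col in zip(*members)] if members else old)
    return new_centres


# Hypothetical data: 6 points with 4 values each.
points = [
    [1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1],
    [9, 6, 4, 2], [1, 3, 2, 5], [5, 2, 3, 1],
]
centres = [points[0], points[2]]  # step 1: two initial centres

labels = assign(points, centres)
new_centres = update(points, labels, centres)
print(labels)
```

Repeating `assign` and `update` until the labels stop changing completes the loop. With a correlation-based distance the plain mean is only one possible centre update, so convergence is not guaranteed in the way it is for squared Euclidean distance.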

Use Absolute Pearson Correlation as Distance in K-Means Algorithm (MATLAB)

Question: I need to do some clustering using a correlation distance, but instead of the built-in 'distance' option 'correlation', which is defined as d=1-r, I need the absolute Pearson distance. In my application, anti-correlated data should get the same cluster ID. Currently, when using the kmeans() function, I'm getting centroids that are highly anti-correlated, which I would like to avoid by combining them. I'm not that fluent in MATLAB yet and have some problems reading the kmeans function. Would it be
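The difference between the two distances mentioned in this question can be made concrete with a short sketch. It is written in Python rather than MATLAB, and it assumes that "absolute Pearson distance" means `d = 1 - |r|`; the example vectors are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(sxx * syy)


def correlation_distance(x, y):
    """The usual correlation distance, d = 1 - r, ranging from 0 to 2."""
    return 1 - pearson(x, y)


def abs_correlation_distance(x, y):
    """Assumed definition of the absolute Pearson distance, d = 1 - |r|."""
    return 1 - abs(pearson(x, y))


rising = [1, 2, 3, 4]
falling = [8, 6, 4, 2]  # perfectly anti-correlated with `rising`

print(correlation_distance(rising, falling))      # far apart under d = 1 - r
print(abs_correlation_distance(rising, falling))  # identical under d = 1 - |r|
```

Under `1 - |r|` the two vectors are indistinguishable, which is what lets anti-correlated data share a cluster. A centroid computed as a plain mean of such members would cancel out, so a k-means variant built on this distance would probably need to flip the sign of anti-correlated members before averaging.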

Effective clustering of a similarity matrix

Question: My topic is similarity and clustering of (a bunch of) text(s). In a nutshell: I want to cluster collected texts together so that they end up in meaningful clusters. My approach so far is as follows; my problem is in the clustering. The current software is written in PHP. 1) Similarity: I treat every document as a "bag-of-words" and convert words into vectors. I use filtering (only "real" words), tokenization (split sentences into words), stemming (reduce words to
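For the clustering step that this question asks about, one simple option among many is to link every pair of documents whose similarity reaches a threshold and take the connected groups as clusters, which amounts to cutting a single-linkage tree. The sketch below is in Python rather than PHP, and the similarity matrix and the threshold of 0.5 are invented for illustration.

```python
def cluster_by_threshold(sim, threshold):
    """Group indices whose pairwise similarity chain stays at or above threshold."""
    n = len(sim)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.80],
    [0.20, 0.10, 0.80, 1.00],
]

clusters = cluster_by_threshold(sim, threshold=0.5)
print(clusters)
```

Single linkage can chain loosely related documents into one large cluster, so average or complete linkage may give tighter groups when that becomes a problem.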

Plotting the boundaries of cluster zone in Python with scikit package

Question: Here is my simple example of clustering data with 3 attributes (x, y, value). Each sample represents its location (x, y) and its associated variable. My code is posted here: x = np.arange(100,200,1) y = np.arange(100,200,1) value = np.random.random(100*100) xx,yy = np.meshgrid(x,y) xx = xx.reshape(100*100) yy = yy.reshape(100*100) j = np.dstack((xx,yy,value))[0,:,:] fig = plt.figure(figsize =(12,4)) ax1 = plt.subplot(121) xi,yi = np.meshgrid(x,y) va = value.reshape(100,100) pc = plt

Exporting dendrogram as table in R

Question: I would like to export an hclust dendrogram from R into a data table in order to subsequently import it into another ("home-made") software. str(unclass(fit)) provides a text overview of the dendrogram, but what I'm looking for is really a numeric table. I've looked at the Bioconductor ctc package, but the output it produces looks somewhat cryptic. I would like to have something similar to this table: http://stn.spotfire.com/spotfire_client_help/heat/heat_importing_exporting_dendrograms

Matlab cluster coding - plot scatter graph

Question: I have a daily energy consumption data set for a one-year period. I would like to show a scatter graph of this data set separated into the four clusters which I expect to exist (due to the differences between the four seasons). I understand that MATLAB's cluster function can do this, but my statistics is very rusty and I was hoping to get some guidance on which function is best to use. Thanks. Answer 1: Consider the following example of hierarchical clustering applied to the Fisher Iris dataset