cluster-analysis

How to cluster search engine keywords?

泪湿孤枕 submitted on 2019-12-09 18:34:09
Question: From Google Analytics I have a (long) list of keywords that people used in search engines to find my website. I want to find the 'core keywords'. Hypothetical example: java online training learning java scala training training for java online training java learn scala programming. The ideal result would be: 'java', 'online training', 'training', 'scala' and 'learn'. The difficulty seems to be detecting complete phrases, ignoring common words ('for') and handling variations ('learn' vs. 'learning'). Is
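A minimal sketch of the three steps this question names (ignoring common words, handling variations, detecting phrases), using only the Python standard library. It is an illustration, not the accepted answer. The split of the example into six separate queries is an assumption, and `STOPWORDS` and `VARIANTS` are tiny hand-written stand-ins for a real stop-word list and stemmer.

```python
from collections import Counter

# Hypothetical stop-word list and variant map; a real pipeline would use
# a fuller stop-word list and a proper stemmer or lemmatizer.
STOPWORDS = {"for", "the", "a", "of", "to", "in"}
VARIANTS = {"learning": "learn"}

def normalise(query):
    """Lowercase, map known variants to a base form, drop stop words."""
    words = [VARIANTS.get(w, w) for w in query.lower().split()]
    return [w for w in words if w not in STOPWORDS]

def count_terms(queries, max_n=2):
    """Count every word n-gram of length 1..max_n across all queries."""
    counts = Counter()
    for query in queries:
        words = normalise(query)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

# Assumed split of the example keyword list into individual queries.
queries = [
    "java online training",
    "learning java",
    "scala training",
    "training for java",
    "online training java",
    "learn scala programming",
]

counts = count_terms(queries)
for term, freq in counts.most_common(8):
    print(freq, term)
```

Terms and phrases that recur across many queries are candidates for 'core keywords'. Note that stop words are removed before the bigrams are formed, so "training for java" contributes the bigram "training java".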

Dendrogram or Other Plot from Distance Matrix

廉价感情. submitted on 2019-12-09 18:03:08
Question: I have three matrices to compare. Each of them is 5x6. I originally wanted to use hierarchical clustering to group the most similar matrices, given a similarity threshold. I could not find any such function in Python, so I implemented the distance measure by hand (p-norm with p=2). Now I have a 3x3 distance matrix (which I believe is also a similarity matrix in this case). I am now trying to produce a dendrogram. This is my code, and this is what is
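For a precomputed distance matrix like the one described, one possible route is SciPy's hierarchical clustering module. The sketch below assumes a made-up 3x3 matrix (the asker's actual values are not shown) and average linkage, which is only one of several valid choices. The key step is `squareform`, which converts the square matrix to the condensed form that `linkage` expects.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric 3x3 distance matrix with a zero diagonal.
dist = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.5],
    [4.0, 3.5, 0.0],
])

# linkage() expects the condensed (upper-triangle) vector, not the square matrix.
condensed = squareform(dist)

# Average linkage is an assumption; "single" or "complete" would also work.
Z = linkage(condensed, method="average")

# Cut the tree at a distance threshold to obtain flat cluster labels.
labels = fcluster(Z, t=2.0, criterion="distance")

# no_plot=True returns the dendrogram layout without requiring matplotlib;
# drop it (and call plt.show()) to draw the actual figure.
tree = dendrogram(Z, no_plot=True, labels=["A", "B", "C"])
print(labels, tree["ivl"])
```

Passing the square matrix straight to `linkage` would make SciPy treat its rows as observation vectors, which is a different computation from clustering on the given distances.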

Scipy's sparse eigsh() for small eigenvalues

巧了我就是萌 submitted on 2019-12-09 16:14:06
Question: I'm trying to write a spectral clustering algorithm using NumPy/SciPy for larger (but still tractable) systems, making use of SciPy's sparse linear algebra library. Unfortunately, I'm running into stability issues with eigsh(). Here's my code:

    import numpy as np
    import scipy.sparse
    import scipy.sparse.linalg as SLA
    import sklearn.utils.graph as graph

    W = self._sparse_rbf_kernel(self.X_, self.datashape)
    D = scipy.sparse.csc_matrix(np.diag(np.array(W.sum(axis = 0))[0]))
    L = graph.graph
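One approach often suggested for finding the smallest eigenvalues with `eigsh()` is shift-invert mode: pass a `sigma` near the target eigenvalues together with `which="LM"`. The sketch below is an assumption about what might help, not the answer from the original thread. It uses a hypothetical path-graph Laplacian in place of the asker's RBF-kernel Laplacian, and a `sigma` slightly below zero so that `L - sigma*I` is not singular.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 50

# Hypothetical test graph: a path where node i is linked to node i + 1.
off = np.ones(n - 1)
W = sp.diags([off, off], offsets=[-1, 1], format="csc")

# Unnormalised graph Laplacian L = D - W, built without any dense matrix.
degrees = np.asarray(W.sum(axis=0)).ravel()
L = (sp.diags(degrees) - W).tocsc()

# Shift-invert: eigenvalues closest to sigma are returned. The Laplacian is
# positive semi-definite, so a small negative sigma targets its smallest
# eigenvalues while keeping L - sigma*I invertible.
vals, vecs = eigsh(L, k=3, sigma=-0.01, which="LM")
vals = np.sort(vals)
print(vals)
```

Building the degree matrix with `sp.diags` also avoids the dense `np.diag` intermediate used in the question's code, which matters for larger systems.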

How to cluster an instance with Weka's DBSCAN?

最后都变了- submitted on 2019-12-09 15:37:22
Question: I've been trying to use the DBSCAN clusterer from Weka to cluster instances. From what I understand I should be using the clusterInstance() method for this, but to my surprise, when taking a look at the code of that method, it looks like the implementation ignores the parameter:

    /**
     * Classifies a given instance.
     *
     * @param instance The instance to be assigned to a cluster
     * @return int The number of the assigned cluster as an integer
     * @throws java.lang.Exception If instance could not be

k means clustering algorithm

白昼怎懂夜的黑 submitted on 2019-12-09 13:51:13
Question: I want to perform a k-means clustering analysis on a set of 10 data points, each of which has an array of 4 numeric values associated with it. I'm using the Pearson correlation coefficient as the distance metric. I did the first two steps of the k-means clustering algorithm, which were: 1) Select a set of initial centres of k clusters. [I selected two initial centres at random] 2) Assign each object to the cluster with the closest centre. [I used the Pearson correlation coefficient as the
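The two steps listed above, plus the centre update that usually follows, can be sketched as below. The Pearson coefficient r is a similarity rather than a distance, so this sketch assumes the common conversion d = 1 - r. The four data points are made up (the question's ten points are not shown), and the first and third points are taken as initial centres instead of a random pick so the run is repeatable.

```python
import numpy as np

def correlation_distance(a, b):
    """Pearson correlation distance d = 1 - r (0 when perfectly correlated)."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

def assign(points, centres):
    """Step 2: give each point the index of its closest centre."""
    return np.array([
        np.argmin([correlation_distance(p, c) for c in centres])
        for p in points
    ])

def update(points, labels, k):
    """Recompute each centre as the mean of its assigned points.

    Assumes no cluster is empty; an empty cluster would need special handling.
    """
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])

# Hypothetical data: four points with four values each.
points = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 9.0],
    [4.0, 3.0, 2.0, 1.0],
    [9.0, 6.0, 4.0, 2.0],
])

centres = points[[0, 2]]          # Step 1: initial centres (fixed here).
labels = assign(points, centres)  # Step 2: assignment.
new_centres = update(points, labels, k=2)
print(labels)
print(new_centres)
```

In a full run, the assignment and update steps repeat until the labels stop changing.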

Use Absolute Pearson Correlation as Distance in K-Means Algorithm (MATLAB)

江枫思渺然 submitted on 2019-12-09 13:44:30
Question: I need to do some clustering using a correlation distance, but instead of the built-in 'distance', 'correlation' option, which is defined as d = 1 - r, I need the absolute Pearson distance. In my application, anti-correlated data should get the same cluster ID. Currently, when using the kmeans() function, I am getting centroids that are highly anti-correlated, which I would like to avoid by combining them. I am not that fluent in MATLAB yet and have some problems reading the kmeans function. Would it be
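To make the difference between the two distances concrete, here is a small sketch in Python rather than MATLAB. It assumes that "absolute Pearson distance" means d = 1 - |r|, which is one common definition; the function names and sample vectors are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """Standard correlation distance d = 1 - r, ranging over [0, 2]."""
    return 1.0 - pearson(a, b)

def abs_pearson_distance(a, b):
    """Assumed absolute variant d = 1 - |r|, ranging over [0, 1]."""
    return 1.0 - abs(pearson(a, b))

x = [1.0, 2.0, 3.0, 4.0]
anti = [8.0, 6.0, 4.0, 2.0]  # perfectly anti-correlated with x

print(correlation_distance(x, anti))  # far apart under d = 1 - r
print(abs_pearson_distance(x, anti))  # coincident under d = 1 - |r|
```

Under d = 1 - r, anti-correlated vectors are as far apart as possible, which explains why they end up in different clusters. Under d = 1 - |r| they coincide, so they would share a cluster ID.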

Effective clustering of a similarity matrix

人盡茶涼 submitted on 2019-12-09 06:24:07
Question: My topic is similarity and clustering of (a bunch of) text(s). In a nutshell: I want to cluster collected texts so that they end up in meaningful clusters. My approach so far is described below; my problem is with the clustering. The current software is written in PHP. 1) Similarity: I treat every document as a "bag-of-words" and convert words into vectors. I use filtering (only "real" words), tokenization (split sentences into words), stemming (reduce words to

Plotting the boundaries of cluster zone in Python with scikit package

江枫思渺然 submitted on 2019-12-09 05:51:03
Question: Here is a simple example of clustering data with 3 attributes (x, y, value). Each sample has a location (x, y) and an associated value. My code is posted here:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.arange(100,200,1)
    y = np.arange(100,200,1)
    value = np.random.random(100*100)
    xx,yy = np.meshgrid(x,y)
    xx = xx.reshape(100*100)
    yy = yy.reshape(100*100)
    j = np.dstack((xx,yy,value))[0,:,:]
    fig = plt.figure(figsize =(12,4))
    ax1 = plt.subplot(121)
    xi,yi = np.meshgrid(x,y)
    va = value.reshape(100,100)
    pc = plt

Exporting dendrogram as table in R

拜拜、爱过 submitted on 2019-12-09 04:55:06
Question: I would like to export an hclust dendrogram from R into a data table in order to subsequently import it into another ("home-made") piece of software. str(unclass(fit)) provides a text overview of the dendrogram, but what I'm really looking for is a numeric table. I've looked at the Bioconductor ctc package, but the output it produces looks somewhat cryptic. I would like to have something similar to this table: http://stn.spotfire.com/spotfire_client_help/heat/heat_importing_exporting_dendrograms

Matlab cluster coding - plot scatter graph

北城以北 submitted on 2019-12-09 03:25:21
Question: I have a data set of daily energy consumption covering one year. I would like to show a scatter graph of this data set separated into the four clusters which I expect to exist (due to the differences between the four seasons). I understand that the MATLAB cluster function can do this, but my statistics is very rusty and I was hoping to get some guidance on which function is best to use. Thanks
Answer 1: Consider the following example of hierarchical clustering applied to the Fisher Iris dataset