cluster-analysis

Online k-means clustering

送分小仙女 submitted on 2019-12-31 09:12:14
Question: Is there an online version of the k-means clustering algorithm? By online I mean that every data point is processed serially, one at a time as it enters the system, hence saving computing time when used in real time. I have written one myself with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master's thesis. Also, does anyone have advice for other online clustering algorithms? (lmgtfy failed ;)) Answer 1: Yes there is. Google
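A minimal sketch of the sequential ("online") k-means update often attributed to MacQueen, where each incoming point moves its nearest centroid by a step of 1/n_j. This is only an illustration, not the standardized reference the asker wants; the random initialization and data are assumptions. (For a library implementation, scikit-learn's MiniBatchKMeans also exposes a partial_fit method for streaming data.)

    import numpy as np

    class OnlineKMeans:
        """Sequential k-means: centroids are updated one point at a time."""
        def __init__(self, k, dim, seed=0):
            rng = np.random.default_rng(seed)
            self.centroids = rng.normal(size=(k, dim))  # assumed random initialization
            self.counts = np.zeros(k, dtype=int)

        def partial_fit(self, x):
            x = np.asarray(x, dtype=float)
            j = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))  # nearest centroid
            self.counts[j] += 1
            # move the winning centroid toward x with step size 1/n_j
            self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
            return j

    # usage: feed points one at a time as they arrive
    model = OnlineKMeans(k=3, dim=2)
    for point in np.random.default_rng(1).normal(size=(100, 2)):
        model.partial_fit(point)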

Using a smoother with the L Method to determine the number of K-Means clusters

霸气de小男生 submitted on 2019-12-31 09:02:21
Question: Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or did it allow a lower number of k-means trials and hence a much greater increase in speed? Which smoothing algorithm/method did you use? The "L-Method" is detailed in: Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan. This calculates the evaluation metric
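A sketch of the idea being asked about, under the assumption that a simple moving average is an acceptable smoother: smooth the evaluation-metric curve first, then locate the knee with the L-method's two-straight-line fit (Salvador & Chan). The metric values below are placeholders.

    import numpy as np

    def moving_average(y, w=3):
        # simple box smoother; w is an assumed window size
        return np.convolve(y, np.ones(w) / w, mode="same")

    def l_method_knee(y):
        # fit two straight lines to (k, metric) and return the split that
        # minimizes the size-weighted RMSE, as in the L-method
        x = np.arange(1, len(y) + 1)
        best_c, best_err = None, np.inf
        for c in range(2, len(y) - 1):
            left = np.polyfit(x[:c], y[:c], 1)
            right = np.polyfit(x[c:], y[c:], 1)
            rmse_l = np.sqrt(np.mean((np.polyval(left, x[:c]) - y[:c]) ** 2))
            rmse_r = np.sqrt(np.mean((np.polyval(right, x[c:]) - y[c:]) ** 2))
            err = (c * rmse_l + (len(y) - c) * rmse_r) / len(y)
            if err < best_err:
                best_c, best_err = c, err
        return best_c

    metric = np.array([9.0, 5.1, 3.2, 2.9, 2.7, 2.6, 2.5, 2.45, 2.4, 2.38])  # placeholder curve
    knee = l_method_knee(moving_average(metric))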

How to find the optimal point for DBSCAN() parameters in R

a 夏天 submitted on 2019-12-31 07:04:27
Question: How do I find optimal and appropriate values for the DBSCAN() parameters (eps, MinPts)? DBSCAN() from the package fpc implements the DBSCAN (density-based clustering) method. Answer 1: You can find strategies for choosing minPts and epsilon discussed in the original DBSCAN paper: Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226-231). Also read up on some
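The heuristic most often cited for these parameters, sketched here with scikit-learn rather than the R fpc package from the question: fix minPts from the dimensionality (a common rule of thumb is about twice the number of features), then plot each point's distance to its minPts-th nearest neighbour in sorted order and read eps off the elbow of that curve. The generated data are placeholders.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # placeholder data

    min_pts = 2 * X.shape[1]                      # rule of thumb: ~2 * dimensionality
    nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
    dists, _ = nn.kneighbors(X)                   # distances to the min_pts nearest neighbours
    k_dist = np.sort(dists[:, -1])                # distance to the min_pts-th neighbour, sorted

    plt.plot(k_dist)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {min_pts}-th nearest neighbour")
    plt.show()                                    # pick eps near the elbow of this curve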

Clustering Strings Based on Similar Word Sequences

大兔子大兔子 submitted on 2019-12-31 04:42:31
Question: I am looking for an efficient way to cluster about 10 million strings into clusters based on the appearance of similar word sequences. Consider a list of strings like:

the fruit hut number one
the ice cre am shop number one
jim's taco
ice cream shop in the corner
the ice cream shop
the fruit hut
jim's taco outlet number one
jim's t aco in the corner
the fruit hut in the corner

After the algorithm runs on them I want them clustered as follows: the ice cre am shop number one ice cream shop in
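One hedged sketch of how such strings can be grouped (in Python; the question does not name a language): vectorize each string with character n-grams, which tolerate the stray spaces like "cre am" and "t aco", then cluster the vectors by cosine distance. This brute-force version will not scale to the 10 million strings asked about; for that scale, something like MinHash/LSH blocking would be needed first.

    from sklearn.cluster import DBSCAN
    from sklearn.feature_extraction.text import TfidfVectorizer

    strings = [
        "the fruit hut number one", "the ice cre am shop number one", "jim's taco",
        "ice cream shop in the corner", "the ice cream shop", "the fruit hut",
        "jim's taco outlet number one", "jim's t aco in the corner",
        "the fruit hut in the corner",
    ]

    # character n-grams make near-duplicates with broken spacing look similar
    tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
    X = tfidf.fit_transform(strings)

    # eps is an assumed threshold and would need tuning on real data
    labels = DBSCAN(eps=0.5, min_samples=1, metric="cosine").fit_predict(X)
    for label, s in sorted(zip(labels, strings)):
        print(label, s)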

What clustering algorithm to use on 1-d data? [closed]

坚强是说给别人听的谎言 submitted on 2019-12-31 04:16:30
Question (closed 6 years ago as too broad/unclear): I have a list of numbers in an array. The index of each element is X and the value is Y. How do I go about partitioning/clustering this data? If I had an array, I just want a set of values which mark the end of
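For the 1-D case described (index X, value Y), one simple sketch is ordinary k-means on the values reshaped to a single column; the "values which mark the end of" each group then fall between adjacent sorted cluster means. The data and k below are assumptions; exact dynamic-programming methods for 1-D data (e.g. Ckmeans.1d.dp) also exist.

    import numpy as np
    from sklearn.cluster import KMeans

    y = np.array([1.1, 1.3, 0.9, 5.2, 5.0, 5.4, 9.8, 10.1, 9.9])   # placeholder values
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(y.reshape(-1, 1))

    # boundaries between groups: midpoints between adjacent sorted cluster means
    means = sorted(y[labels == c].mean() for c in np.unique(labels))
    boundaries = [(a + b) / 2 for a, b in zip(means, means[1:])]
    print(labels, boundaries)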

Colour the tick labels in a dendrogram to match the cluster colours

半腔热情 submitted on 2019-12-30 14:38:45
Question: How can I individually colour the labels of a dendrogram so that they match the colours of the clusters in MATLAB? Here is an example of the desired output, generated using the code in my answer below (note the labels are just the 50-character series 'A':'r'): If there is a more straightforward way to do this, please do post an answer, as I was unable to find the solution to this by googling. If not, the code is below for posterity. Answer 1: I could not find a definitive answer to this but I managed to

How to assign a new observation to existing k-means clusters based on nearest-cluster-centroid logic in Python?

时光怂恿深爱的人放手 submitted on 2019-12-30 11:17:08
Question: I used the code below to create k-means clusters using scikit-learn.

    kmean = KMeans(n_clusters=nclusters, n_jobs=-1, random_state=2376, max_iter=1000,
                   n_init=1000, algorithm='full', init='k-means++')
    kmean_fit = kmean.fit(clus_data)

I have also saved the centroids using kmean_fit.cluster_centers_. I then pickled the k-means object.

    filename = pickle_path + '\\' + '_kmean_fit.sav'
    pickle.dump(kmean_fit, open(filename, 'wb'))

So that I can load the same k-means pickle object and apply it to new data when
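The fitted KMeans object already implements the nearest-centroid assignment via its predict method, so (a sketch reusing the variable names from the question; the path and new data are placeholders) it should be enough to unpickle the model and call predict on the new observations.

    import pickle
    import numpy as np

    filename = "kmean_fit.sav"                    # placeholder for the pickled path above
    with open(filename, "rb") as f:
        kmean_fit = pickle.load(f)

    new_data = np.array([[0.2, 1.4, -0.7]])       # must have the same features as clus_data
    cluster_ids = kmean_fit.predict(new_data)     # index of the nearest cluster centroid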

How to generate a 'clusterable' dataset in MATLAB

你。 submitted on 2019-12-30 10:28:06
Question: I need to test my Gap Statistic algorithm (which should tell me the optimum k for the dataset), and in order to do so I need to generate a big dataset that is easily clusterable, so that I know a priori the optimum number of clusters. Do you know any fast way to do it? Answer 1: It very much depends on what kind of dataset you expect - 1D, 2D, 3D, normal distribution, sparse, etc.? And how big is "big"? Thousands, millions, billions of observations? Anyway, my general approach to creating easy-to-identify
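The usual recipe is to sample each cluster from its own well-separated Gaussian so the true k is known in advance. The question asks about MATLAB (where randn or mvnrnd plus distinct means does the same job); this is a NumPy sketch with the sizes and separation as assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    k, dim, n_per_cluster, spread = 5, 2, 10_000, 20.0   # assumed sizes

    # cluster centres far apart relative to the unit within-cluster variance
    centers = rng.uniform(-spread, spread, size=(k, dim))
    X = np.vstack([c + rng.normal(size=(n_per_cluster, dim)) for c in centers])
    labels = np.repeat(np.arange(k), n_per_cluster)       # ground-truth k for the Gap Statistic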

How to color a dendrogram's labels according to defined groups? (in R)

时光毁灭记忆、已成空白 submitted on 2019-12-30 10:08:40
Question: I have a numeric matrix in R with 24 rows and 10,000 columns. The row names of this matrix are basically file names from which I have read the data corresponding to each of the 24 rows. Apart from this I have a separate factor list with 24 entries, specifying the group to which each of the 24 files belongs. There are 3 groups - Alcohols, Hydrocarbon and Ester. The names and the corresponding group to which they belong look like this: > MS.mz [1] "int-354.19" "int-361.35" "int-368.35" "int-396.38" "int