cluster-analysis | 易学教程

Parameter estimation in DBSCAN

Question: I need to find naturally occurring classes of nouns based on their distribution with different prepositions (agentive, instrumental, time, place, etc.). I tried k-means clustering, but it did not work well: there was a lot of overlap among the classes I was looking for (probably because the classes are non-globular and because of the random initialisation in k-means). I am now trying DBSCAN, but I have trouble understanding the epsilon value and the min-points value in
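A common heuristic for choosing epsilon is the k-distance plot: compute each point's distance to its k-th nearest neighbour, sort those distances, and look for the "knee" where they jump. The sketch below is only an illustration under stated assumptions. The toy points, the value of `min_pts`, the choice `k = min_pts - 1` (which assumes the min-points count includes the point itself), and the crude largest-jump knee rule are not from the question.

```python
import math

def k_distances(points, k):
    """Return the sorted distances from each point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point (fine for a small sample).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical toy data: two tight groups plus one far-away outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (20.0, 20.0)]

min_pts = 4  # assumed value, chosen only for this illustration
kd = k_distances(points, k=min_pts - 1)

# Crude knee detection (an assumption, not a standard rule):
# take the k-distance just before the largest jump in the sorted curve.
jumps = [b - a for a, b in zip(kd, kd[1:])]
knee = jumps.index(max(jumps))
eps_candidate = kd[knee]

print("sorted k-distances:", [round(d, 3) for d in kd])
print("candidate eps:", round(eps_candidate, 3))
```

In practice the sorted curve is usually plotted and the knee read off by eye. On a large data set the brute-force neighbour search would be replaced by a spatial index.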

plot data structure as a tree in R

Question: I'm using the sizetree() function from the plotrix package to draw my data structure as a tree (see below), and it works just fine. However, I was wondering whether another approach (or package) would give a more elegant tree plot of the same data with the same information displayed. (Note: in the plot below, the fonts are either too big or too small, as are the rectangles, and the plot could perhaps be inverted to look better.) It's subjective, but I appreciate

Is K-means for clustering data with many zero values?

Question: I need to cluster a matrix that contains mostly zero values. Is k-means appropriate for this kind of data, or do I need to consider a different algorithm? Answer 1: No. The reason is that the mean is not sensible on sparse data. The resulting mean vectors will have very different characteristics from your actual data; they will often end up being more similar to each other than to actual documents! There are some modifications that improve k-means for sparse data, such as spherical k-means. But
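The answer's point about means on sparse data can be seen on a toy example. The sketch below uses made-up binary "documents" over eight hypothetical terms, split into two arbitrary groups. It compares the density of the group means with the density of the documents, and the cosine similarity between the two means with the similarity between a mean and its own documents. All data and helper names are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def density(v):
    """Fraction of non-zero components."""
    return sum(1 for x in v if x != 0) / len(v)

# Hypothetical sparse documents: each uses only 2 of 8 terms.
docs_a = [[1, 1, 0, 0, 0, 0, 0, 0],
          [0, 0, 1, 1, 0, 0, 0, 0],
          [0, 0, 0, 0, 1, 1, 0, 0]]
docs_b = [[0, 1, 1, 0, 0, 0, 0, 0],
          [0, 0, 0, 1, 1, 0, 0, 0],
          [0, 0, 0, 0, 0, 1, 1, 0]]

mean_a, mean_b = mean(docs_a), mean(docs_b)

doc_density = density(docs_a[0])    # how sparse a single document is
mean_density = density(mean_a)      # the mean is much denser
mean_to_mean = cosine(mean_a, mean_b)
mean_to_docs = max(cosine(mean_a, d) for d in docs_a)

print("document density:", doc_density, "mean density:", mean_density)
print("mean-to-mean similarity:", round(mean_to_mean, 3))
print("best mean-to-document similarity:", round(mean_to_docs, 3))
```

Here the two mean vectors resemble each other more than either resembles any document it was computed from, which is the behaviour the answer warns about.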

How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

Question: I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) work best for the kind of data I'm using (news articles). I have some test data consisting of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV, but I don't understand how (or whether) it can be applied in this case, since it needs the test data to be split, but I want to

graphics window not working properly in `kml` package

Question: I started working with the kml package to perform longitudinal cluster analysis. The package claims to have an interactive graphics window that lets you explore the clusterings found by kml. According to the docs, the window can be opened by calling the function choice. But that window does not open. Instead I get an error: Error in setGraphicsEventEnv(which, as.environment(list(...))) : this graphics device does not support event handling. From the docs (?choice): At first, choice opens a

How can we show the trajectories belonging to clusters in `kml` package?

Question: The kml package implements k-means for longitudinal data. The clustering works just fine. Now I'm wondering how I can show the 'structure' of the clusters, for example by coloring them. A very simple example from the docs (the help file of the clusterLongData function): library(kml) traj <- matrix(c(1,2,3,1,4, 3,6,1,8,10, 1,2,1,3,2, 4,2,5,6,3, 4,3,4,4,4, 7,6,5,5,4),6) myCld <- clusterLongData( traj=traj, idAll=as.character(c(100,102,103,109,115,123)), time=c(1,2,4,8,15), varNames="P", maxNA=3

Fast (< n^2) clustering algorithm

Question: I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. the clusters could be bounding spheres with a specified radius). That means there will probably have to be many clusters of size 1. But I need the running time to be well below n^2; n log n or so should be fine. The reason I'm doing this clustering is to avoid computing a distance matrix of all n points (which takes n^2 time, or many hours),
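One way to meet a "no two points too far apart" constraint without a distance matrix is to hash points into a regular grid, which takes a single pass over the data. The sketch below is an assumed approach for illustration, not an answer taken from the thread. The function name `grid_cluster`, the parameter `max_diameter`, and the sample points are hypothetical. The cell side is chosen as `max_diameter / sqrt(d)` so that the diagonal of a cell, and therefore the distance between any two points sharing a cell, stays within `max_diameter`.

```python
import math
from collections import defaultdict

def grid_cluster(points, max_diameter):
    """Group point indices by grid cell; points in one cell are within max_diameter."""
    dim = len(points[0])
    # A cell's diagonal is side * sqrt(dim), so this side bounds the cluster diameter.
    side = max_diameter / math.sqrt(dim)
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(x / side) for x in p)
        cells[key].append(idx)
    return list(cells.values())

# Hypothetical 5-dimensional sample: three nearby points and one distant point.
points = [(0.10, 0.10, 0.1, 0.1, 0.1),
          (0.20, 0.10, 0.1, 0.1, 0.2),
          (3.00, 3.00, 3.0, 3.0, 3.0),
          (0.15, 0.12, 0.1, 0.1, 0.1)]

clusters = grid_cluster(points, max_diameter=1.0)
print(clusters)
```

The trade-off is that two close points on opposite sides of a cell boundary end up in different clusters, so this produces more clusters than strictly necessary. Whether that is acceptable depends on how the clusters are used afterwards.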