cluster-analysis | 易学教程

Some questions on dendrogram - python (Scipy)

Question: I am new to SciPy, but I managed to get the expected dendrogram. I have some more questions:
1. In the dendrogram, the distance between some points is 0, but this is not visible because of the image border. How can I remove the border and set the lower limit of the y-axis to -1 so that it is clearly visible? For example, the distance between these points is 0: (13,17), (2,10), (4,8,19).
2. How can I prune/truncate at a particular distance, for example at 0.4?
3. How can I write these clusters (after pruning) to a file?
My Python code:
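The asker's code is not part of this excerpt, so the following is a separate, minimal sketch of the three requests rather than a fix to the original script. It assumes the tree comes from `scipy.cluster.hierarchy.linkage`; the helper names (`cut_tree_at`, `write_clusters`, `plot_dendrogram`), the toy data, the average-linkage choice, and the output format are all made up for illustration. The lower y-limit is set to a small negative fraction of the plot height instead of the literal -1 from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage


def cut_tree_at(Z, threshold):
    """Group observation indices by flat cluster after cutting Z at `threshold`."""
    labels = fcluster(Z, t=threshold, criterion="distance")
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(int(label), []).append(index)
    return clusters


def write_clusters(clusters, path):
    """Write one line per cluster: the label, a tab, then the member indices."""
    with open(path, "w") as handle:
        for label in sorted(clusters):
            members = " ".join(str(i) for i in clusters[label])
            handle.write(f"{label}\t{members}\n")


def plot_dendrogram(Z, path, threshold=None):
    """Save a dendrogram whose zero-height merges are not hidden by the frame."""
    # matplotlib is imported here so the helpers above work without it.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    dendrogram(Z, ax=ax)
    top = ax.get_ylim()[1]
    ax.set_ylim(-0.05 * top, top)  # push the lower limit slightly below zero
    for side in ("top", "right", "bottom"):
        ax.spines[side].set_visible(False)  # drop the border lines
    if threshold is not None:
        ax.axhline(threshold, linestyle="--")  # mark the cut height
    fig.savefig(path)
    plt.close(fig)


# Toy data (made up): points 0 and 1 coincide, so they merge at height 0.
X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
Z = linkage(X, method="average")
clusters = cut_tree_at(Z, threshold=0.4)
```

Calling `write_clusters(clusters, "clusters.txt")` then stores one cluster per line, and `plot_dendrogram(Z, "dendrogram.png", threshold=0.4)` saves the figure with the cut height marked.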

How to cluster data with discrete binary attributes?

Question: My data has ten million binary attributes, but only some of them are informative; most of them are zeros. The format is as follows:

data  attribute1  attribute2  attribute3  attribute4  ...
A     0           1           0           1           ...
B     1           0           1           0           ...
C     1           1           0           1           ...
D     1           1           0           0           ...

What is a smart way to cluster this? I know k-means clustering, but I don't think it is suitable in this case, because the binary values make distances less obvious. And it will suffer from the curse of high
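The excerpt ends before any answer, so the following is only one possible direction, not a recommendation from the source: store the mostly-zero rows as a sparse matrix and compare them with the Jaccard distance, which ignores attributes that are 0 in both rows. The function name `sparse_jaccard_distances` is hypothetical, and the sketch assumes the number of rows is small enough for a dense rows-by-rows result.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_jaccard_distances(rows):
    """Pairwise Jaccard distances between binary rows, via sparse products."""
    X = csr_matrix(rows, dtype=np.int64)
    # Entry (i, j) counts the attributes that are 1 in both row i and row j.
    intersection = (X @ X.T).toarray()
    sizes = np.asarray(X.sum(axis=1)).ravel()  # number of 1s per row
    union = sizes[:, None] + sizes[None, :] - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        # Two all-zero rows are treated as identical (similarity 1).
        similarity = np.where(union > 0, intersection / union, 1.0)
    return 1.0 - similarity


# The four example rows A, B, C, D from the question.
rows = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
]
D = sparse_jaccard_distances(rows)
```

A distance matrix like `D` can be passed to any clustering routine that accepts precomputed distances.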

BOW prediction of cluster for training data

Question: I am creating a bag of visual words for classifying videos. I am not using SURF descriptors, which is why I couldn't use OpenCV's BOWImgDescriptorExtractor for this purpose. I extracted my descriptors and clustered them myself, so I now have my vocabulary (of size 4000). What I should do next is assign my training descriptors to these clusters and create visual histograms for the next steps. How should I do this prediction and create the visual histogram for my training data from the created
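One way to sketch the assignment step described above, assuming the vocabulary is an array of cluster centres with one row per visual word and that nearest-centre (Euclidean) assignment is acceptable. `bow_histogram` is a hypothetical helper, and the tiny vocabulary below stands in for the 4000-word one.

```python
import numpy as np
from scipy.cluster.vq import vq


def bow_histogram(descriptors, vocabulary, normalize=True):
    """Count how many descriptors fall on each visual word."""
    descriptors = np.asarray(descriptors, dtype=float)
    vocabulary = np.asarray(vocabulary, dtype=float)
    # vq returns, for every descriptor, the index of its nearest centre.
    codes, _ = vq(descriptors, vocabulary)
    hist = np.bincount(codes, minlength=len(vocabulary)).astype(float)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()  # make videos with different descriptor counts comparable
    return hist


# Made-up example: a 2-word vocabulary and four 2-D descriptors.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[0.0, 1.0], [9.0, 9.0], [10.0, 11.0], [1.0, 0.0]]
hist = bow_histogram(descriptors, vocabulary)
```

Running this once per training video gives one fixed-length histogram per video, which can then serve as the feature vector for a classifier.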

Clustering longitude and latitude gps data

Question: I have more than 400 thousand car GPS locations, like: [ 25.41452217, 37.94879532], [ 25.33231735, 37.93455887], [ 25.44327736, 37.96868896], ... I need to do spatial clustering where the distance between points is <= 3 meters. I tried to use DBSCAN, but it does not seem to work for geographic (longitude, latitude) data. Also, I do not know the number of clusters.
Answer 1: You can use pairwise_distances to calculate the geographic distance from latitude/longitude and then pass the distance matrix into DBSCAN,
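As a variation on the answer above (not the answer's own code), the sketch below lets scikit-learn's DBSCAN compute haversine distances itself through a ball tree, which avoids building a full distance matrix for 400 thousand points. It assumes each row is [latitude, longitude] in degrees, which the excerpt does not confirm, and it uses an approximate mean Earth radius. `meters_to_radians` and `cluster_gps` are hypothetical names.

```python
import numpy as np

EARTH_RADIUS_M = 6371008.8  # approximate mean Earth radius in meters


def meters_to_radians(meters):
    """Convert a ground distance to the angular distance haversine works in."""
    return meters / EARTH_RADIUS_M


def cluster_gps(points_deg, max_dist_m=3.0, min_samples=2):
    """Label GPS points with DBSCAN; noise points get the label -1."""
    # scikit-learn is imported here so the helper above works without it.
    from sklearn.cluster import DBSCAN

    # The haversine metric expects [lat, lon] in radians (column order assumed).
    points_rad = np.radians(np.asarray(points_deg, dtype=float))
    model = DBSCAN(
        eps=meters_to_radians(max_dist_m),
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(points_rad)
```

With `min_samples=2`, any two points within 3 meters of each other form a cluster, and isolated points are reported as noise; the number of clusters does not have to be chosen in advance.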

Python: computing pairwise distances causes memory error

Question: I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.

    from scipy.spatial.distance import pdist
    pairwise_distances = pdist(X, 'cosine')
    # pdist is supposed to return a numpy array with shape (57832*57831,).

However, this causes a memory error.

    Traceback (most recent call last):
      File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
        result
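For scale: `pdist` returns a condensed array with n(n-1)/2 entries (half of what the comment in the question states), which for n = 57832 is still about 1.67 billion float64 values, roughly 13.4 GB. The sketch below computes that size and shows one possible workaround: producing the distances in row blocks so that only one block is in memory at a time. The helper names and the block size are assumptions, not part of the original question.

```python
import numpy as np
from scipy.spatial.distance import cdist


def condensed_length(n):
    """Number of entries pdist returns for n observations."""
    return n * (n - 1) // 2


def cosine_distance_blocks(X, block_size=1024):
    """Yield (start, stop, block), where block holds distances from rows
    start..stop-1 to every row of X."""
    X = np.asarray(X, dtype=float)
    for start in range(0, len(X), block_size):
        stop = min(start + block_size, len(X))
        yield start, stop, cdist(X[start:stop], X, metric="cosine")


n_entries = condensed_length(57832)
n_bytes = n_entries * 8  # float64 takes 8 bytes per entry
```

Each block can be reduced on the fly (for example to the k nearest neighbours per row) or written to disk before the next one is computed.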

Clustering using a custom distance metric for lat/long pairs

Question: I'm trying to specify a custom clustering function for the scikit-learn DBSCAN implementation:

    def geodistance(latLngA, latLngB):
        print(latLngA, latLngB)
        return vincenty(latLngA, latLngB).miles

    cluster_labels = DBSCAN(
        eps=500,
        min_samples=max(2, len(found_geopoints) // 10),
        metric=geodistance
    ).fit(np.array(found_geopoints)).labels_

However, when I print out the arguments to my distance function, they aren't at all what I would expect:

    [ 0.53084126 0.19584111 0.99640966 0.88013373 0.33753788 0
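A minimal sketch of a callable metric for DBSCAN, under two assumptions that are not in the question: the `vincenty` call is swapped for a plain haversine formula with an approximate Earth radius, and `algorithm="brute"` is requested so that the callable is applied to pairs of input rows. scikit-learn passes each row to the callable as a 1-D NumPy array, so the function only reads the first two entries, as [latitude, longitude] in degrees. `haversine_miles` and `cluster_geopoints` are hypothetical names.

```python
import math

import numpy as np

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two [lat, lon] rows in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # min() guards against rounding pushing the square root above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(h)))


def cluster_geopoints(points, eps_miles=500.0, min_samples=2):
    """Run DBSCAN with the custom metric; noise points get the label -1."""
    # scikit-learn is imported here so the metric above works without it.
    from sklearn.cluster import DBSCAN

    X = np.asarray(points, dtype=float)
    model = DBSCAN(
        eps=eps_miles,
        min_samples=min_samples,
        metric=haversine_miles,
        algorithm="brute",
    )
    return model.fit(X).labels_
```

Because the metric returns miles, `eps` is also in miles, matching the `eps=500` used in the question.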

Finding the center of a cluster

Question: I have the following problem, made abstract to bring out the key issues. I have 10 points, each of which is some distance from the others. I want to be able to find the center of the cluster, i.e. the point for which the pairwise distance to each other point is minimised. Let p(j) ~ p(k) represent the pairwise distance between points j and k. p(i) is the center point of the cluster iff p(i) s.t. min[sum(p(j)~p(k))] for all 0 < j,k <= n, where we have n points in the cluster. Determine how to split the
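The point described above, the one with the smallest total distance to all the others, is usually called the medoid. A minimal sketch, assuming the points are coordinate vectors and Euclidean distance is meant; `medoid_index` and the sample points are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def medoid_index(points, metric="euclidean"):
    """Index of the point with the smallest summed distance to all others."""
    D = squareform(pdist(np.asarray(points, dtype=float), metric=metric))
    # Row i of D holds the distances from point i to every point.
    return int(np.argmin(D.sum(axis=1)))


# Made-up points on a line; the outlier at x = 10 does not become the center.
points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]]
center = medoid_index(points)
```

If only the pairwise distances are known, not the coordinates, the same row-sum and `argmin` step can be applied directly to that distance matrix.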

C/C++ Machine Learning Libraries for Clustering [closed]

Question (closed as off-topic on Stack Overflow; not currently accepting answers): What are some C/C++ machine learning libraries that support clustering of multi-dimensional data (for example, k-means)? So far I have come across SGI MLC++ (http://www.sgi.com/tech/mlc/) and OpenCV MLL. I am tempted to roll my own, but I am sure pre-existing ones are far better performance-optimized, with more eyes on

Online k-means clustering

Question: Is there an online version of the k-means clustering algorithm? By online I mean that every data point is processed serially, one at a time as it enters the system, hence saving computing time when used in real time. I have written one myself with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master's thesis. Also, does anyone have advice on other online clustering algorithms? (lmgtfy failed ;))
Answer 1: Yes, there is. Google
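The answer is cut off in this excerpt, so the following is not its content. It is a minimal sketch of the common sequential (running-mean) form of k-means, in which each incoming point moves its nearest centroid by a step of 1/count. The class name `OnlineKMeans` and the choice to seed the centroids with the first k points are assumptions for illustration.

```python
import numpy as np


class OnlineKMeans:
    """Sequential k-means: each point updates one centroid, then is discarded."""

    def __init__(self, k):
        self.k = k
        self.centers = []  # list of 1-D float arrays
        self.counts = []   # number of points absorbed by each centroid

    def partial_fit(self, x):
        """Process one point and return the index of the centroid it joined."""
        x = np.asarray(x, dtype=float)
        if len(self.centers) < self.k:
            # Seed the centroids with the first k points (a simple, assumed choice).
            self.centers.append(x.copy())
            self.counts.append(1)
            return len(self.centers) - 1
        distances = [np.linalg.norm(x - c) for c in self.centers]
        j = int(np.argmin(distances))
        self.counts[j] += 1
        # Running-mean update: move the centroid a 1/count step towards x.
        self.centers[j] += (x - self.centers[j]) / self.counts[j]
        return j


# Made-up stream of 1-D points arriving one at a time.
model = OnlineKMeans(k=2)
for value in [0.0, 10.0, 2.0, 8.0, 1.0]:
    model.partial_fit([value])
```

With this update each centroid stays equal to the mean of the points assigned to it so far. For a library implementation, scikit-learn's `MiniBatchKMeans` offers a `partial_fit` method for incremental updates.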