cluster-analysis

Some questions on dendrogram - python (Scipy)

孤人 submitted on 2020-01-03 16:45:11
Question: I am new to SciPy, but I managed to get the expected dendrogram. I have some more questions. In the dendrogram, the distance between some points is 0, but this is not visible because of the image border. How can I remove the border and set the lower limit of the y-axis to -1, so that it is clearly visible? For example, the distance between these points is 0: (13,17), (2,10), (4,8,19). How can I prune/truncate at a particular distance, e.g. prune at 0.4? How can I write these clusters (after pruning) to a file? My Python code:
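The asker's own code is cut off above, so the following is only a minimal sketch of the pruning and file-writing steps, assuming SciPy's `linkage` and `fcluster`. The sample `points`, the `cut_and_save` helper, the average-linkage choice, and the output file name are all hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cut_and_save(points, threshold, path, method="average"):
    """Cut a hierarchical clustering at `threshold` and write the clusters to `path`."""
    Z = linkage(points, method=method)  # (n - 1) x 4 linkage matrix
    # criterion="distance" cuts the tree at the given cophenetic distance.
    labels = fcluster(Z, t=threshold, criterion="distance")
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(int(label), []).append(index)
    with open(path, "w") as handle:
        for label in sorted(clusters):
            members = " ".join(str(i) for i in clusters[label])
            handle.write(f"{label}: {members}\n")
    return clusters


# Hypothetical data: two tight pairs and one distant point.
points = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1], [20.0, 20.0]])
output_path = os.path.join(tempfile.gettempdir(), "clusters.txt")
clusters = cut_and_save(points, threshold=0.4, path=output_path)
```

Each line of the output file holds one cluster label followed by the row indices it contains. For the plot itself, the lower y-limit can usually be moved below zero by calling matplotlib's `set_ylim` on the axes after drawing the dendrogram.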

How to cluster data with discrete binary attributes?

左心房为你撑大大i submitted on 2020-01-03 04:40:15
Question: In my data there are ten million binary attributes, but only some of them are informative; most of them are zeros. The format is as follows:
data  attribute1  attribute2  attribute3  attribute4  .........
A     0           1           0           1           .........
B     1           0           1           0           .........
C     1           1           0           1           .........
D     1           1           0           0           .........
What is a smart way to cluster this? I know k-means clustering, but I don't think it is suitable in this case, because the binary values make distances less obvious. And it will suffer from the curse of high
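One commonly suggested alternative for sparse binary data is a set-overlap measure such as the Jaccard distance combined with hierarchical clustering. The sketch below is an assumption about what could work, not an answer taken from the thread; the `cluster_binary` helper, the toy `rows`, and the 0.5 cut-off are hypothetical, and a dense array like this would not scale to ten million attributes without a sparse representation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_binary(rows, max_distance=0.5):
    """Cluster binary rows with Jaccard distance and average linkage."""
    data = np.asarray(rows, dtype=bool)
    # Drop attributes that are zero for every row; they carry no information.
    informative = data[:, data.any(axis=0)]
    distances = pdist(informative, metric="jaccard")  # condensed distance vector
    Z = linkage(distances, method="average")
    return fcluster(Z, t=max_distance, criterion="distance")


# Hypothetical toy data; the last attribute is all zeros and gets dropped.
rows = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
]
labels = cluster_binary(rows)
```

Jaccard distance ignores attributes that are zero in both rows, which is why it tends to behave better than Euclidean distance when most values are zero.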

BOW prediction of cluster for training data

六眼飞鱼酱① submitted on 2020-01-03 03:28:24
Question: I am creating a bag of visual words for classifying videos. I am not using SURF descriptors, which is why I couldn't use OpenCV's BOWImgDescriptorExtractor for this purpose. I extracted my descriptors and clustered them myself. I now have my vocabulary (of size 4000). What I should do next is assign my training descriptors to these clusters and create a visual histogram for the next steps. How should I do this prediction and create a visual histogram for my training data from the created
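The assignment step described here can be sketched as a nearest-centroid lookup followed by a count. This is a minimal NumPy illustration, assuming Euclidean distance and a vocabulary stored as a (words x dimensions) array; `bow_histogram` and the toy arrays are hypothetical names and data.

```python
import numpy as np


def bow_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and return a normalized histogram."""
    descriptors = np.asarray(descriptors, dtype=float)
    vocabulary = np.asarray(vocabulary, dtype=float)
    # Squared Euclidean distances via ||d||^2 - 2 d.v + ||v||^2.
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ vocabulary.T
        + (vocabulary ** 2).sum(axis=1)[None, :]
    )
    words = d2.argmin(axis=1)  # index of the nearest cluster center per descriptor
    counts = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return counts / counts.sum()


# Hypothetical toy vocabulary of 3 words and 4 descriptors from one video.
vocabulary = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
descriptors = [[1.0, 1.0], [9.0, 9.0], [0.5, 0.0], [10.0, 11.0]]
histogram = bow_histogram(descriptors, vocabulary)
```

One histogram per video (or per image) would then serve as the fixed-length feature vector for the classifier.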

Clustering longitude and latitude gps data

房东的猫 submitted on 2020-01-02 09:59:28
Question: I have more than 400 thousand car GPS locations, like: [ 25.41452217, 37.94879532], [ 25.33231735, 37.93455887], [ 25.44327736, 37.96868896], ... I need to do spatial clustering with the distance between points <= 3 meters. I tried to use DBSCAN, but it does not seem to work for geographic (longitude, latitude) data. Also, I do not know the number of clusters.
Answer 1: You can use pairwise_distances to calculate the geographic distance from latitude/longitude and then pass the distance matrix into DBSCAN,
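The distance-matrix idea from the answer can be sketched with a haversine (great-circle) formula in plain NumPy. This is an illustration under assumptions: the rows are taken to be (latitude, longitude) in degrees, the Earth is treated as a sphere of mean radius 6,371 km, and `haversine_matrix` and the sample coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_matrix(coords_deg):
    """Pairwise great-circle distances in meters for rows of (lat, lon) in degrees."""
    coords = np.radians(np.asarray(coords_deg, dtype=float))
    lat = coords[:, 0][:, None]
    lon = coords[:, 1][:, None]
    dlat = lat - lat.T
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot before the square root.
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Hypothetical points: two on the equator, one degree of longitude apart.
sample = [[0.0, 0.0], [0.0, 1.0]]
distances_m = haversine_matrix(sample)
```

A matrix like this can be passed to scikit-learn's `DBSCAN` with `metric="precomputed"` and `eps=3`. For 400 thousand points the full matrix would not fit in memory, so in practice `DBSCAN(metric="haversine", algorithm="ball_tree")` on coordinates in radians, with `eps` expressed as 3 divided by the Earth radius in meters, is likely the more workable route.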

Python: computing pairwise distances causes memory error

做~自己de王妃 submitted on 2020-01-02 09:36:45
Question: I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.
    from scipy.spatial.distance import pdist
    pairwise_distances = pdist(X, 'cosine')
    # pdist is supposed to return a numpy array with shape (57832*57831/2,).
However, this causes a memory error.
    Traceback (most recent call last):
      File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
        result
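The condensed result of `pdist` for 57832 vectors has n(n-1)/2 entries, which comes to roughly 13.4 GB as float64, so running out of memory is not surprising. One possible workaround, sketched here as an assumption rather than as the thread's answer, is to compute cosine distances one block of rows at a time and reduce each block before moving on. `cosine_distance_blocks`, `X_demo`, and the block size are hypothetical.

```python
import numpy as np


def cosine_distance_blocks(X, block_size=1024):
    """Yield (start, block), where block[i, j] is the cosine distance
    between row start + i and row j of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms  # assumes no all-zero rows
    for start in range(0, len(unit), block_size):
        yield start, 1.0 - unit[start:start + block_size] @ unit.T


# Size of the condensed pdist result for the question's data, in bytes.
n = 57832
condensed_bytes = n * (n - 1) // 2 * 8

# Hypothetical small demo: stitch the blocks back into a full matrix.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
full = np.vstack([block for _, block in cosine_distance_blocks(X_demo, block_size=2)])
```

In real use each block would be consumed immediately (for example, keeping only the nearest neighbors per row) rather than stacked, since stacking rebuilds the full matrix.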

Clustering using a custom distance metric for lat/long pairs

随声附和 submitted on 2020-01-01 09:21:17
Question: I'm trying to specify a custom distance function for the scikit-learn DBSCAN implementation:
    def geodistance(latLngA, latLngB):
        print(latLngA, latLngB)
        return vincenty(latLngA, latLngB).miles
    cluster_labels = DBSCAN(
        eps=500,
        min_samples=max(2, len(found_geopoints) // 10),
        metric=geodistance
    ).fit(np.array(found_geopoints)).labels_
However, when I print out the arguments to my distance function, they aren't at all what I would expect: [ 0.53084126 0.19584111 0.99640966 0.88013373 0.33753788 0
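One way to keep full control over what the distance function receives is to build the distance matrix yourself and hand it to DBSCAN as precomputed. This is a sketch of that idea only, not a diagnosis of the output above; `precomputed_matrix` is a hypothetical helper, and `flat_distance` is a stand-in for the asker's `geodistance`, which depends on a geodesic library.

```python
import math

import numpy as np


def precomputed_matrix(points, distance):
    """Build a symmetric distance matrix by calling `distance` on every pair."""
    n = len(points)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = distance(points[i], points[j])
    return matrix


def flat_distance(a, b):
    """Stand-in metric (plain Euclidean); swap in a geodesic function as needed."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical points.
found_points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
matrix = precomputed_matrix(found_points, flat_distance)
```

The resulting array can then be clustered with `DBSCAN(eps=..., min_samples=..., metric="precomputed").fit(matrix)`. The double loop is quadratic in the number of points, so this only suits modest data sizes.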

Finding the center of a cluster

血红的双手。 submitted on 2020-01-01 03:11:40
Question: I have the following problem, made abstract to bring out the key issues. I have 10 points, each of which is some distance from the others. I want to be able to find the center of the cluster, i.e. the point for which the pairwise distance to each other point is minimised. Let p(j) ~ p(k) represent the pairwise distance between points j and k. p(i) is the center point of the cluster iff p(i) s.t. min[sum(p(j)~p(k))] for all 0 < j,k <= n, where we have n points in the cluster. determine how to split the
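The point described here, the one whose summed distance to all other points is smallest, is usually called the medoid. A minimal sketch, assuming the pairwise distances are already available as a square matrix; `medoid_index` and the toy one-dimensional positions are hypothetical.

```python
import numpy as np


def medoid_index(distances):
    """Return the index of the point with the smallest total distance to all others."""
    distances = np.asarray(distances, dtype=float)
    return int(distances.sum(axis=1).argmin())


# Hypothetical points on a line; pairwise distance is the absolute difference.
positions = np.array([0.0, 2.0, 3.0, 10.0, 4.0])
pairwise = np.abs(positions[:, None] - positions[None, :])
center = medoid_index(pairwise)
```

Here the row sums are 19, 13, 12, 31 and 13, so the point at position 3.0 (index 2) is chosen. Unlike a mean, the medoid is always one of the original points and only needs distances, not coordinates.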

C/C++ Machine Learning Libraries for Clustering [closed]

旧巷老猫 submitted on 2019-12-31 17:40:42
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 6 years ago. What are some C/C++ machine learning libraries that support clustering of multi-dimensional data (for example, k-means)? So far I have come across SGI MLC++ (http://www.sgi.com/tech/mlc/) and OpenCV MLL. I am tempted to roll my own, but I am sure pre-existing ones are far better performance-optimized, with more eyes on

Online k-means clustering

丶灬走出姿态 submitted on 2019-12-31 09:12:52
Question: Is there an online version of the k-means clustering algorithm? By online I mean that every data point is processed serially, one at a time as it enters the system, hence saving computing time when used in real time. I have written one myself with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master's thesis. Also, does anyone have advice on other online clustering algorithms? (lmgtfy failed ;))
Answer 1: Yes, there is. Google
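The answer is cut off above, so the following is not its content. It is a sketch of the variant often called sequential k-means, in which each arriving point moves its nearest centroid toward itself by a running-mean step. The `OnlineKMeans` class, the choice to seed centroids with the first k points, and the toy stream are assumptions for illustration.

```python
import numpy as np


class OnlineKMeans:
    """Sequential k-means: update one centroid per incoming point."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # one float array per cluster
        self.counts = []     # number of points absorbed by each centroid

    def update(self, x):
        """Absorb one point and return the index of the cluster it joined."""
        x = np.asarray(x, dtype=float)
        if len(self.centroids) < self.k:
            # Assumption: the first k points seed the centroids.
            self.centroids.append(x.copy())
            self.counts.append(1)
            return len(self.centroids) - 1
        distances = [np.linalg.norm(x - c) for c in self.centroids]
        nearest = int(np.argmin(distances))
        self.counts[nearest] += 1
        # Running mean: shift the centroid by 1/count of the gap to the new point.
        self.centroids[nearest] += (x - self.centroids[nearest]) / self.counts[nearest]
        return nearest


# Hypothetical one-dimensional stream.
model = OnlineKMeans(k=2)
assignments = [model.update(point) for point in ([0.0], [10.0], [2.0], [12.0])]
```

Each update costs one distance computation per centroid and no stored history, which is what makes the approach suitable for streaming. The result depends on arrival order and on how the centroids are seeded.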