cluster-analysis

Python: String clustering with scikit-learn's DBSCAN, using Levenshtein distance as the metric

十年热恋 submitted on 2020-01-12 04:40:09
Question: I have been trying to cluster multiple datasets of URLs (around 1 million each) to find the original and the typos of each URL. I decided to use Levenshtein distance as the similarity metric, along with DBSCAN as the clustering algorithm, since k-means won't work because I do not know the number of clusters. I am facing some problems with scikit-learn's implementation of DBSCAN. The snippet below works on small datasets in the format I am using, but since it is precomputing the
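
A minimal sketch of the small-dataset version of this approach, assuming the python-Levenshtein package is available; the urls list, eps, and min_samples values are illustrative. Precomputing the full pairwise matrix is exactly what stops this from scaling to a million URLs.

```python
import numpy as np
from sklearn.cluster import DBSCAN
import Levenshtein  # assumed installed via the python-Levenshtein package

urls = ["example.com", "exampel.com", "examples.com", "other.org", "othr.org"]

# Precompute the pairwise Levenshtein (edit) distance matrix.
# This is O(n^2) in both time and memory, which is why it breaks down
# for datasets of ~1 million URLs.
n = len(urls)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = Levenshtein.distance(urls[i], urls[j])
        dist[i, j] = dist[j, i] = d

# eps is the maximum edit distance allowed within a cluster; label -1 marks noise.
labels = DBSCAN(eps=2, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)
```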

3D clustering Algorithm

三世轮回 submitted on 2020-01-11 16:30:42
Question: Problem statement: There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain many points. Goal: to find an algorithm that can scale well to many
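
This is not any answerer's algorithm, just a small greedy sketch of the stated requirements using a k-d tree: rank points by neighbor count within R, then keep the densest points whose pairwise distance exceeds R. It works on modest point counts; a billion points would need spatial partitioning or a distributed approach.

```python
import numpy as np
from scipy.spatial import cKDTree

def top_n_dense_points(points, R, N):
    """Greedy selection of up to N points with the most neighbors within R,
    subject to the selected points being more than R apart from each other."""
    tree = cKDTree(points)
    # Neighbor count within radius R for every point (includes the point itself).
    counts = np.array([len(ix) for ix in tree.query_ball_point(points, r=R)])
    selected = []
    for i in np.argsort(-counts):          # densest points first
        if all(np.linalg.norm(points[i] - points[j]) > R for j in selected):
            selected.append(i)
        if len(selected) == N:
            break
    return selected

pts = np.random.rand(20_000, 3)            # toy stand-in for the real data
print(top_n_dense_points(pts, R=0.05, N=5))
```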

Clustering: Cluster validation

江枫思渺然 submitted on 2020-01-11 12:57:11
Question: I want to use a clustering method on a large social network dataset. The problem is how to evaluate the clustering method. Yes, I can use external, internal, and relative cluster validation methods. I used normalized mutual information (NMI) as an external validation method, based on synthetic data: I produced a synthetic dataset of 5 clusters with an equal number of nodes, strongly connected links inside each cluster, and weak links between clusters
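
A tiny illustration of the NMI computation itself with scikit-learn (the labels here are made up, not taken from the question's synthetic network): compare the planted cluster assignment against the assignment a clustering method returns.

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical planted labels for a small synthetic network with 5 clusters,
# and the labels recovered by some clustering method.
true_labels  = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
found_labels = [0, 0, 1, 1, 2, 2, 3, 4, 4, 4]

# NMI is 1.0 for a perfect recovery and tends toward 0 for unrelated partitions;
# it is invariant to how the cluster ids are permuted.
print(normalized_mutual_info_score(true_labels, found_labels))
```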

how do I cluster a list of geographic points by distance?

被刻印的时光 ゝ submitted on 2020-01-11 03:26:04
Question: I have a list of points P = [p1, ..., pN] where pi = (latitude_i, longitude_i). Using Python 3, I would like to find the smallest set of clusters (disjoint subsets of P) such that every member of a cluster is within 20 km of every other member in the cluster. The distance between two points is computed using the Vincenty method. To make this a little more concrete, suppose I have a set of points such as: from numpy import * points = array([[33. , 41. ], [33.9693, 41.3923], [33.6074, 41.277 ], [34.4823, 41.919
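
One common way to get clusters whose members are pairwise within 20 km is complete-linkage hierarchical clustering on geodesic distances. This is only a sketch of that idea (it does not guarantee the minimum number of clusters, which is a harder problem), assuming geopy is installed and using its geodesic distance in place of Vincenty.

```python
import numpy as np
from geopy.distance import geodesic            # assumed installed; stands in for Vincenty
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[33.0, 41.0],
                   [33.9693, 41.3923],
                   [33.6074, 41.277],
                   [34.4823, 41.919]])          # rows are (latitude, longitude)

# Condensed pairwise distance vector in kilometres (upper triangle, row by row).
n = len(points)
d = [geodesic(tuple(points[i]), tuple(points[j])).km
     for i in range(n) for j in range(i + 1, n)]

# Complete linkage cut at 20 km keeps every pair inside a cluster within 20 km.
Z = linkage(d, method="complete")
labels = fcluster(Z, t=20, criterion="distance")
print(labels)
```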

Sklearn: Mean Distance from Centroid of each cluster

隐身守侯 submitted on 2020-01-11 01:44:08
Question: How can I find the mean distance from the centroid to all the data points in each cluster? I am able to find the Euclidean distance of each point in my dataset from the centroid of each cluster. Now I want to find the mean distance from the centroid to all the data points in each cluster. What is a good way of calculating the mean distance from each centroid? So far I have done this: def k_means(self): data = pd.read_csv('hdl_gps_APPLE_20111220_130416.csv', delimiter=',') combined_data = data
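
A short sketch with random stand-in data (not the question's CSV): fit KMeans, then for each cluster average the Euclidean distances of its members to their centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                     # stand-in for the question's data
km = KMeans(n_clusters=3, n_init=10).fit(X)

# Mean Euclidean distance from each centroid to the points assigned to it.
for k, centroid in enumerate(km.cluster_centers_):
    members = X[km.labels_ == k]
    mean_dist = np.linalg.norm(members - centroid, axis=1).mean()
    print(f"cluster {k}: mean distance to centroid = {mean_dist:.4f}")
```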

MiniBatchKMeans gives different centroids after subsequent iterations

巧了我就是萌 submitted on 2020-01-07 02:54:53
Question: I am using the MiniBatchKMeans model from the sklearn.cluster module in Anaconda. I am clustering a dataset that contains approximately 75,000 points. It looks something like this: data = np.array([8,3,1,17,5,21,1,7,1,26,323,16,2334,4,2,67,30,2936,2,16,12,28,1,4,190...]) I fit the data using the process below: from sklearn.cluster import MiniBatchKMeans kmeans = MiniBatchKMeans(batch_size=100) kmeans.fit(data.reshape(-1,1)) This is all well and okay, and I proceed to find the centroids of the
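
The run-to-run difference is expected: MiniBatchKMeans draws random mini-batches and uses random initialisation, so repeated fits yield slightly different centroids unless the seed is fixed. A small sketch (n_clusters=3 and the shortened data array are assumptions, not values from the question):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

data = np.array([8, 3, 1, 17, 5, 21, 1, 7, 1, 26, 323, 16,
                 2334, 4, 2, 67, 30, 2936, 2, 16, 12, 28, 1, 4, 190]).reshape(-1, 1)

# With random_state pinned, two separate fits produce identical centroids.
a = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0).fit(data)
b = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0).fit(data)
print(np.allclose(a.cluster_centers_, b.cluster_centers_))
```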

how to calculate massive dissimilarity matrix in R

拜拜、爱过 submitted on 2020-01-06 14:01:05
Question: I am currently working on clustering some big data, about 30k rows; the dissimilarity matrix is just too big for R to handle, and I think this is not purely a memory-size problem. Maybe there is some smart way to do this? Answer 1: If your data is so large that base R can't easily cope, then you have several options: work on a machine with more RAM, or use a commercial product, e.g. Revolution Analytics, which supports working with larger data in R. Here is an example using RevoScaleR, the commercial package
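
The answer's suggestion is R-specific (RevoScaleR), but the underlying idea of never materialising the full 30k x 30k matrix can be sketched in Python with scikit-learn's pairwise_distances_chunked, which yields the dissimilarity matrix one block of rows at a time so each block can be reduced immediately (here: the nearest neighbour of every row). The data and chunk size are illustrative.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.rand(30_000, 10)                 # stand-in for the ~30k-row data set

nearest, start = [], 0
for chunk in pairwise_distances_chunked(X, metric="euclidean", working_memory=64):
    rows = np.arange(chunk.shape[0])
    chunk[rows, start + rows] = np.inf         # mask each row's self-distance
    nearest.append(chunk.argmin(axis=1))       # reduce the block, then discard it
    start += chunk.shape[0]
nearest = np.concatenate(nearest)
print(nearest[:10])
```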

I want to calculate each column's sample deviation in data

↘锁芯ラ submitted on 2020-01-06 12:39:04
Question: I am doing cluster analysis based on the dataset "college", which consists of 3 nominal and 20 numeric variables. # select the columns based on the clustering results cluster_1 <- mat[which(groups==1),] # "cluster_1" is a data set produced by the cluster analysis, consisting of 125 observations. rbind(cluster_1[, -(1:3)], colMeans(cluster_1[, -(1:3)])) # This is the process of calculating each column's mean and attaching the means to the bottom of the data set "cluster_1". Now what I want to know is how to
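
The question is about R, but assuming "sample deviation" means the sample standard deviation, the same per-column summary can be sketched in pandas; the data frame below is a random stand-in for cluster_1's numeric columns, not the question's "college" data.

```python
import numpy as np
import pandas as pd

# Random stand-in for the numeric columns of the question's cluster_1 (125 rows).
cluster_1 = pd.DataFrame(np.random.rand(125, 5), columns=list("abcde"))

# Per-column mean (what colMeans gave) and sample standard deviation (ddof=1).
summary = pd.DataFrame({"mean": cluster_1.mean(),
                        "sample_sd": cluster_1.std(ddof=1)})
print(summary)
```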