cluster-analysis

Greedy clustering algorithm speed improvement

Posted by 五迷三道 on 2019-12-20 05:43:07
Question: I am trying to implement a very simple greedy clustering algorithm in Python, but am hard-pressed to optimize it for speed. The algorithm takes a distance matrix, finds the column with the most components below a predetermined distance cutoff, and stores the row indices (with components below the cutoff) as the members of the cluster. The centroid of the cluster is the column index. The columns and rows of each member index are then removed from the distance matrix (resulting in a…
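A minimal vectorized sketch of that loop in NumPy (the function name and the masking strategy are my own; masking rows and columns instead of physically deleting them avoids the repeated array reallocation that usually dominates the runtime):

```python
import numpy as np

def greedy_cluster(dist, cutoff):
    """Greedy clustering sketch: repeatedly pick the point with the most
    active neighbors within `cutoff`, record those neighbors as a cluster,
    then mask them out rather than shrinking the matrix."""
    n = dist.shape[0]
    active = np.ones(n, dtype=bool)   # points not yet assigned to a cluster
    within = dist < cutoff            # boolean neighbor matrix, computed once
    clusters = {}
    while active.any():
        # neighbor counts restricted to still-active rows and columns
        counts = (within & active[:, None] & active[None, :]).sum(axis=0)
        centroid = int(np.argmax(counts))
        members = np.where(within[:, centroid] & active)[0]
        clusters[centroid] = members.tolist()
        active[members] = False       # the centroid is its own neighbor, so it is masked too
    return clusters
```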

Using ELKI on custom objects and making sense of results

Posted by 你离开我真会死。 on 2019-12-20 05:00:24
Question: I am trying to use ELKI's SLINK implementation of hierarchical clustering in my program. I have a set of objects (of my own type) that need to be clustered. For that, I convert them to feature vectors before clustering. This is how I currently got it to run and produce some result (code is in Scala):

    val clusterer = new SLINK(CosineDistanceFunction.STATIC, 3)
    val connection = new ArrayAdapterDatabaseConnection(featureVectors)
    val database = new StaticArrayDatabase(connection, null)
    database…
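For comparison only, the same single-linkage (SLINK) pipeline can be sketched outside ELKI with SciPy; this is not ELKI's API, just an analogous setup with cosine distance and a three-cluster cut, with random data standing in for featureVectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

feature_vectors = np.random.rand(20, 5)   # placeholder for the question's featureVectors

# 'single' is the SLINK criterion; 'cosine' mirrors CosineDistanceFunction
tree = linkage(feature_vectors, method="single", metric="cosine")

# cut the dendrogram into 3 flat clusters, echoing the 3 in the SLINK constructor
labels = fcluster(tree, t=3, criterion="maxclust")
```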

Clustering: how to extract most distinguishing features?

Posted by Deadly on 2019-12-19 11:57:38
Question: I have a set of documents that I am trying to cluster based on their vocabulary (that is, first making a corpus and then a sparse matrix with the DocumentTermMatrix command, and so on). To improve the clusters and to better understand what features/words make a particular document fall into a particular cluster, I would like to know what the most distinguishing features for each cluster are. There is an example of this in the Machine Learning with R book by Lantz, if you happen to know it; he…
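One common way to surface distinguishing terms, sketched here in Python with scikit-learn rather than R/tm and with a toy corpus standing in for the real documents, is to rank each k-means centroid's TF-IDF weights and print the top terms per cluster:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats purr and meow", "dogs bark at cats",
        "stocks fell on weak earnings", "markets rallied as stocks rose"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = np.array(vec.get_feature_names_out())
for c, center in enumerate(km.cluster_centers_):
    top = terms[np.argsort(center)[::-1][:5]]   # highest-weight terms in this centroid
    print(f"cluster {c}: {', '.join(top)}")
```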

Spark KMeans clustering: get the number of sample assigned to a cluster

Posted by ◇◆丶佛笑我妖孽 on 2019-12-19 09:09:16
Question: I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center, so I will run k-means training on this set and select the cluster with the highest number of vectors assigned to it. Therefore I need to know the number of vectors assigned to each cluster after training (i.e. KMeans.run(...)). But I cannot find a way to retrieve this information from the KMeansModel result. I probably need to run predict on all training…
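Running predict over the training data and counting per-cluster assignments is indeed the usual workaround. A sketch in PySpark (the question may be using the Scala API, but the RDD-based MLlib calls are parallel; the data and app name here are made up):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-cluster-sizes")
data = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.2]])

model = KMeans.train(data, k=2, maxIterations=10)

# predict each vector's cluster id, then count occurrences per id
sizes = data.map(lambda v: (model.predict(v), 1)).reduceByKey(lambda a, b: a + b)
largest = max(sizes.collect(), key=lambda kv: kv[1])   # (clusterId, count)
print(largest)
```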

How to spread out community graph made by using igraph package in R

Posted by 北慕城南 on 2019-12-19 05:47:20
Question: I am trying to find communities in tweet data. The cosine similarity between different words forms the adjacency matrix, from which I then create a graph. Visualizing that graph is the task here:

    # Document Term Matrix
    dtm = DocumentTermMatrix(tweets)
    ### adjust threshold here
    dtms = removeSparseTerms(dtm, 0.998)
    dim(dtms)
    # cosine similarity matrix
    t = as.matrix(dtms)
    # comparing two word feature vectors
    # cosine(t[,"yesterday"], t[,"yet"])
    numWords = dim(t)[2]
    # cosine…
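To pull communities visually apart, one common trick is to run a force-directed layout with heavier weights on intra-community edges, so each community contracts on itself. A sketch with python-igraph (the question uses R's igraph, but the layout API is analogous; the Zachary karate-club graph stands in for the word-similarity graph):

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")            # placeholder for the word-similarity graph
communities = g.community_multilevel()
membership = communities.membership

# weight edges inside a community 3x so Fruchterman-Reingold pulls them together
weights = [3 if membership[e.source] == membership[e.target] else 1 for e in g.es]
layout = g.layout_fruchterman_reingold(weights=weights, niter=1000)

# plotting the clustering object colors vertices by community
ig.plot(communities, layout=layout, target="communities.png")
```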

How to find cluster sizes in 2D numpy array?

Posted by 柔情痞子 on 2019-12-19 05:25:47
Question: My problem is the following: I have a 2D numpy array filled with 0s and 1s, with an absorbing boundary condition (all the outer elements are 0), for example:

    [[0 0 0 0 0 0 0 0 0 0]
     [0 0 1 0 0 0 0 0 0 0]
     [0 0 1 0 1 0 0 0 1 0]
     [0 0 0 0 0 0 1 0 1 0]
     [0 0 0 0 0 0 1 0 0 0]
     [0 0 0 0 1 0 1 0 0 0]
     [0 0 0 0 0 1 1 0 0 0]
     [0 0 0 1 0 1 0 0 0 0]
     [0 0 0 0 1 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0]]

I want to create a function that takes this array and its linear dimension L as input parameters (in this case L = 10)…
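One way to get cluster sizes, assuming the 4-connectivity typical of such lattice problems, is scipy.ndimage.label, which labels connected components in a single pass (the function name below is my own):

```python
import numpy as np
from scipy import ndimage

def cluster_sizes(grid):
    """Label 4-connected clusters of 1s and return one size per cluster."""
    labeled, num = ndimage.label(grid)   # default structuring element = 4-connectivity
    return ndimage.sum(grid, labeled, index=np.arange(1, num + 1)).astype(int)

grid = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [0, 1, 0, 0]])
print(cluster_sizes(grid))   # [2 1]
```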

Weighted Kmeans R

Posted by ▼魔方 西西 on 2019-12-19 04:10:57
Question: I want to do k-means clustering on a dataset (namely, Sample_Data) with three variables (columns), such as below:

        A   B   C
    1   12  10  1
    2   8   11  2
    3   14  10  1
    .   .   .   .
    .   .   .   .
    .   .   .   .

Typically, after scaling the columns and determining the number of clusters, I would use this function in R:

    Sample_Data <- scale(Sample_Data)
    output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)

But what if there is a preference for the variables? I mean that, suppose variable (column) A is more…
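A standard way to encode such a preference is to rescale each standardized column by the square root of its weight, since k-means' squared Euclidean distance then weights that column's contribution proportionally. A sketch in Python/scikit-learn (the same column-multiplication trick works on scale()'s output in R; the data and weights here are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)             # hypothetical stand-in for Sample_Data (A, B, C)
weights = np.array([2.0, 1.0, 1.0])    # give column A twice the influence

# multiplying a standardized column by sqrt(w) scales its contribution
# to the squared Euclidean distance by w
X_weighted = StandardScaler().fit_transform(X) * np.sqrt(weights)
km = KMeans(n_clusters=5, n_init=50, random_state=0).fit(X_weighted)
```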

k-means: Same clusters for every execution

Posted by ◇◆丶佛笑我妖孽 on 2019-12-19 03:22:49
Question: Is it possible to get the same k-means clusters for every execution on a particular data set? Just as for a random value we can use a fixed seed, is it possible to stop the randomness in clustering?

Answer 1: Yes. Use set.seed to set a seed for the random number generator before doing the clustering. Using the example in kmeans:

    set.seed(1)
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    set.seed(2)
    XX <- kmeans(x, 2)
    set.seed(2)
    YY…
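The answer above is for R; for what it's worth, the analogous idea in Python is scikit-learn's random_state parameter, which fixes the centroid initialization and therefore the resulting clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(1.0, 0.3, size=(50, 2))])

# identical random_state => identical initialization => identical clusters
a = KMeans(n_clusters=2, random_state=2, n_init=10).fit_predict(x)
b = KMeans(n_clusters=2, random_state=2, n_init=10).fit_predict(x)
assert (a == b).all()
```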

Text clustering using Scipy Hierarchy Clustering in Python

Posted by て烟熏妆下的殇ゞ on 2019-12-18 18:27:11
Question: I have a text corpus that contains 1000+ articles, each on a separate line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles. This is the code I used to do the clustering:

    # Agglomerative Clustering
    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as hac

    tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
    plt.clf()
    hac.dendrogram(tree)
    plt.show()

and I got this plot. Then I cut off the tree at the third level…
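For cutting the tree into flat clusters, scipy.cluster.hierarchy.fcluster is the usual tool. A self-contained sketch, with random data standing in for the question's document-term matrix X:

```python
import numpy as np
import scipy.cluster.hierarchy as hac

X = np.random.rand(30, 4)   # placeholder for the TF-IDF / document-term matrix
tree = hac.linkage(X, method="complete", metric="euclidean")

# cut into a fixed number of flat clusters...
labels = hac.fcluster(tree, t=5, criterion="maxclust")

# ...or cut at a cophenetic-distance threshold ("a level" of the dendrogram)
labels_by_height = hac.fcluster(tree, t=1.0, criterion="distance")
```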