cluster-analysis

Greedy clustering algorithm speed improvement

Posted by 五迷三道 on 2019-12-20 05:43:07
Question: I am trying to implement a very simple greedy clustering algorithm in Python, but am hard-pressed to optimize it for speed. The algorithm takes a distance matrix, finds the column with the most components below a predetermined distance cutoff, and stores the row indices (with components below the cutoff) as the members of the cluster. The centroid of the cluster is the column index. The columns and rows of each member index are then removed from the distance matrix (resulting in a…
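A minimal vectorized sketch of that loop in NumPy (the function name and the masking strategy are my own; masking rows and columns instead of physically deleting them avoids the repeated array reallocation that usually dominates the runtime):

```python
import numpy as np

def greedy_cluster(dist, cutoff):
    """Greedy clustering sketch: repeatedly pick the point with the most
    active neighbors within `cutoff`, record those neighbors as a cluster,
    then mask them out rather than shrinking the matrix."""
    n = dist.shape[0]
    active = np.ones(n, dtype=bool)   # points not yet assigned to a cluster
    within = dist < cutoff            # boolean neighbor matrix, computed once
    clusters = {}
    while active.any():
        # neighbor counts restricted to still-active rows and columns
        counts = (within & active[:, None] & active[None, :]).sum(axis=0)
        centroid = int(np.argmax(counts))
        members = np.where(within[:, centroid] & active)[0]
        clusters[centroid] = members.tolist()
        active[members] = False       # the centroid is its own neighbor, so it is masked too
    return clusters
```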

Using ELKI on custom objects and making sense of results

Posted by 你离开我真会死。 on 2019-12-20 05:00:24
Question: I am trying to use ELKI's SLINK implementation of hierarchical clustering in my program. I have a set of objects (of my own type) that need to be clustered. For that, I convert them to feature vectors before clustering. This is how I currently got it to run and produce some result (code is in Scala):

    val clusterer = new SLINK(CosineDistanceFunction.STATIC, 3)
    val connection = new ArrayAdapterDatabaseConnection(featureVectors)
    val database = new StaticArrayDatabase(connection, null)
    database…
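For comparison only, the same single-linkage (SLINK) pipeline can be sketched outside ELKI with SciPy; this is not ELKI's API, just an analogous setup with cosine distance and a three-cluster cut, with random data standing in for featureVectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

feature_vectors = np.random.rand(20, 5)   # placeholder for the question's featureVectors

# 'single' is the SLINK criterion; 'cosine' mirrors CosineDistanceFunction
tree = linkage(feature_vectors, method="single", metric="cosine")

# cut the dendrogram into 3 flat clusters, echoing the 3 in the SLINK constructor
labels = fcluster(tree, t=3, criterion="maxclust")
```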

Clustering: how to extract most distinguishing features?

Posted by Deadly on 2019-12-19 11:57:38
Question: I have a set of documents that I am trying to cluster based on their vocabulary (that is, first making a corpus and then a sparse matrix with the DocumentTermMatrix command, and so on). To improve the clusters and to better understand what features/words make a particular document fall into a particular cluster, I would like to know what the most distinguishing features for each cluster are. There is an example of this in the Machine Learning with R book by Lantz, if you happen to know it; he…
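One common way to surface distinguishing terms, sketched here in Python with scikit-learn rather than R/tm and with a toy corpus standing in for the real documents, is to rank each k-means centroid's TF-IDF weights and print the top terms per cluster:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats purr and meow", "dogs bark at cats",
        "stocks fell on weak earnings", "markets rallied as stocks rose"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = np.array(vec.get_feature_names_out())
for c, center in enumerate(km.cluster_centers_):
    top = terms[np.argsort(center)[::-1][:5]]   # highest-weight terms in this centroid
    print(f"cluster {c}: {', '.join(top)}")
```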

Spark KMeans clustering: get the number of sample assigned to a cluster

Posted by ◇◆丶佛笑我妖孽 on 2019-12-19 09:09:16
Question: I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center, so I will run k-means training on this set and select the cluster with the highest number of vectors assigned to it. Therefore I need to know the number of vectors assigned to each cluster after training (i.e. KMeans.run(...)). But I cannot find a way to retrieve this information from the KMeansModel result. I probably need to run predict on all training…
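Running predict over the training data and counting per-cluster assignments is indeed the usual workaround. A sketch in PySpark (the question may be using the Scala API, but the RDD-based MLlib calls are parallel; the data and app name here are made up):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-cluster-sizes")
data = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.2]])

model = KMeans.train(data, k=2, maxIterations=10)

# predict each vector's cluster id, then count occurrences per id
sizes = data.map(lambda v: (model.predict(v), 1)).reduceByKey(lambda a, b: a + b)
largest = max(sizes.collect(), key=lambda kv: kv[1])   # (clusterId, count)
print(largest)
```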

How to spread out community graph made by using igraph package in R

Posted by 北慕城南 on 2019-12-19 05:47:20
Question: I am trying to find communities in tweet data. The cosine similarity between different words forms the adjacency matrix, from which I then create a graph. Visualizing that graph is the task here:

    # Document Term Matrix
    dtm = DocumentTermMatrix(tweets)
    ### adjust threshold here
    dtms = removeSparseTerms(dtm, 0.998)
    dim(dtms)
    # cosine similarity matrix
    t = as.matrix(dtms)
    # comparing two word feature vectors
    # cosine(t[,"yesterday"], t[,"yet"])
    numWords = dim(t)[2]
    # cosine…
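To pull communities visually apart, one common trick is to run a force-directed layout with heavier weights on intra-community edges, so each community contracts on itself. A sketch with python-igraph (the question uses R's igraph, but the layout API is analogous; the Zachary karate-club graph stands in for the word-similarity graph):

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")            # placeholder for the word-similarity graph
communities = g.community_multilevel()
membership = communities.membership

# weight edges inside a community 3x so Fruchterman-Reingold pulls them together
weights = [3 if membership[e.source] == membership[e.target] else 1 for e in g.es]
layout = g.layout_fruchterman_reingold(weights=weights, niter=1000)

# plotting the clustering object colors vertices by community
ig.plot(communities, layout=layout, target="communities.png")
```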

How to find cluster sizes in 2D numpy array?

Posted by 柔情痞子 on 2019-12-19 05:25:47
Question: My problem is the following: I have a 2D numpy array filled with 0s and 1s, with an absorbing boundary condition (all the outer elements are 0), for example:

    [[0 0 0 0 0 0 0 0 0 0]
     [0 0 1 0 0 0 0 0 0 0]
     [0 0 1 0 1 0 0 0 1 0]
     [0 0 0 0 0 0 1 0 1 0]
     [0 0 0 0 0 0 1 0 0 0]
     [0 0 0 0 1 0 1 0 0 0]
     [0 0 0 0 0 1 1 0 0 0]
     [0 0 0 1 0 1 0 0 0 0]
     [0 0 0 0 1 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0]]

I want to create a function that takes this array and its linear dimension L as input parameters (in this case L = 10)…
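One way to get cluster sizes, assuming the 4-connectivity typical of such lattice problems, is scipy.ndimage.label, which labels connected components in a single pass (the function name below is my own):

```python
import numpy as np
from scipy import ndimage

def cluster_sizes(grid):
    """Label 4-connected clusters of 1s and return one size per cluster."""
    labeled, num = ndimage.label(grid)   # default structuring element = 4-connectivity
    return ndimage.sum(grid, labeled, index=np.arange(1, num + 1)).astype(int)

grid = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [0, 1, 0, 0]])
print(cluster_sizes(grid))   # [2 1]
```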

Weighted Kmeans R

Posted by ▼魔方 西西 on 2019-12-19 04:10:57
Question: I want to do k-means clustering on a dataset (namely, Sample_Data) with three variables (columns), such as below:

        A   B   C
    1   12  10  1
    2   8   11  2
    3   14  10  1
    .   .   .   .
    .   .   .   .
    .   .   .   .

Typically, after scaling the columns and determining the number of clusters, I would use this function in R:

    Sample_Data <- scale(Sample_Data)
    output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)

But what if there is a preference for the variables? I mean that, suppose variable (column) A is more…
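A standard way to encode such a preference is to rescale each standardized column by the square root of its weight, since k-means' squared Euclidean distance then weights that column's contribution proportionally. A sketch in Python/scikit-learn (the same column-multiplication trick works on scale()'s output in R; the data and weights here are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)             # hypothetical stand-in for Sample_Data (A, B, C)
weights = np.array([2.0, 1.0, 1.0])    # give column A twice the influence

# multiplying a standardized column by sqrt(w) scales its contribution
# to the squared Euclidean distance by w
X_weighted = StandardScaler().fit_transform(X) * np.sqrt(weights)
km = KMeans(n_clusters=5, n_init=50, random_state=0).fit(X_weighted)
```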

k-means: Same clusters for every execution

Posted by ◇◆丶佛笑我妖孽 on 2019-12-19 03:22:49
Question: Is it possible to get the same k-means clusters for every execution on a particular data set? Just as for a random value we can use a fixed seed, is it possible to stop the randomness in clustering?

Answer 1: Yes. Use set.seed to set a seed for the random number generator before doing the clustering. Using the example in kmeans:

    set.seed(1)
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    set.seed(2)
    XX <- kmeans(x, 2)
    set.seed(2)
    YY…
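The answer above is for R; for what it's worth, the analogous idea in Python is scikit-learn's random_state parameter, which fixes the centroid initialization and therefore the resulting clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(1.0, 0.3, size=(50, 2))])

# identical random_state => identical initialization => identical clusters
a = KMeans(n_clusters=2, random_state=2, n_init=10).fit_predict(x)
b = KMeans(n_clusters=2, random_state=2, n_init=10).fit_predict(x)
assert (a == b).all()
```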

Text clustering using Scipy Hierarchy Clustering in Python

Posted by て烟熏妆下的殇ゞ on 2019-12-18 18:27:11
Question: I have a text corpus that contains 1000+ articles, each on a separate line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles. This is the code I used to do the clustering:

    # Agglomerative Clustering
    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as hac

    tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
    plt.clf()
    hac.dendrogram(tree)
    plt.show()

and I got this plot. Then I cut off the tree at the third level…
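For cutting the tree into flat clusters, scipy.cluster.hierarchy.fcluster is the usual tool. A self-contained sketch, with random data standing in for the question's document-term matrix X:

```python
import numpy as np
import scipy.cluster.hierarchy as hac

X = np.random.rand(30, 4)   # placeholder for the TF-IDF / document-term matrix
tree = hac.linkage(X, method="complete", metric="euclidean")

# cut into a fixed number of flat clusters...
labels = hac.fcluster(tree, t=5, criterion="maxclust")

# ...or cut at a cophenetic-distance threshold ("a level" of the dendrogram)
labels_by_height = hac.fcluster(tree, t=1.0, criterion="distance")
```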