cluster-analysis

K-means Plotting for 3 Dimensional Data

◇◆丶佛笑我妖孽 submitted on 2019-12-05 03:55:19

Question: I'm working with k-means in MATLAB. I am trying to create the plot/graph, but my data is a three-dimensional array. Here is my k-means code:

clc
clear all
close all
load cobat.txt;               % read the file
k=input('Enter a number: ');  % determine the number of clusters
isRand=0;                     % 0 -> sequential initialization
                              % 1 -> random initialization
[maxRow, maxCol]=size(cobat);
if maxRow<=k,
    y=[m, 1:maxRow];
elseif k>7
    h=msgbox('cant more than 7');
else    % initial value of centroid
    if isRand,
        p = randperm(size
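Independent of the MATLAB code above, the underlying algorithm is easy to sketch. Below is a minimal, deterministic Lloyd's-iteration k-means for 3-D points in Python (illustrative only, not MATLAB's kmeans; the farthest-point initialisation and all variable names are my own):

```python
import math

def kmeans3d(points, k, iters=50):
    """Plain Lloyd's algorithm for 3-D points (minimal sketch)."""
    # deterministic farthest-point initialisation
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# two well-separated 3-D blobs
pts = [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0), (0.2, 0.1, 0.1),
       (5.0, 5.0, 5.0), (5.1, 4.9, 5.2), (4.9, 5.1, 5.0)]
labels, centroids = kmeans3d(pts, k=2)
```

For the plot itself, MATLAB can colour a 3-D scatter by cluster index, e.g. scatter3(cobat(:,1), cobat(:,2), cobat(:,3), 15, idx) where idx is the index vector returned by kmeans (assuming cobat has three columns).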

Splitting data into two classes visually in matlab

自闭症网瘾萝莉.ら submitted on 2019-12-04 23:22:50

Question: I have two clusters of data; each point has x, y coordinates and a value indicating its type (1 = class 1, 2 = class 2). I have plotted the data, but I would like to split these classes with a boundary (visually). What function does such a thing? I tried contour, but it did not help!

Answer 1: Consider this classification problem (using the Iris dataset): As you can see, except for easily separable clusters for which you know the equation of the boundary beforehand, finding the boundary is not a
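The usual trick behind such boundary plots is to classify every node of a dense grid and then contour the grid where the predicted class changes. A stdlib Python sketch of that idea, using a nearest-centroid rule as a stand-in for whatever classifier you train (all names here are my own):

```python
import math

def grid_decision(points, labels, xs, ys):
    """Label every grid node with the class of its nearest class centroid.

    Contouring the resulting grid where the label changes draws the
    visual class boundary."""
    cents = {}
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        cents[lab] = tuple(sum(c) / len(members) for c in zip(*members))
    return [[min(cents, key=lambda lab: math.dist((x, y), cents[lab]))
             for x in xs] for y in ys]

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labs = [1, 1, 1, 2, 2, 2]
grid = grid_decision(pts, labs, xs=[0, 2, 4, 6], ys=[0, 2, 4, 6])
```

In MATLAB the equivalent workflow is to evaluate the classifier over a meshgrid and call contour(X, Y, Z) with a single level at the class change.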

Cluster quality measures

守給你的承諾、 submitted on 2019-12-04 23:17:49

Question: Does MATLAB provide any facility for evaluating clustering methods (cluster compactness, cluster separation, ...)? Or is there a toolbox for it?

Answer 1: Not in MATLAB, but ELKI (Java) provides a dozen or so cluster quality measures for evaluation.

Answer 2: MATLAB provides the Silhouette index, and there is a toolbox for MATLAB, CVAP: Cluster Validity Analysis Platform, which includes the following validity indexes: Davies-Bouldin, Calinski-Harabasz, Dunn index, R-squared index, Hubert-Levin (C-index)
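Of the measures named above, the silhouette index is the easiest to compute by hand. A minimal O(n²) Python sketch of the same quantity MATLAB's silhouette() reports (function names are my own):

```python
import math
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (minimal sketch)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the point's own cluster (compactness)
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        a = mean(same) if same else 0.0
        # b: mean distance to the nearest other cluster (separation)
        b = min(mean([math.dist(p, q) for q, l in zip(points, labels) if l == other])
                for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(pts, [1, 1, 2, 2])   # tight, well-separated clusters
bad = silhouette(pts, [1, 2, 1, 2])    # clusters mixed across the blobs
```

Scores near +1 mean compact, well-separated clusters; scores below 0 mean points sit closer to a foreign cluster than to their own.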

Optimal way to cluster set of strings with hamming distance [duplicate]

无人久伴 submitted on 2019-12-04 22:25:52

This question already has answers here: Fast computation of pairs with least hamming distance (1 answer); Finding minimum Hamming distance of a set of strings in Python (4 answers). I have a database with n strings (n > 1 million); each string has 100 chars, and each char is either a, b, c, or d. I would like to find the closest strings for each one, where "closest" is defined as having the smallest Hamming distance. I would like to find the k nearest strings for each one (k < 5). Example: N = 5, i1 = aacbdbbb, i2 = abcbdbbb, i3 = bbcadabd, i4 = bbcadabb; HammingDistance(i1,i2) = 1, HammingDistance(i1,i3) = 5
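As a baseline for the example above, the brute-force version is a few lines of Python; it is O(n²·L) and therefore only a correctness reference for n > 1 million (the linked duplicates discuss faster schemes):

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def k_nearest(strings, k):
    """Brute-force k nearest neighbours of each string by Hamming distance."""
    out = {}
    for s in strings:
        others = sorted((t for t in strings if t != s), key=lambda t: hamming(s, t))
        out[s] = others[:k]
    return out

data = ["aacbdbbb", "abcbdbbb", "bbcadabd", "bbcadabb"]
nn = k_nearest(data, k=1)
```

With a four-letter alphabet and fixed length 100, practical speedups include packing each string into 200 bits and using XOR plus popcount, or bucketing by substring to prune candidates.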

Grid search for hyperparameter evaluation of clustering in scikit-learn

最后都变了- submitted on 2019-12-04 22:21:33

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score, which works fine. My problem here is that I don't need the cross-validation aspect of GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own, but the ParameterSampler and ParameterGrid objects are very useful. My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler
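Looping over the grid by hand is usually simpler than subclassing BaseSearchCV. A stdlib sketch of the idea (parameter_grid mimics sklearn's ParameterGrid; the demo scorer is a made-up stand-in for silhouette_score):

```python
from itertools import product

def parameter_grid(grid):
    """Expand a dict of value lists into individual parameter dicts,
    mimicking sklearn.model_selection.ParameterGrid."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def best_params(grid, score):
    """Return the parameter combination maximising `score` (no cross-validation)."""
    return max(parameter_grid(grid), key=score)

# stand-in scorer: pretend 3 clusters with 'ward' linkage scores highest
demo_score = lambda p: (p["n_clusters"] == 3) + (p["linkage"] == "ward")
best = best_params({"n_clusters": [2, 3, 4], "linkage": ["ward", "average"]},
                   demo_score)
```

With scikit-learn available, `score` would be something like `lambda p: silhouette_score(X, Model(**p).fit_predict(X))`, and sklearn's own ParameterGrid/ParameterSampler can replace the helper above.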

Document Clustering Basics

[亡魂溺海] submitted on 2019-12-04 21:43:31

So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild... My questions stem from the process of clustering documents. Let's say I start off with a collection of documents containing only interesting words. What is the first step here? Parse the words from each document and create a giant 'bag-of-words' type model? Do I then proceed to create vectors of word counts for each document? How do I compare these documents using something like k-means clustering? Try tf-idf for starters. If
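The "tf-idf for starters" step can be sketched in a few lines: count words per document (the bag-of-words model), then down-weight words that appear in many documents. A minimal stdlib version of one common tf-idf variant (tf · ln(N/df)):

```python
import math
from collections import Counter

def tfidf(docs):
    """Bag-of-words counts weighted by inverse document frequency.

    Minimal sketch of one common tf-idf variant; real pipelines also
    tokenise, normalise, and often smooth the idf term."""
    counts = [Counter(doc.lower().split()) for doc in docs]   # term frequencies
    n = len(docs)
    df = Counter(w for c in counts for w in c)                # document frequencies
    return [{w: tf * math.log(n / df[w]) for w, tf in c.items()} for c in counts]

docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
vecs = tfidf(docs)
```

Each document becomes a sparse weight vector; feeding those vectors (usually with cosine rather than Euclidean distance) into k-means is the standard next step.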

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

烈酒焚心 submitted on 2019-12-04 17:24:37

Question: I have a data table ("norm") containing numeric, and as far as I can see normalized, values of the following form: When I execute k <- kmeans(norm, center=3) I receive the following error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1). Can you help me?  Thank you!

Answer 1: kmeans cannot handle data that has NA values. The mean and variance are then no longer well defined, and you no longer know which center is closest.

Answer 2: Error in do_one(nmeth) : NA/NaN/Inf
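The usual fix, per Answer 1, is to drop (or impute) incomplete rows before clustering; in R that is norm[complete.cases(norm), ]. The same filter as a language-neutral Python sketch (helper name is my own):

```python
import math

def complete_rows(rows):
    """Keep only rows with no NaN/None values, the usual pre-k-means clean-up
    (equivalent in spirit to R's complete.cases)."""
    def ok(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if all(ok(v) for v in r)]

data = [[1.0, 2.0], [float("nan"), 3.0], [4.0, 5.0], [6.0, None]]
clean = complete_rows(data)
```

If dropping rows loses too much data, imputing each missing value with its column mean or median is the common alternative.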

How to evaluate the best K for LDA using Mallet?

烈酒焚心 submitted on 2019-12-04 16:13:50

I am using the Mallet API to extract topics from Twitter data, and the topics I have extracted seem good. But I am having trouble estimating K. For example, I varied the K value from 10 to 100, so I obtained different numbers of topics from the data. Now I would like to estimate which K is best. Some algorithms I know of are: Perplexity, Empirical likelihood, Marginal likelihood (harmonic mean method), Silhouette. I found a method, model.estimate(), which may be used to estimate with different values of K. But I have no idea how to show which value of K is best for the
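Whichever of the listed metrics is used, the selection step is the same: score each candidate K and pick the optimum. A trivial sketch with made-up numbers (best_k is a hypothetical helper, not part of Mallet; for perplexity lower is better, for silhouette higher is better):

```python
def best_k(scores, lower_is_better=True):
    """Given {K: metric value} per candidate K, return the best K.

    Hypothetical helper; the metric values must come from your own
    evaluation runs (e.g. held-out perplexity per K)."""
    pick = min if lower_is_better else max
    return pick(scores, key=scores.get)

# made-up held-out perplexities for four candidate topic counts
perplexities = {10: 310.2, 20: 275.9, 50: 281.4, 100: 344.0}
k = best_k(perplexities)
```

Plotting the metric against K also helps: an elbow or a clear minimum/maximum is the usual visual argument for the chosen K.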

Calculate ordering of dendrogram leaves

江枫思渺然 submitted on 2019-12-04 15:03:04

I have five points and I need to create a dendrogram from them. The function 'dendrogram' can be used to find the ordering of these points, as shown below. However, I do not want to use dendrogram, as it is slow and results in errors for a large number of points (I asked this question here: Python alternate way to find dendrogram). Can someone show me how to convert the 'linkage' output (Z) to the "dendrogram(Z)['ivl']" value?

>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> x = rand(5,3)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> Z
array([[ 1.
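The leaf ordering can be recovered directly from Z without plotting: row i of a SciPy-style linkage matrix merges clusters Z[i][0] and Z[i][1] into cluster n+i, so a left-first traversal from the root yields the same left-to-right order as dendrogram(Z)['ivl'] (as integer indices rather than label strings). A stdlib sketch; note scipy.cluster.hierarchy.leaves_list computes the same thing in C:

```python
def leaf_order(Z, n):
    """Left-to-right leaf ordering of an n-observation linkage matrix Z
    (iterative sketch of the indices behind dendrogram(Z)['ivl'])."""
    order, stack = [], [n + len(Z) - 1]      # start at the root cluster id
    while stack:
        node = stack.pop()
        if node < n:                         # original observation: a leaf
            order.append(node)
        else:                                # merged cluster: row node-n of Z
            left, right = Z[node - n][0], Z[node - n][1]
            stack.extend([int(right), int(left)])  # left child is popped first
    return order

# hand-built example: merges 5={1,4}, 6={0,3}, 7={2,5}, 8={6,7} (root), n=5
Z = [[1, 4, 0.1, 2], [0, 3, 0.2, 2], [2, 5, 0.3, 3], [6, 7, 0.4, 5]]
```

This is O(n) once Z exists, so it avoids the cost of rendering the full dendrogram for large point sets.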

how do I cluster a list of geographic points by distance?

扶醉桌前 submitted on 2019-12-04 14:03:37

I have a list of points P = [p1, ..., pN] where pi = (latitude_i, longitude_i). Using Python 3, I would like to find a smallest set of clusters (disjoint subsets of P) such that every member of a cluster is within 20 km of every other member in the cluster. The distance between two points is computed using the Vincenty method. To make this a little more concrete, suppose I have a set of points such as:

from numpy import *
points = array([[33.     , 41.     ],
                [33.9693, 41.3923],
                [33.6074, 41.277 ],
                [34.4823, 41.919 ],
                [34.3702, 41.1424],
                [34.3931, 41.078 ],
                [34.2377, 41.0576],
                [34.2395, 41.0211],
                [34.4443, 41.3499],
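Finding the truly smallest such partition is a hard combinatorial problem (it is a clique-cover variant), but a greedy pass gives a valid, if not guaranteed minimal, clustering. A stdlib sketch using the haversine formula as a stand-in for Vincenty (all function names are my own):

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) pairs in degrees
    (a simpler stand-in for the Vincenty method)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def greedy_clusters(points, max_km=20.0):
    """Greedy assignment: put each point into the first cluster where it stays
    within max_km of every existing member. Valid but not guaranteed minimal."""
    clusters = []
    for p in points:
        for c in clusters:
            if all(haversine_km(p, q) <= max_km for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

pts = [(33.0, 41.0), (33.05, 41.0), (34.5, 41.9)]
clusters = greedy_clusters(pts)
```

For an exact minimum one would build a graph with an edge between every pair closer than 20 km and solve clique cover on it, which is only feasible for small N; the greedy pass above is the usual practical compromise.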