k-means

How to do column wise intersection with itertools

爱⌒轻易说出口 · submitted 2019-12-12 01:47:27
Question: When I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m × m) similarity matrix, I get a different outcome for the matrix. I have identified the source of the problem but do not possess an optimized solution for it. A sample of the dataset:

ID    AGE    Occupation  Gender  Product_range  Product_cat  Product
1100  25-34  IT          M       50-60          Gaming       XPS 6610
1101  35-44  …
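The pairwise computation the question describes can be sketched in plain Python (the helper names are mine, not the asker's code). Each row is treated as a set of (feature, value) pairs so that equal values appearing in *different* features are not conflated:

```python
def jaccard(row_a, row_b):
    """Jaccard similarity of two equal-length categorical rows."""
    a = set(enumerate(row_a))   # (feature_index, value) pairs
    b = set(enumerate(row_b))
    return len(a & b) / len(a | b)

def similarity_matrix(rows):
    """m x m matrix of pairwise Jaccard similarities."""
    m = len(rows)
    return [[jaccard(rows[i], rows[j]) for j in range(m)] for i in range(m)]

rows = [
    ["25-34", "IT", "M", "50-60", "Gaming", "XPS 6610"],
    ["35-44", "IT", "F", "50-60", "Gaming", "XPS 6610"],
]
S = similarity_matrix(rows)
# Diagonal entries are 1.0; the two sample rows share 4 of their
# 8 distinct (feature, value) pairs, so S[0][1] == 0.5.
```

Indexing the values by feature is what keeps the matrix stable: without it, the same string in two different columns would silently inflate the intersection.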

How to find cluster centers of Spark's StreamingKMeans?

不想你离开。 · submitted 2019-12-12 01:37:24
Question: When I use Spark's KMeansModel class, I can easily access the centroids of my model's clusters using the KMeansModel.clusterCenters() function. I wanted to use StreamingKMeans, but I noticed that it seems to lack a clusterCenters() function. Is there a way to obtain the centroids of my model's clusters in StreamingKMeans?

Answer 1: In batch KMeans, an estimator is trained once and produces a single transformer, the model, which contains the clusterCenters() method. In StreamingKMeans, a model is…
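In Spark itself the current snapshot is exposed via StreamingKMeans.latestModel(), whose result does carry clusterCenters. To make the distinction concrete, here is a pure-Python sketch (not Spark's code, names are mine) of the kind of incremental update such a streaming model performs, after which "the latest model" is simply the running centers:

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def update(centers, counts, batch, decay=1.0):
    """Fold one mini-batch into the running cluster centers.

    Each cluster keeps a (decayed) count of points seen so far, so the
    step size shrinks as a cluster accumulates evidence.
    """
    counts = [c * decay for c in counts]
    for point in batch:
        i = nearest(point, centers)
        counts[i] += 1
        lr = 1.0 / counts[i]  # shrinking per-cluster step size
        centers[i] = [c + lr * (p - c) for p, c in zip(point, centers[i])]
    return centers, counts

centers, counts = [[0.0], [10.0]], [1.0, 1.0]
centers, counts = update(centers, counts, [[0.2], [9.8], [0.0], [10.0]])
# `centers` now plays the role of latestModel().clusterCenters.
```

The point of the batch-vs-streaming contrast in the answer is exactly this: there is no single frozen model to query, only the current state of the update loop.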

How to determine the K value for k-means algorithm? [duplicate]

怎甘沉沦 · submitted 2019-12-12 00:38:31
Question: This question already has answers here (closed 7 years ago). Possible duplicate: How do I determine k when using k-means clustering? How can we determine the value of K (the number of clusters) for the k-means algorithm?

Answer 1: Sometimes. There are various methods, which usually require trying different values of k and measuring which worked best. Here are some duplicate questions you missed: How to optimal K in K - Means Algorithm; K-Means Algorithm; Kmeans without knowing the number of…
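One of the standard "try several k and measure" recipes is the elbow method: run k-means for each candidate k and watch where the within-cluster SSE stops dropping sharply. A self-contained sketch with a tiny Lloyd's-iteration k-means (my code, for illustration only):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, within-cluster SSE)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[j].append(p)
        # recompute centers (keep the old center for an empty cluster)
        for i, g in enumerate(groups):
            if g:
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points)
    return centers, sse

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0), (9.2, 0.1)]
sse_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3, 4)}
# The k after which SSE stops improving much is the "elbow" candidate.
```

SSE always trends downward as k grows, so the absolute value is not the signal; the bend in the curve is.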

Associating region index with true labels

不问归期 · submitted 2019-12-11 17:49:43
Question: The documentation is somewhat vague about this, whereas I would have thought it would be a pretty straightforward thing to implement. The k-means algorithm applied to the MNIST digit dataset outputs 10 regions, each with a certain number associated with it, though that number isn't the digit represented by most of the samples contained within the region. I do have my ground-truth label table. How do I make it so that each region generated by k-means ends up being labeled as the digit which has the…
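The usual fix is a majority vote: map each k-means region to the most frequent ground-truth label among its members. A minimal sketch (helper names are mine):

```python
from collections import Counter

def majority_relabel(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label inside it."""
    votes = {}
    for c, y in zip(cluster_ids, true_labels):
        votes.setdefault(c, Counter())[y] += 1
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items()}

# toy example: 3 regions whose members carry ground-truth digits
cluster_ids = [0, 0, 0, 1, 1, 2]
true_labels = [7, 7, 1, 3, 3, 9]
mapping = majority_relabel(cluster_ids, true_labels)
relabeled = [mapping[c] for c in cluster_ids]
# mapping == {0: 7, 1: 3, 2: 9}
```

Note that two regions can vote for the same digit (k-means knows nothing about labels), so the mapping is not guaranteed to be a bijection over the 10 digits.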

K-Means: choosing the initial centroids

筅森魡賤 · submitted 2019-12-11 16:01:47
1. Random selection. The most basic way to pick initial centroids is at random, but this can lead to a local optimum: a relatively large cluster gets split while two smaller clusters get merged. Because K-Means is unstable (different initial centroids yield different results), one way around local optima is to run the algorithm several times and keep the run with the smallest SSE as the final solution, solving the random-initialization problem by repeated trials. The methods below instead try to find good initial centroids directly.

2. Hierarchical clustering. Run hierarchical clustering, cut it into k clusters, and use each cluster's centroid as an initial centroid for K-Means. This largely avoids unreasonable initial assignments, though it has its own limitations.

3. K-Means++. K-Means++ improves on the basic algorithm only in how the initial centroids are chosen. The first centroid is picked at random; each subsequent one is chosen based on a sample's distance to its nearest already-chosen centroid (the larger the distance, the more likely the sample is picked) until k centroids have been chosen. This effectively solves the initialization problem and has become a standard for hard clustering algorithms, but it cannot handle outliers.

4. Nearest-neighbor density. This method decides the next centroid by examining each sample's local density together with its separation from the centroids chosen so far.

Link: https://www.jianshu.com/p/4f8c097e26a8
Source: https://www.cnblogs.com/yoyowin/p/12022776.html
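The K-Means++ seeding described above can be sketched in a few lines of Python (1-D points for brevity; note the canonical formulation weights by *squared* distance to the nearest chosen centroid):

```python
import random

def kmeanspp_init(points, k, rng=None):
    """Pick k seeds: first uniform, then proportional to squared distance
    to the nearest seed chosen so far (K-Means++ seeding sketch)."""
    rng = rng or random.Random(42)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:        # weighted roulette-wheel selection
                centers.append(p)
                break
    return centers

points = [0.0, 0.1, 0.2, 10.0, 10.1, 20.0]
centers = kmeanspp_init(points, 3)
# Far-apart points carry most of the weight, so the three seeds tend to
# land one per separated group.
```

Already-chosen points have weight zero, which is what keeps the seeds spread out, and also why a single extreme outlier can capture a seed (the limitation noted in point 3).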

Matlab k-means cosine assigns everything to one cluster

笑着哭i · submitted 2019-12-11 11:09:14
Question: I'm using Matlab's regular kmeans algorithm with 'Distance','cosine','EmptyAction','drop' on an L2-normalized feature matrix, and I have a problem: the output Matlab generates simply assigns EVERY data point to cluster 1.00000, even when k=20, and all the centroids in C are NaN. Does anyone have suggestions as to what might be causing this? The layout of the matrix is ([0,1,...,1,0,1],[...],[0,1,...,1,0,1]). I've done the L2-normalization using Python's numpy.linalg.norm before I…
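A frequent culprit for all-NaN centroids under cosine distance is an all-zero row: its norm is 0, so both normalization and the cosine itself divide by zero, and the NaN then propagates through every centroid update. A hedged pre-flight check (my sketch, independent of Matlab):

```python
import math

def l2_normalize(rows, eps=1e-12):
    """L2-normalize rows, reporting the indices of (near-)zero rows."""
    zero_rows, out = [], []
    for i, row in enumerate(rows):
        norm = math.sqrt(sum(x * x for x in row))
        if norm < eps:
            zero_rows.append(i)
            out.append(list(row))   # left as-is; caller should drop it
        else:
            out.append([x / norm for x in row])
    return out, zero_rows

rows = [[0, 1, 1], [0, 0, 0], [1, 0, 1]]
normalized, zero_rows = l2_normalize(rows)
# zero_rows == [1]: row 1 would make every cosine distance NaN.
```

Running a check like this on the binary matrix before exporting it to Matlab would confirm or rule out the zero-row explanation.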

Clustering of images to evaluate diversity (Weka?)

五迷三道 · submitted 2019-12-11 10:43:24
Question: For a university course I have some features of images (as text files), and I have to rank those images according to their diversity. The idea I have in mind is to feed a k-means classifier with the images and then compute the Euclidean distance from each image within a cluster to the cluster's centroid. Then rotate between clusters, always taking the (next) closest image to the centroid: i.e., return the closest to centroid 1, then the closest to centroid 2, then 3, ..., then the second closest to…
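The round-robin scheme described above can be sketched directly (helper names and toy data are mine): sort each cluster's members by distance to its centroid, then interleave the per-cluster queues.

```python
import math

def diversity_rank(items, cluster_ids, centroids):
    """items: {id: feature vector}; returns ids in round-robin order,
    nearest-to-centroid first within each cluster."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # one queue per cluster, sorted by distance to that cluster's centroid
    queues = {c: sorted((i for i in items if cluster_ids[i] == c),
                        key=lambda i: dist(items[i], centroids[c]))
              for c in centroids}
    order = []
    while any(queues.values()):
        for c in sorted(queues):
            if queues[c]:
                order.append(queues[c].pop(0))
    return order

items = {"a": [0.0], "b": [0.5], "c": [5.0], "d": [5.9]}
cluster_ids = {"a": 0, "b": 0, "c": 1, "d": 1}
centroids = {0: [0.2], 1: [5.2]}
ranking = diversity_rank(items, cluster_ids, centroids)
# ranking == ["a", "c", "b", "d"]: clusters alternate, nearest first.
```

Whether "closest to the centroid first" is the right notion of diversity is a design choice; taking the *farthest* first would instead front-load the most atypical images.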

Cut-off point in k-means clustering in SAS

好久不见. · submitted 2019-12-11 08:57:51
Question: I want to classify my data into clusters with a cut-off point in SAS. The method I use is k-means clustering (I don't mind about the method, as long as it gives me 3 groups). My code for clustering:

proc fastclus data=maindat outseed=seeds1 maxcluster=3 maxiter=0;
  var value resid;
run;

I have a problem with the output: I want the cut-off point for Value to be included in the output file (I don't want the cut-off point for Resid). So is there any way to do this in SAS? Edit: As…
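The cut-offs are derivable from the cluster means that FASTCLUS already writes to the OUTSEED= dataset. If we assume the clusters separate along the single Value variable, the nearest-center boundary between two adjacent clusters is just the midpoint of their means, so k clusters yield k-1 cut-off points. A post-processing sketch (not SAS, and the example means are hypothetical):

```python
def cutoffs(cluster_means):
    """Cut-off points between adjacent clusters along one variable:
    midpoints of the sorted cluster means (nearest-center boundaries)."""
    means = sorted(cluster_means)
    return [(a + b) / 2 for a, b in zip(means, means[1:])]

value_means = [12.0, 30.0, 80.0]   # hypothetical Value means from seeds1
cuts = cutoffs(value_means)
# cuts == [21.0, 55.0]: Value below 21 falls in the first cluster,
# between 21 and 55 in the second, above 55 in the third.
```

The same arithmetic could be done inside SAS with a DATA step over the OUTSEED= dataset; with both Value and Resid in the clustering, though, the true boundaries are two-dimensional and the one-variable midpoints are only an approximation.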

K-means clustering using a Jaccard distance matrix

这一生的挚爱 · submitted 2019-12-11 07:14:39
Question: I'm trying to create a Jaccard distance matrix and perform k-means on it to give out the cluster IDs and the IDs of the elements in each cluster. The input is Twitter tweets. The following is the code, and I couldn't understand how to use initial seeds from a file for kmeans:

install.packages("rjson", dependencies=TRUE)
library("rjson")
install.packages("jsonlite", dependencies=TRUE)
library("jsonlite")
install.packages("stringdist", dependencies=TRUE)
library("stringdist")
data <- fromJSON…
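For reference, the distance-matrix half of the pipeline looks like this in plain Python (the R code above is cut off; names here are mine). One caveat worth stating: k-means proper wants raw feature vectors, not a precomputed distance matrix, so a matrix like this is really food for a k-medoids/PAM-style method.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - Jaccard similarity of the word sets of two tweets."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(a & b) / len(a | b)

def distance_matrix(tweets):
    n = len(tweets)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = jaccard_distance(tweets[i], tweets[j])
    return d

tweets = ["the cat sat", "the cat ran", "stocks fell today"]
D = distance_matrix(tweets)
# D[0][1] == 0.5 (2 shared words out of 4); D[0][2] == 1.0 (none shared).
```

Initial seeds, in this framing, are just a list of row indices read from a file and used as the starting medoids.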

k* reproduction values?

試著忘記壹切 · submitted 2019-12-11 05:26:53
Question: I am reading about Product Quantization, from section II.A, page 3 of PQ for NNS, which says: "...all subquantizers have the same finite number k* of reproduction values. In that case the number of centroids is (k*)^m", where m is the number of subvectors. However, I do not get k* at all! I mean, in vector quantization we assign every vector to one of k centroids; in product quantization, we assign every subvector to one of k centroids. How did k* come into play?

Answer 1: I think k* is the number of centroids in each…
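The counting argument is easy to see concretely (a sketch with hypothetical 1-D centroids, not code from the paper): each of the m subquantizers has its own small codebook of k* centroids, and the implicit codebook for the full vector is every combination of one centroid per subspace, hence (k*)^m in total.

```python
from itertools import product

def implicit_codebook(sub_codebooks):
    """Cartesian product of the per-subquantizer centroid lists:
    one full-space centroid per combination of sub-centroids."""
    return [sum(parts, []) for parts in product(*sub_codebooks)]

k_star, m = 4, 2
# hypothetical 1-D centroids for each of the m subquantizers
sub_codebooks = [[[float(i)] for i in range(k_star)] for _ in range(m)]
full = implicit_codebook(sub_codebooks)
# len(full) == k_star ** m == 16, although training only ever fit
# m small codebooks of k_star centroids each.
```

That asymmetry (store m·k* centroids, represent (k*)^m of them) is the whole point of product quantization.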