k-means

ELKI Kmeans clustering Task failed error for high dimensional data

岁酱吖の submitted on 2019-12-02 12:30:32
I have 60,000 documents that I processed in gensim, producing a 60000*300 matrix, which I exported as a CSV file. When I import it into ELKI and run k-means clustering, I get the error below.

    Task failed
    de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
    Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
        at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
        at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
        at
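The `mindim=266,maxdim=300` part of the message suggests that the parser saw rows with differing numbers of columns, so no fixed-dimensional vector field was available. That reading is an assumption based only on the error text above. A minimal sketch for checking the exported CSV, using only the Python standard library (the function name `column_count_histogram` and the sample data are made up for illustration):

```python
import csv
import io
from collections import Counter

def column_count_histogram(lines, delimiter=","):
    """Map 'number of fields in a row' to 'how many rows have that many fields'."""
    reader = csv.reader(lines, delimiter=delimiter)
    # Skip completely empty lines; count the fields of every other row.
    return Counter(len(row) for row in reader if row)

# Tiny made-up sample standing in for the exported file (hypothetical data).
sample = io.StringIO("0.1,0.2,0.3\n0.4,0.5\n0.6,0.7,0.8\n")
histogram = column_count_histogram(sample)
print(histogram)
```

For a real file, passing an open file handle instead of the `StringIO` object should work the same way. More than one key in the result means the rows do not all have the same length.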

How to show total number in same coordinate in R Programming

馋奶兔 submitted on 2019-12-02 12:05:30
(Question updated 11/09/2017) This is my code for k-modes clustering in R:

    library(klaR)
    setwd("D:/kmodes")
    data.to.cluster <- read.csv('kmodes.csv', header = TRUE, sep = ';')
    cluster.results <- kmodes(data.to.cluster[,2:5], 3, iter.max = 10, weighted = FALSE)
    plot(data.to.cluster[,2:5], col = cluster.results$cluster)

The result looks like this image: http://imgur.com/a/Y46yJ
My sample data: https://drive.google.com/file/d/0B-Z58iD3By5wUzduOXUwUDh1OVU/view
Is there a way to show the total number of points at the same coordinate? I mean, when clustering, if many values share the same coordinate, such as 1,1 (x,y), could we make r

cluster labels and cluster centers (kmeans in R)

♀尐吖头ヾ submitted on 2019-12-02 11:48:13
Question: I am extremely new to R and trying to deal with a kmeans object. Ideally, I would like to take the list of cluster labels for each point in my data and replace each label with the corresponding center, ending up with a matrix where each data point is represented by the center of the cluster that kmeans placed it in. Is there a way to do this efficiently instead of going through each entry manually and replacing the cluster label with the cluster
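The replacement described in this question amounts to indexing the table of centers by the label vector. In R that is typically written as `centers[cluster, ]` on the kmeans result. As a language-neutral illustration, here is a minimal Python sketch with made-up data; the helper name `expand_centers` is hypothetical, and labels are 0-based here, whereas R's are 1-based:

```python
def expand_centers(labels, centers):
    """Return one center row per data point, selected by that point's cluster label."""
    return [list(centers[label]) for label in labels]

# Made-up example: two cluster centers and the labels of four data points.
centers = [[1.0, 2.0], [10.0, 20.0]]
labels = [0, 1, 1, 0]
fitted = expand_centers(labels, centers)
print(fitted)
```

With NumPy arrays the same lookup is the single expression `centers[labels]`.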

Compute Cost of Kmeans

人盡茶涼 submitted on 2019-12-02 11:43:04
Question: I am using this model, which was not written by me. In order to predict the centroids I had to do this:

    model = cPickle.load(open("/tmp/model_centroids_128d_pkl.lopq"))
    codes = d.map(lambda x: (x[0], model.predict_coarse(x[1])))

where `d.first()` yields this:

    (u'3768915289', array([ -86.00641097, -100.41325623, <128 coords in total>]))

and `codes.first()`:

    (u'3768915289', (5657, 7810))

How can I computeCost() of this KMeans model? After reading train_model.py, I am trying like this: In [23]:

How to assign an new observation to existing Kmeans clusters based on nearest cluster centriod logic in python?

こ雲淡風輕ζ submitted on 2019-12-02 11:13:41
I used the code below to create k-means clusters using scikit-learn.

    kmean = KMeans(n_clusters=nclusters,n_jobs=-1,random_state=2376,max_iter=1000,n_init=1000,algorithm='full',init='k-means++')
    kmean_fit = kmean.fit(clus_data)

I also saved the centroids using kmean_fit.cluster_centers_. I then pickled the k-means object:

    filename = pickle_path+'\\'+'_kmean_fit.sav'
    pickle.dump(kmean_fit, open(filename, 'wb'))

so that I can load the same pickled k-means object and apply it to new data when it arrives, using kmean_fit.predict().

Questions: Will the approach of loading kmeans pickle object and
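The "nearest cluster centroid" rule in the title can also be applied directly to saved centroids, without the pickled object. The sketch below is a plain-Python illustration with made-up centroids, not scikit-learn's implementation; the function name `nearest_centroid` is hypothetical, and Euclidean distance is assumed:

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    # Ties resolve to the lowest index because list.index returns the first match.
    return distances.index(min(distances))

# Made-up centroids standing in for the saved cluster_centers_ values.
centroids = [[0.0, 0.0], [5.0, 5.0]]
new_points = [[4.0, 6.0], [1.0, -1.0]]
assigned = [nearest_centroid(p, centroids) for p in new_points]
print(assigned)
```

Any scaling or other preprocessing applied to the training data would also need to be applied to new observations before computing distances.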

How can I prevent NAN issues?

試著忘記壹切 submitted on 2019-12-02 09:20:30
I'm getting "Mean of empty slice" runtime warnings. When I print my variables (NumPy arrays), several of them contain nan values. The runtime warning points to line 58 as the issue. What can I change to make it work? Sometimes the program runs with no issues; most times it does not.

This is a k-means-from-scratch algorithm that clusters the iris data set. It first prompts the user for the number of centroids (clusters) they want. It then randomly generates that number of clusters within the range of the numbers in the loaded text file. I have the break value in
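A common cause of this warning in hand-written k-means is a centroid that ends up with no points assigned, so the update step averages an empty set. Whether that is what happens at line 58 here is an assumption, since the code is not shown. The sketch below shows one possible guard: keeping the previous centroid when a cluster is empty. The function name `update_centroids` and the data are made up for illustration:

```python
def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of its members; keep the old one if it has none."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Empty cluster: averaging would produce nan, so reuse the old position.
            new_centroids.append(list(old))
            continue
        dims = len(old)
        new_centroids.append(
            [sum(m[d] for m in members) / len(members) for d in range(dims)]
        )
    return new_centroids

# Made-up data in which cluster 1 receives no points.
points = [[1.0, 1.0], [3.0, 3.0]]
labels = [0, 0]
centroids = [[0.0, 0.0], [9.0, 9.0]]
updated = update_centroids(points, labels, centroids)
print(updated)
```

Another option is to re-seed an empty cluster at a randomly chosen data point. Drawing the initial centroids from the data points themselves, rather than from a numeric range, also makes empty clusters less likely.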

MATLAB K-means clustering code explained

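血红的双手。 submitted on 2019-12-02 08:09:54
1. Overview

K-means clustering partitions data by minimizing the sum of within-cluster distances. MATLAB ships with a K-means implementation; the simplest call is:

    idx = kmeans(x, k)

This partitions the data in the n-by-p data matrix x into k clusters. The rows of x correspond to observations and the columns of x to dimensions. Note: when x is a vector, kmeans treats it as an n-by-1 data matrix regardless of its orientation. kmeans returns an n-by-1 vector idx containing the cluster index of each point. By default, kmeans uses squared Euclidean distance.

2. K-means parameters

A typical K-means call with parameters looks like this:

    [ ... ] = kmeans(..., 'PARAM1',val1, 'PARAM2',val2, ...)

Behavior is controlled through name-value pairs made up of param and val. Commonly used parameters are:

1 'Distance' - the distance measure, in P-dimensional space, that K-means should minimize
    'sqeuclidean' - squared Euclidean distance (the default)
    'cityblock' - sum of absolute differences, i.e. the L1 distance
    'cosine' - one minus the cosine of the angle between points
    'correlation' - one minus the sample correlation between points
    'hamming' - percentage of bits that differ
2 'Start' - the method used to choose the initial cluster centroid positions
    'plus' - the default. Selects K observations from X according to the k-means++ algorithm. The first cluster center is chosen at random from X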
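To make the 'Distance' options above concrete, the following Python sketch computes each listed measure between two points. It illustrates the definitions as stated in the list and is not MATLAB's implementation; the function names simply mirror the option names, and the inputs are assumed to be equal-length, non-constant, non-zero numeric sequences:

```python
import math

def sqeuclidean(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cityblock(a, b):
    """Sum of absolute differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """One minus the cosine of the angle between the two points."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def correlation(a, b):
    """One minus the sample correlation: the cosine distance of the mean-centered points."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

def hamming(a, b):
    """Fraction of positions at which the two points differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(sqeuclidean([0, 0], [3, 4]), cityblock([0, 0], [3, 4]))
```

The choice of measure changes what a centroid means: with 'sqeuclidean' it is the component-wise mean of the cluster, while other measures generally call for a different centroid rule.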

cluster labels and cluster centers (kmeans in R)

狂风中的少年 submitted on 2019-12-02 06:14:27
I am extremely new to R and trying to deal with a kmeans object. Ideally, I would like to take the list of cluster labels for each point in my data and replace each label with the corresponding center, ending up with a matrix where each data point is represented by the center of the cluster that kmeans placed it in. Is there a way to do this efficiently instead of going through each entry manually and replacing the cluster label with the cluster center value? Thanks! Ben

Is this what you're after? Extended from this answer: # make some data x <-

How to display the row name in K means cluster plot in R?

梦想与她 submitted on 2019-12-02 05:18:01
I am trying to plot a K-means clustering. Below is the code I use.

    library(cluster)
    library(fpc)
    data(iris)
    dat <- iris[, -5] # without known classification
    # Kmeans cluster analysis
    clus <- kmeans(dat, centers=3)
    clusplot(dat, clus$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

I get the picture below. Instead of the row numbers, I want the points labeled with a row name in characters. I understand this picture is produced from data like the below:

      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6           3

Pyspark - ValueError: could not convert string to float / invalid literal for float()

风格不统一 submitted on 2019-12-02 03:05:00
I am trying to use data from a Spark dataframe as the input for my k-means model. However, I keep getting errors. (Check the section after the code.) My Spark dataframe looks like this (and has around 1M rows):

    ID   col1  col2  Latitude  Longitude
    13   ...   ...   22.2      13.5
    62   ...   ...   21.4      13.8
    24   ...   ...   21.8      14.1
    71   ...   ...   28.9      18.0
    ...  ...   ...   ....      ....

Here is my code:

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors
    df = spark.read.csv("file.csv")
    spark_rdd = df.rdd.map(lambda row: (row["ID"], Vectors.dense(row["Latitude"],row["Longitude"])))
    feature_df = spark_rdd.toDF(["ID