k-means

Sklearn Kmeans parameter confusion?

早过忘川 submitted on 2019-12-19 07:22:01
Question: I can run sklearn KMeans as follows:

kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=500)

But I'm a little confused about what the parameters mean. The documentation for n_init says: "Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia." And max_iter says: "Maximum number of iterations of the k-means algorithm for a single run." But I don't completely understand what that means. Is …
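
The relationship between the two parameters can be sketched in plain Python (a toy 1-D k-means, not sklearn's implementation): each of n_init runs starts from a different random centroid seed and performs at most max_iter assign/update sweeps, and the run with the lowest inertia wins.

```python
import random

def kmeans_once(points, k, max_iter, rng):
    """One k-means run: at most max_iter assign/update sweeps."""
    centers = rng.sample(points, k)              # random centroid seed
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:                       # converged before max_iter
            break
        centers = new
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

def kmeans(points, k, n_init, max_iter, seed=0):
    """n_init independent runs; keep the one with the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, max_iter, rng) for _ in range(n_init)]
    return min(runs, key=lambda r: r[1])

centers, inertia = kmeans([1.0, 1.1, 5.0, 5.2, 9.0, 9.1],
                          k=3, n_init=10, max_iter=500)
```

So max_iter bounds how long one Lloyd's loop may run, while n_init controls how many independent restarts are tried to escape bad local minima.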

Weighted Kmeans R

▼魔方 西西 submitted on 2019-12-19 04:10:57
Question: I want to do a k-means clustering on a dataset (namely, Sample_Data) with three variables (columns), such as below:

    A  B  C
1  12 10  1
2   8 11  2
3  14 10  1
.   .  .  .

Typically, after scaling the columns and determining the number of clusters, I would use this function in R:

Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)

But what if there is a preference among the variables? I mean, suppose variable (column) A is more …
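
One common way to encode such a preference (a pre-processing trick, not a feature of R's kmeans itself) is to multiply each standardized column by the square root of its weight before clustering; squared Euclidean distance then counts that column in proportion to the weight. A minimal sketch with made-up values:

```python
import math

def apply_weights(rows, weights):
    """Scale column j by sqrt(weights[j]) so that squared Euclidean
    distance on the result equals the weighted squared distance."""
    return [[x * math.sqrt(w) for x, w in zip(row, weights)] for row in rows]

rows = [[12.0, 10.0, 1.0],
        [8.0, 11.0, 2.0],
        [14.0, 10.0, 1.0]]
weighted = apply_weights(rows, [4.0, 1.0, 1.0])  # column A counts 4x
```

Any ordinary k-means run on `weighted` then behaves like weighted k-means on the original rows.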

k-means: Same clusters for every execution

◇◆丶佛笑我妖孽 submitted on 2019-12-19 03:22:49
Question: Is it possible to get the same k-means clusters on every execution for a particular data set? Just as we can use a fixed seed for a random value — is it possible to remove the randomness from clustering? Answer 1: Yes. Use set.seed to set a seed for the random number generator before doing the clustering. Using the example in ?kmeans:

set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
set.seed(2)
XX <- kmeans(x, 2)
set.seed(2)
YY …
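
The same idea in Python terms (a sketch, not the R code above): fix the seed of the random number generator immediately before the randomized step, and the random initialization — and therefore the clustering built on it — repeats exactly.

```python
import random

def random_centroids(points, k, seed):
    """Pick k initial centroids; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(points, k)

pts = [0.1, 0.3, 1.0, 1.2, 2.5, 2.6]
a = random_centroids(pts, 2, seed=2)
b = random_centroids(pts, 2, seed=2)   # same seed, identical centroids
```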

K-means with really large matrix

落爺英雄遲暮 submitted on 2019-12-18 15:48:33
Question: I have to perform a k-means clustering on a really huge matrix (about 300,000 x 100,000 values, which is more than 100 GB). I want to know if I can use R or Weka to perform this. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space. I have enough space for the calculations, but loading such a matrix seems to be a problem with R (I don't think the bigmemory package would help me, and a big matrix automatically uses all my RAM, then my swap file if that is not enough …
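
One way around the memory limit is the mini-batch / streaming variant of k-means: read the matrix in chunks and nudge the nearest centroid toward each point with a per-centroid learning rate, so only one chunk is ever in memory. A toy 1-D sketch of that update rule (the chunk values here are illustrative, not the asker's data):

```python
def update_centroids(centroids, counts, chunk):
    """Mini-batch style update: each point pulls its nearest centroid
    toward itself by a step of 1/count (the running per-centroid rate)."""
    for p in chunk:
        i = min(range(len(centroids)), key=lambda c: (p - centroids[c]) ** 2)
        counts[i] += 1
        eta = 1.0 / counts[i]
        centroids[i] += eta * (p - centroids[i])
    return centroids, counts

centroids, counts = [0.0, 10.0], [0, 0]
for chunk in ([0.9, 1.1, 9.8], [1.0, 10.2, 10.0]):   # chunks read from disk
    centroids, counts = update_centroids(centroids, counts, chunk)
```

Since each chunk is processed and discarded, the full matrix never has to fit in RAM.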

Spark MLLib Kmeans from dataframe, and back again

若如初见. submitted on 2019-12-18 11:56:32
Question: I aim to apply a k-means clustering algorithm to a very large data set using Spark (1.3.1) MLlib. I have pulled the data from HDFS using a hiveContext from Spark, and would eventually like to put it back there the same way, in this format:

|I.D  |cluster |
===================
|546  |2       |
|6534 |4       |
|236  |5       |
|875  |2       |

I have run the following code, where "data" is a dataframe of doubles with an ID in the first column:

val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2)) …
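
The shape of the answer can be sketched in plain Python rather than Spark (the values and centroids below are illustrative, not the asker's data): keep the ID out of the feature vector, predict a cluster for each feature vector, then zip the IDs back onto the predicted labels to rebuild the two-column table.

```python
def predict(features, centroids):
    """Index of the nearest centroid by squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(features, centroids[i])))

# (id, feature-vector) pairs; the ID is carried alongside, never clustered
rows = [(546, [1.0, 2.0]), (6534, [8.0, 9.0]), (236, [1.2, 2.1])]
centroids = [[1.0, 2.0], [8.0, 9.0]]            # from a fitted model

result = [(rid, predict(f, centroids)) for rid, f in rows]
```

In Spark the same pattern is a map over the RDD that emits (id, model-prediction) pairs, which can then be converted back to a dataframe and written to HDFS.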

How can I perform K-means clustering on time series data?

六月ゝ 毕业季﹏ submitted on 2019-12-18 10:37:17
Question: How can I do k-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster time series of shape 1 x M, where M is the length of each series. In particular, I'm not sure how to update the mean of a cluster for time series data. I have a set of labelled time series, and I want to use the k-means algorithm to check whether I get back similar labels. My X matrix will be N x M, where N is the number of time series and M is …
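
On the mean-update question: if each series is treated as a single M-dimensional point (which is what k-means with Euclidean distance does), the cluster mean is simply the element-wise average of the member series — itself a series of length M. A minimal sketch:

```python
def cluster_mean(series_list):
    """Element-wise mean of a cluster of equal-length time series."""
    m = len(series_list[0])
    n = len(series_list)
    return [sum(s[t] for s in series_list) / n for t in range(m)]

mean = cluster_mean([[1.0, 2.0, 3.0],
                     [3.0, 2.0, 1.0]])
```

Note this assumes the series are aligned and of equal length; with warping or misalignment a different distance (e.g. DTW) and a matching averaging scheme would be needed.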

R kmeans initialization

流过昼夜 submitted on 2019-12-18 04:29:09
Question: In the R programming environment, I am currently using the standard implementation of the kmeans algorithm (see help(kmeans)). It appears that I cannot initialize the starting centroids. I tell the kmeans algorithm to give me 4 clusters, and I would like to pass in the coordinates of the starting centroids. Is there an implementation of kmeans that allows me to pass initial centroid coordinates? Answer 1: Yes. The implementation you mention allows you to specify starting positions. You pass …
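
In R the `centers` argument of kmeans accepts either a cluster count or a matrix of initial centroids (one row per centroid). The effect can be sketched in Python with a toy 1-D Lloyd's loop started from user-supplied centers:

```python
def kmeans_from(points, centers, iters=10):
    """Lloyd's iterations started from explicit initial centers
    (the 1-D analogue of kmeans(x, centers = my_matrix) in R)."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
            groups[i].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

centers = kmeans_from([1.0, 1.2, 5.0, 5.4], centers=[0.0, 6.0])
```

Fixing the starting centers also removes the randomness, so repeated runs give identical clusterings.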

How to know which cluster new data belongs to after finishing cluster analysis

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-17 20:39:02
Question: After finishing a cluster analysis, when I input some new data, how do I know which cluster the data belongs to?

data(freeny)
library(RSNNS)
options(digits = 2)
year <- as.integer(rownames(freeny))
freeny <- cbind(freeny, year)
freeny = freeny[sample(1:nrow(freeny), length(1:nrow(freeny))), 1:ncol(freeny)]
freenyValues = freeny[, 1:5]
freenyTargets = decodeClassLabels(freeny[, 6])
freeny = splitForTrainingAndTest(freenyValues, freenyTargets, ratio = 0.15)
km <- kmeans(freeny$inputsTrain, 10, iter.max = 100)
kclust …
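
The usual answer, assuming plain k-means with Euclidean distance was used: a new point belongs to the cluster whose centre (km$centers in R) is nearest to it. A minimal sketch with illustrative centres:

```python
def assign(new_point, centers):
    """Index of the nearest cluster centre by squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(new_point, centers[i])))

centers = [[0.0, 0.0], [5.0, 5.0]]   # the fitted model's centroids
label = assign([4.6, 5.3], centers)
```

The new data must of course be scaled with the same parameters that were applied to the training data before clustering.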

Calculating the percentage of variance measure for k-means?

我的梦境 submitted on 2019-12-17 17:27:07
Question: On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method in scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters add much information (explain a lot of variance), but at some point the marginal gain drops, giving an angle in …
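
The quantity being graphed can be sketched directly: with the total sum of squares fixed by the data, the variance "explained" by a clustering is the between-cluster sum of squares over the total, i.e. one minus within/total (a toy 1-D version; scipy's `kmeans` distortion is the mean distance from the observations to their nearest centroid, a closely related quantity).

```python
def explained_variance(points, labels, centers):
    """Fraction of total sum of squares explained by the clustering:
    (total SS - within-cluster SS) / total SS."""
    grand = sum(points) / len(points)
    total = sum((p - grand) ** 2 for p in points)
    within = sum((p - centers[l]) ** 2 for p, l in zip(points, labels))
    return (total - within) / total

pts = [1.0, 1.2, 5.0, 5.4]
ratio = explained_variance(pts, [0, 0, 1, 1], [1.1, 5.2])
```

Plotting this ratio against k and looking for the point where the curve flattens gives the elbow.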

How to choose the optimal K in the k-means algorithm [duplicate]

前提是你 submitted on 2019-12-17 10:00:18
Question: This question already has answers here (closed 8 years ago). Possible duplicate: How do I determine k when using k-means clustering? How can I choose K initially if I do not know anything about the data? Can someone help me choose K? Thanks, Navin. Answer 1: The basic idea is to evaluate a clustering score on sample data; usually this is distance inside clusters versus distance between clusters. The better this measure, the better the clustering, and based on this measure you can select the best clustering parameters. One …
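
One concrete score of the kind described (an illustrative choice, not the answerer's exact formula): average distance of points to their own centroid, divided by the average distance between centroids — lower is better, and comparing it across candidate K values gives a way to pick K.

```python
def score(points, labels, centers):
    """Within/between ratio for a 1-D clustering: mean distance to own
    centroid over mean inter-centroid distance (lower is better)."""
    within = sum(abs(p - centers[l]) for p, l in zip(points, labels)) / len(points)
    pairs = [(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]
    between = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return within / between

pts = [1.0, 1.2, 5.0, 5.4]
good = score(pts, [0, 0, 1, 1], [1.1, 5.2])          # clean split
bad = score(pts, [0, 1, 1, 1], [1.0, 11.6 / 3])      # lopsided split
```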