cluster-analysis

Color branches of dendrogram using an existing column

筅森魡賤 posted on 2020-01-21 08:52:27
Question: I have a data frame that I am trying to cluster, currently with hclust. The data frame has a FLAG column that I would like to color the dendrogram by. From the resulting picture, I am trying to figure out similarities among the various FLAG categories. My data frame looks something like this: FLAG ColA ColB ColC ColD. I am clustering on colA, colB, colC and colD. I would like to cluster these and color them according to FLAG categories. Ex - color red if 1, blue if 0 (I have only

How Could One Implement the K-Means++ Algorithm?

末鹿安然 posted on 2020-01-19 06:33:12
Question: I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like the original K-Means algorithm. Is the probability function used based on distance or Gaussian? At the same time, the most distant point (from the other centroids) is picked for a new centroid. I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. Also a very well
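
As a hedged illustration of the initialization step asked about above: in the standard description of K-Means++, the first centroid is drawn uniformly at random, and each later centroid is drawn with probability proportional to the squared distance from a point to its nearest already-chosen centroid. It is a weighted random draw, neither a Gaussian nor a deterministic "farthest point" rule. The sketch below assumes points are tuples of numbers and that the data contain at least k distinct points; the names `squared_distance` and `kmeans_pp_init` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids with K-Means++ style seeding.

    Assumes `points` holds at least k distinct points (hypothetical helper).
    """
    rng = rng or random.Random()
    # Step 1: the first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: for every point, squared distance to its nearest centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Step 3: draw the next centroid with probability proportional to
        # that weight; already-chosen points have weight 0.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
    print(kmeans_pp_init(data, 3, random.Random(0)))
```

After this seeding, the usual K-Means assignment and update iterations run unchanged.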

How Could One Implement the K-Means++ Algorithm?

大城市里の小女人 posted on 2020-01-19 06:32:05
Question: I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like the original K-Means algorithm. Is the probability function used based on distance or Gaussian? At the same time, the most distant point (from the other centroids) is picked for a new centroid. I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. Also a very well

How can I have R utilize more of the processing power on my PC?

落花浮王杯 posted on 2020-01-17 07:20:34
Question: R version: 3.2.4; RStudio version: 0.99.893; Windows 7; Intel i7; 480 GB RAM; str(df): 161976 obs. of 11 variables. I am a relative novice to R and do not have a software programming background. My task is to perform clustering on a data set. The variables have been scaled and centered. I am using the following code to find the optimal number of clusters: d <- dist(df, method = "euclidean"); library(cluster); pamk.best <- pamk(d); plot(pam(d, pamk.best$nc)) I have noticed that the system never uses

Clustering F/OSS Library for .NET

戏子无情 posted on 2020-01-16 19:48:30
Question: Is anyone aware of a F/OSS library for clustering algorithms? I am specifically interested in hierarchical clustering. Surely there are some libraries available that do not require writing everything from scratch. P.S. I know about NMath; it is $ware.
Answer 1: Decided to write my own. I will open-source it asap.
Answer 2: Have you tried NGrid? http://www.ohloh.net/p/ngrid And here's another, "Legion": http://www.codeproject.com/KB/silverlight/gridcomputing.aspx
Source: https://stackoverflow.com/questions/2798662/clustering-f-oss

computing z-scores for 2D matrices in scipy/numpy in Python

天涯浪子 posted on 2020-01-14 03:15:13
Question: How can I compute the z-score for matrices in Python? Suppose I have the array: a = array([[ 1, 2, 3], [ 30, 35, 36], [2000, 6000, 8000]]) and I want to compute the z-score for each row. The solution I came up with is: array([zs(item) for item in a]) where zs is in scipy.stats.stats. Is there a better built-in vectorized way to do this? Also, is it always good to z-score numbers before using hierarchical clustering with euclidean or seuclidean distance? Can anyone discuss the relative
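
One possible vectorized form of the row-wise z-score asked about above, sketched with NumPy broadcasting. The function name `row_zscore` is hypothetical, and the sketch assumes the population standard deviation (ddof=0) and that no row is constant. `scipy.stats.zscore(a, axis=1)` should give the same result with its default ddof.

```python
import numpy as np


def row_zscore(a):
    """Z-score each row of a 2D array without a Python-level loop."""
    a = np.asarray(a, dtype=float)
    # keepdims=True keeps shape (n_rows, 1) so the statistics broadcast
    # across the columns of each row.
    mean = a.mean(axis=1, keepdims=True)
    std = a.std(axis=1, keepdims=True)  # population std (ddof=0), an assumption
    return (a - mean) / std


if __name__ == "__main__":
    a = np.array([[1, 2, 3], [30, 35, 36], [2000, 6000, 8000]])
    print(row_zscore(a))
```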

How can I match up cluster labels to my 'ground truth' labels in Matlab

倾然丶 夕夏残阳落幕 posted on 2020-01-13 13:28:12
Question: I have searched here and googled, but to no avail. When clustering in Weka there is a handy option, classes to clusters, which matches up the clusters produced by the algorithm, e.g. simple k-means, to the 'ground truth' class labels you supply as the class attribute, so that we can see the cluster accuracy (% incorrect). Now, how can I achieve this in Matlab, i.e. translate my clusterClasses vector, e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1], into the same index as the supplied ground truth labels vector
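
The matching described above can be sketched, in Python rather than Matlab, as a search for the one-to-one mapping from cluster ids to class ids that maximizes agreement. This is a brute-force illustration and not necessarily the procedure Weka uses. The function name `match_cluster_labels` and the ground-truth vector in the usage lines are hypothetical, since the excerpt's own vector is cut off. Trying every permutation is only practical for a handful of clusters; for larger problems an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the contingency table would be the more usual choice.

```python
from itertools import permutations


def match_cluster_labels(cluster_labels, true_labels):
    """Relabel clusters with the class ids that agree best with the truth.

    Tries every one-to-one mapping from cluster ids to class ids and keeps
    the one with the most matches. Returns (relabeled, accuracy).
    """
    cluster_ids = sorted(set(cluster_labels))
    class_ids = sorted(set(true_labels))
    if len(cluster_ids) > len(class_ids):
        raise ValueError("more clusters than classes: no one-to-one mapping")
    best_mapping, best_hits = None, -1
    for perm in permutations(class_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_labels, true_labels))
        if hits > best_hits:
            best_mapping, best_hits = mapping, hits
    relabeled = [best_mapping[c] for c in cluster_labels]
    return relabeled, best_hits / len(true_labels)


if __name__ == "__main__":
    clusters = [1, 1, 2, 1, 3, 2, 3, 1, 1, 1]
    truth = [2, 2, 3, 2, 1, 3, 1, 2, 2, 3]  # hypothetical ground truth
    print(match_cluster_labels(clusters, truth))
```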

Calculate ordering of dendrogram leaves

此生再无相见时 posted on 2020-01-13 06:01:08
Question: I have five points and I need to create a dendrogram from them. The function 'dendrogram' can be used to find the ordering of these points, as shown below. However, I do not want to use dendrogram, as it is slow and results in an error for a large number of points (I asked this question here: Python alternate way to find dendrogram). Can someone point me to how to convert the 'linkage' output (Z) to the "dendrogram(Z)['ivl']" value? >>> from hcluster import pdist, linkage, dendrogram >>> import numpy >>>
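
A minimal sketch of recovering the leaf order straight from a linkage matrix, assuming the usual SciPy/hcluster layout: row i of Z merges nodes Z[i][0] and Z[i][1] into a new node with id n + i, where n is the number of original points. The function name `leaf_order` is hypothetical. It walks the tree iteratively, left child first, so it avoids both plotting and deep recursion. The result should correspond to what `scipy.cluster.hierarchy.leaves_list(Z)` returns, and 'ivl' would be these indices (or their labels) as strings, but that correspondence is an assumption worth checking against your library version.

```python
def leaf_order(Z):
    """Return leaf indices in left-to-right order for a linkage matrix Z.

    Z is a sequence of rows [left_id, right_id, distance, count].
    """
    n = len(Z) + 1          # number of original observations
    order = []
    stack = [2 * n - 2]     # id of the root node (last merge)
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)          # original observation: a leaf
        else:
            left, right = int(Z[node - n][0]), int(Z[node - n][1])
            stack.append(right)         # pushed first, visited second
            stack.append(left)          # pushed last, visited first
    return order


if __name__ == "__main__":
    # Hand-made linkage for five points (illustrative values only).
    Z = [[0, 1, 1.0, 2], [3, 4, 1.5, 2], [2, 5, 2.0, 3], [6, 7, 4.0, 5]]
    print(leaf_order(Z))
    print([str(i) for i in leaf_order(Z)])  # string labels, like 'ivl'
```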

clustering with NA values in R

坚强是说给别人听的谎言 posted on 2020-01-12 13:57:17
Question: I was surprised to find out that clara from library(cluster) allows NAs, but the function documentation says nothing about how it handles these values. So my questions are: How does clara handle NAs? Can this somehow be used for kmeans (NAs not allowed)? [Update] So I did find lines of code in the clara function: inax <- is.na(x); valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE))); x[inax] <- valmisdat which replace missing values with valmisdat. I am not sure I understand the reason for using such a formula.
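
To make the three quoted R lines concrete, here is a rough NumPy translation. It reproduces only the replacement step shown in the excerpt; the function name `replace_missing_with_sentinel` is hypothetical. Because 1.1 times the largest absolute observed value lies strictly outside the observed range (for data that are not all zero), the value cannot collide with real data. A plausible reading, offered as an assumption rather than a confirmed description of clara's internals, is that it serves as a marker the downstream code can recognize as "missing" rather than as an imputed value.

```python
import numpy as np


def replace_missing_with_sentinel(x):
    """Mirror the quoted R snippet: fill NaNs with 1.1 * max(|observed|)."""
    x = np.array(x, dtype=float)               # copy, so the input is untouched
    missing = np.isnan(x)                      # R: inax <- is.na(x)
    sentinel = 1.1 * np.nanmax(np.abs(x))      # R: 1.1 * max(abs(range(x, na.rm = TRUE)))
    x[missing] = sentinel                      # R: x[inax] <- valmisdat
    return x, missing, sentinel


if __name__ == "__main__":
    data = [[1.0, np.nan], [-4.0, 2.0]]
    filled, mask, value = replace_missing_with_sentinel(data)
    print(filled)
    print(mask)
    print(value)
```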

clustering with NA values in R

落花浮王杯 posted on 2020-01-12 13:57:15
Question: I was surprised to find out that clara from library(cluster) allows NAs, but the function documentation says nothing about how it handles these values. So my questions are: How does clara handle NAs? Can this somehow be used for kmeans (NAs not allowed)? [Update] So I did find lines of code in the clara function: inax <- is.na(x); valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE))); x[inax] <- valmisdat which replace missing values with valmisdat. I am not sure I understand the reason for using such a formula.