cluster-analysis

PySpark ML: Get KMeans cluster statistics

扶醉桌前 submitted on 2019-12-08 03:54:29
Question: I have built a KMeansModel. My results are stored in a PySpark DataFrame called transformed. (a) How do I interpret the contents of transformed? (b) How do I create one or more Pandas DataFrames from transformed that show summary statistics for each of the 13 features for each of the 14 clusters?

    from pyspark.ml.clustering import KMeans

    # Trains a k-means model.
    kmeans = KMeans().setK(14).setSeed(1)
    model = kmeans.fit(X_spark_scaled)
    # Fits a model to the input dataset with optional
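For part (b) of the question above, one possible approach is sketched here. It is not taken from the original thread. It assumes the Spark DataFrame has already been converted with transformed.toPandas(), that each feature sits in its own numeric column, and that the cluster label is in a column named prediction (the default output column of pyspark.ml KMeans). The small frame and the column names f1 and f2 are hypothetical stand-ins for the real 13 features.

```python
import pandas as pd

# Hypothetical stand-in for transformed.toPandas(); f1 and f2 are made-up feature names.
pdf = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 10.0, 12.0],
    "f2": [0.5, 1.5, 2.5, 8.0, 9.0],
    "prediction": [0, 0, 0, 1, 1],  # cluster label assigned by the model
})

# Every column except the cluster label is treated as a feature.
feature_cols = [c for c in pdf.columns if c != "prediction"]

# One row per cluster, one (feature, statistic) column pair per feature.
summary = pdf.groupby("prediction")[feature_cols].agg(
    ["mean", "std", "min", "max", "count"]
)
print(summary)
```

If the features are stored in a single assembled vector column instead, they would first need to be expanded into separate columns before this grouping applies.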

Visualize data and clustering [closed]

半腔热情 submitted on 2019-12-08 02:38:00
Question: I am currently writing a Python script to find the similarity between documents. I have already calculated the similarity score for each document pair and stored the scores in dictionaries. It looks something like this: {(8328, 8327): 1.0, (8313, 8306): 0.12405229825691289, (8329, 8328): 1.0, (8322, 8321): 0

Show rows on clustered kmeans data

二次信任 submitted on 2019-12-08 01:50:26
Question: Hi, I was wondering: when you plot clustered data in a figure window, is there a way to show which rows the data points belong to when you hover over them? From the picture above, I was hoping there would be a way to select or hover over a point and see which row it belongs to. Here is the code:

    %% dimensionality reduction
    columns = 6
    [U,S,V]=svds(fulldata,columns);
    %% randomly select dataset
    rows = 1000;
    columns = 6;
    %# pick random rows
    indX = randperm( size(fulldata,1) );

Exporting result from kml package in R

て烟熏妆下的殇ゞ submitted on 2019-12-08 01:11:27
I'm using the kml package in R to cluster my data, and in the end I need a CSV file with a column giving the cluster number for each id. The data has many missing values, so I can't use the kmeans function without deleting all observations, but kml handles that nicely. My problem is that I use choice() to export the results and all I get is a graphical window, with no output files. Here is my code:

    setwd("/Volumes/NATASHKA/api/R files")
    statadata <- read.dta("Data_wide_withdemogr_auris_for_kml_negative.dta")
    mydata <- data.frame(statadata)
    cldDQ <- cld(mydata)
    kml(cldDQ,c(2:6)

hierarchical cluster labeling with plots

妖精的绣舞 submitted on 2019-12-08 00:04:12
Question: I have a distance matrix for ~20 elements, which I am using to do hierarchical clustering in R. Is there a way to label elements with a plot or a picture instead of just numbers, characters, etc.? So, instead of numbers, the leaf nodes would have small plots or pictures. Here is why I'm interested in this functionality. I have 2-D scatterplots like these (color indicates density): http://www.pnas.org/content/108/51/20455/F2.large.jpg (note that this is not my own data). I have to analyze

R - cluster analysis on binary weblog data

↘锁芯ラ submitted on 2019-12-07 22:49:39
Question: I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I want to cluster this data. My main goal is to find similar users based on their online behaviour. What is a good clustering algorithm for this? I have tried k-means, which does not work well with binary data. I have also tried spherical k-means, skmeans(). I wanted to do a sum of squared error scree plot, but I

Hierarchical Agglomerative clustering in Spark

﹥>﹥吖頭↗ submitted on 2019-12-07 20:24:23
Question: I am working on a clustering problem, and it has to scale to a lot of data. I would like to try hierarchical clustering in Spark and compare my results with other methods. I have done some research on the web about using hierarchical clustering with Spark but haven't found any promising information. If anyone has some insight about it, I would be very grateful. Thank you.
Answer 1: The bisecting k-means approach seems to do a decent job and runs quite fast. Here is a

Number clustering/partitioning algorithm

怎甘沉沦 submitted on 2019-12-07 18:43:55
Question: I have an ordered 1-D array of numbers. Both the array length and the values of the numbers in the array are arbitrary. I want to partition the array into k partitions according to the number values. For example, say I want 4 partitions, distributed as 30% / 30% / 20% / 20%, i.e. the top 30% of values first, the next 30% afterwards, etc. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it should not be contained in two
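The partitioning described above can be sketched as follows. This is only one possible reading, not an answer from the original thread. It assumes the array is sorted in descending order, that the percentages are integers summing to 100, and that the truncated last sentence means a run of equal values must stay inside a single partition. The function name partition_by_percent is hypothetical.

```python
def partition_by_percent(values, percents):
    """Split a sorted list into consecutive groups sized by integer percentages.

    Assumption: equal values must never be split across two groups, so each
    cut point is pushed forward until it no longer falls inside a run of ties.
    """
    n = len(values)
    parts = []
    start = 0
    cumulative = 0
    for i, pct in enumerate(percents):
        cumulative += pct
        if i == len(percents) - 1:
            end = n  # the last group takes whatever remains
        else:
            # Nearest-integer cut point, computed with integer arithmetic only.
            end = max(start, (cumulative * n + 50) // 100)
            # Move the cut past any run of equal values it would split.
            while 0 < end < n and values[end] == values[end - 1]:
                end += 1
        parts.append(values[start:end])
        start = end
    return parts


data = [10, 9, 9, 9, 7, 6, 5, 4, 2, 1]
groups = partition_by_percent(data, [30, 30, 20, 20])
print(groups)
```

Because ties push a cut point forward, a group can end up larger than its nominal share, and later groups correspondingly smaller (possibly empty).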

Plot multi-dimension cluster to 2D plot python

空扰寡人 submitted on 2019-12-07 18:41:29
Question: I am clustering a lot of data, which contains two different types of cluster: the first type is 6-dimensional, whereas the second type is 12-dimensional. For now I have decided to use kmeans (as it seems the most intuitive clustering algorithm to start with). The question is how I can map these clusters onto a 2D plot so that I can judge whether kmeans is working or not. I would like to use matplotlib, but any other Python package is fine. Cluster 1 is a cluster made up of
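One common way to view higher-dimensional clusters on a 2D plot is to project the points onto their first two principal components (PCA). The question does not mention PCA, so this is a swapped-in technique offered as a sketch. The synthetic 6-dimensional data and the function name project_to_2d are hypothetical.

```python
import numpy as np


def project_to_2d(X):
    """Project rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA requires mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Hypothetical data: two well-separated blobs in 6 dimensions.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 6)),
    rng.normal(5.0, 1.0, size=(50, 6)),
])
coords = project_to_2d(points)
print(coords.shape)
```

The two columns of coords can then be passed to matplotlib's scatter function, colored by the k-means labels, to check visually whether the clusters separate. A 2D projection can hide separation that exists in the other dimensions, so overlap in the plot does not by itself show that kmeans failed.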

TCP/IP communication in Matlab

荒凉一梦 submitted on 2019-12-07 18:18:38
Question: I want to build my own Matlab cluster from lots of junk computers. Does anybody know how to send data from one Matlab instance to another over TCP? I need to send image chunks / .mat files and variables. Thanks, SW
Answer 1: You can use the Distributed Computing Toolbox ($$$) or the jPar utility from the File Exchange (free).
Answer 2: TCP/UDP/IP Toolbox 2.0.6 from Matlab Exchange offers a TCP/IP implementation. When I last checked, about a year ago, it was far more performant than the one available by the