cluster-analysis

Weka Clustering Results Differ for Same Settings

霸气de小男生 submitted on 2019-12-11 06:29:53
Question: I am using Weka to cluster some data and ran into a very odd problem. When I use the normal "Cluster" tool on a data set, I get this result:

Cluster 1: 87 instances
Cluster 2: 88 instances
Cluster 3: 181 instances

This is roughly what I expected from the data I had, so I consider it a good result. However, I want to add the cluster as a class and save it as a new .arff file, so I am trying to use the "Add Cluster" filter that Weka provides. Now, in this filter, I select …
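One common source of such discrepancies with k-means-style clusterers is the random initialization seed, which may need to be set separately wherever the clusterer is configured (the Cluster panel and the AddCluster filter each hold their own copy of the clusterer's options). A minimal sketch of the effect, shown in Python with scikit-learn rather than Weka:

```python
# Illustration only (scikit-learn, not Weka): with a single initialization,
# k-means can converge to different local optima for different seeds, so two
# runs with otherwise identical settings can report different cluster sizes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=356, centers=3, cluster_std=3.0, random_state=0)

for seed in (1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}:", np.bincount(km.labels_))
```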

Python cluster variables in list of tuples by 2 factors simultaneously

∥☆過路亽.° submitted on 2019-12-11 06:02:44
Question: Hi guys, I have the following code:

```python
from math import sqrt

array = [(1,'a',10), (2,'a',11), (3,'c',200), (60,'a',12), (70,'t',13),
         (80,'g',300), (100,'a',305), (220,'c',307), (230,'t',306), (250,'g',302)]

def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum([pair[0] for pair in lst])/n
    ## mean2 = sum([pair[2] for pair in lst])/n
    stdev = sqrt((sum(x[0]*x[0] for x in lst) / n) - (mean * mean))
    ## stdev2 = sqrt((sum(x[2]*x[2] for x in lst) / n) - …
```
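Assuming "cluster by 2 factors simultaneously" means grouping on both numeric tuple fields (the first and third elements) at once, one simple approach is to standardize the two columns and let a density-based algorithm find groups in the combined 2-D space. A minimal sketch (DBSCAN and its parameters are my choice, not the poster's):

```python
# Sketch: treat (first element, third element) of each tuple as a 2-D point,
# standardize both axes so neither factor dominates, then cluster with DBSCAN.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

array = [(1,'a',10), (2,'a',11), (3,'c',200), (60,'a',12), (70,'t',13),
         (80,'g',300), (100,'a',305), (220,'c',307), (230,'t',306), (250,'g',302)]

X = StandardScaler().fit_transform([(t[0], t[2]) for t in array])
labels = DBSCAN(eps=0.7, min_samples=2).fit_predict(X)  # eps is illustrative
for tup, lab in zip(array, labels):
    print(lab, tup)
```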

MemoryError from sklearn.metrics.silhouette_samples

陌路散爱 submitted on 2019-12-11 05:35:40
Question: I get a MemoryError when trying to call sklearn.metrics.silhouette_samples. My use case is identical to this tutorial. I am using scikit-learn 0.18.1 in Python 3.5. For the related function, silhouette_score, this post suggests using the sample_size parameter, which reduces the sample size before calling silhouette_samples. I am not sure that the down-sampling would still produce reliable results, so I hesitate to do that. My input, X, is a [107545 rows x 12 columns] dataframe which I …
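The error is plausible on memory grounds alone: silhouette_samples works from pairwise distances, and a full 107,545 × 107,545 matrix of float64 distances is roughly 107545² × 8 bytes ≈ 92 GB. One way to address the reliability worry about sample_size is to repeat the subsampled score with several random seeds and check the spread. A sketch on synthetic data (not the poster's dataframe):

```python
# Sketch: estimate the silhouette on random subsamples and check stability.
# A small standard deviation across seeds suggests the subsampled estimate
# is reliable for this data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=20000, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

scores = [silhouette_score(X, labels, sample_size=5000, random_state=s)
          for s in range(5)]
print(np.mean(scores), np.std(scores))
```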

Clustering by groups [duplicate]

可紊 submitted on 2019-12-11 05:00:29
Question: This question already has answers here: Add ID column by group [duplicate] (4 answers). Closed 2 years ago.

How can I perform clustering by groups? For example, take this Pokemon dataset on Kaggle. A sample of this dataset looks like this (some fields changed to mimic my data):

Name                       Type I  Type II
Bulbasaur                  Grass   Poison
Bulbasaur 2                Grass   Poison
Venusaur                   Grass   Not Null
VenusaurMega Venusaur      Grass   Not Null
...
Charizard                  Fire    Flying
CharizardMega Charizard X  Fire    Dragon

Supposing there are no …
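Assuming the goal is to run a separate clustering within each group (here, each "Type I" value), pandas' groupby makes this straightforward. A minimal sketch with a made-up numeric feature, since the excerpt does not show which columns are clustered:

```python
# Sketch: fit one k-means model per group; the numeric column "x" is
# hypothetical, standing in for whatever features the real data has.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Name":   ["Bulbasaur", "Bulbasaur 2", "Venusaur", "Charizard"],
    "Type I": ["Grass", "Grass", "Grass", "Fire"],
    "x":      [1.0, 1.1, 5.0, 9.0],
})

def cluster_group(g, k=2):
    k = min(k, len(g))  # guard against groups smaller than k
    g = g.copy()
    g["cluster"] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(g[["x"]])
    return g

print(df.groupby("Type I", group_keys=False).apply(cluster_group))
```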

Simulating Co-occurrence data in R

家住魔仙堡 submitted on 2019-12-11 04:46:07
Question: I am trying to create a data set of co-occurrence data where the variable of interest is a software application, and I want to simulate an n-by-n matrix in which each cell holds the number of times application A was used together with application B. How can I create a data set in R that I can use to test a set of clustering and partitioning algorithms? What model would I use, and how would I generate the data in R?

Answer 1:

```r
n <- 10
apps <- LETTERS[1:n]
data <- matrix(0, n, n)
rownames(data) <- …
```
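One plausible generative model (my choice, not necessarily what the truncated answer uses) is to draw Poisson counts with higher rates inside planted clusters than between them, so that the algorithms under test have known structure to recover. A sketch in Python/numpy rather than R:

```python
# Sketch: simulate a symmetric co-occurrence count matrix with a planted
# two-block structure. Within-block pairs co-occur more often (rate 20)
# than between-block pairs (rate 2); both rates are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 10
apps = [chr(ord("A") + i) for i in range(n)]
block = np.array([0] * 5 + [1] * 5)  # planted cluster labels
rates = np.where(block[:, None] == block[None, :], 20.0, 2.0)

counts = np.triu(rng.poisson(rates), 1)
counts = counts + counts.T  # symmetric, zero diagonal
print(counts)
```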

Clustering algorithm with different epsilons on different axes

一曲冷凌霜 submitted on 2019-12-11 03:57:01
Question: I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis: for instance, an epsilon of 10 m on the x-y plane and an epsilon of 0.2 m on the z axis. Essentially, I am looking for large but flat clusters. Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered over large surfaces but within narrow vertical layers.

Answer 1: Solution 1: Scale your data set to …
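The scaling idea works because dividing each axis by its desired epsilon turns an axis-dependent neighbourhood into an ordinary spherical one with eps = 1. A sketch of that rescaling on synthetic coordinates (the answer's own code is truncated above):

```python
# Sketch: rescale each axis by its per-axis epsilon so a single DBSCAN
# eps=1.0 means "within 10 m in x and y, and within 0.2 m in z".
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(0, 100, 200),   # x in metres
                       rng.uniform(0, 100, 200),   # y in metres
                       rng.normal(0, 0.05, 200)])  # z: a thin layer

eps_per_axis = np.array([10.0, 10.0, 0.2])
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(pts / eps_per_axis)
print(np.unique(labels, return_counts=True))
```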

MATLAB - Classification output

烈酒焚心 submitted on 2019-12-11 03:52:02
Question: My program runs k-means clustering with a user-specified number of clusters; here k = 4. I would then like to run the clustered data through MATLAB's naive Bayes classifier. Is there a way to split the clusters up and feed them into different naive Bayes classifiers in MATLAB?

Naive Bayes:

```matlab
class = classify(test, training, target_class, 'diaglinear');
```

K-means:

```matlab
%% generate sample data
K = 4;
numObservarations = 5000;
dimensions = 42;

%% cluster
opts = statset('MaxIter', 500, …
```
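The general pattern, independent of MATLAB, is to partition the rows by their cluster label and train one classifier per partition. A sketch in Python/scikit-learn rather than MATLAB, on synthetic data:

```python
# Sketch: split the data by k-means cluster label and fit a separate
# Gaussian naive Bayes model on each cluster's rows.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=42, random_state=0)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

models = {}
for c in range(4):
    mask = clusters == c
    if len(np.unique(y[mask])) > 1:  # need at least two classes to train
        models[c] = GaussianNB().fit(X[mask], y[mask])
print(sorted(models))
```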

Calculate local clustering coefficient of a vertex (node) with R (by hand)

杀马特。学长 韩版系。学妹 submitted on 2019-12-11 03:22:16
Question: I found an example showing how to calculate the local clustering coefficient (LCC) by hand (see image). How can I replicate these steps in R? The focus is on finding the "actual number of links among neighbors" (the middle step), which I would prefer to calculate by hand. Does the igraph package provide this number? Example adjacency matrix:

```r
matrix(data = c(0,1,0,1,1,0,0,1,1,1,0,1,1,0,1,0), ncol = 4)
```

Answer 1: All of this can be done in igraph. It is nice that you gave an example, but since the graph is fully connected, all vertices …
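For the "by hand" part, the middle step is just counting the edges among a vertex's neighbours and dividing by the number of possible edges, k(k-1)/2. A sketch in Python/numpy rather than R (the question's matrix is not quite symmetric, so it is symmetrised here, which yields the fully connected graph the answer refers to):

```python
# Sketch: local clustering coefficient from an adjacency matrix, by hand.
# The data vector is the question's, read column-wise as R's matrix() would.
import numpy as np

A = np.array([0,1,0,1,1,0,0,1,1,1,0,1,1,0,1,0]).reshape(4, 4, order="F")
A = np.maximum(A, A.T)  # assume an undirected graph

def local_clustering(A, v):
    nbrs = np.flatnonzero(A[v])                # neighbours of v
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = A[np.ix_(nbrs, nbrs)].sum() / 2    # actual links among neighbours
    return links / (k * (k - 1) / 2)           # divided by possible links

print([local_clustering(A, v) for v in range(4)])  # all 1.0 for this graph
```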

Plotting clusters using k-means with distance from centroid

[亡魂溺海] submitted on 2019-12-11 03:13:40
Question: I am trying to create a plot similar to this [image omitted]: there are three clusters, and all the data points (circles) are plotted according to their Euclidean distance from the centroid. From this image it is easy to see that 5 samples from class 2 ended up in the wrong clusters. I'm running k-means using kmeans and can't figure out how to plot this type of graph. For example purposes we can use the iris dataset:

```r
> iri <- iris
> cl <- kmeans(iri[, 1:4], 3)
> cl
K-means clustering with 3 clusters of sizes …
```
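One plausible reading of the desired plot: for each point, compute the Euclidean distance to its assigned centroid and plot that distance per cluster, coloured by the true class so misassigned samples stand out. A sketch in Python/matplotlib rather than R, on the same iris data:

```python
# Sketch: distance-from-centroid plot for k-means on iris. Points are
# grouped on the x-axis by assigned cluster (with jitter) and coloured by
# true class, so samples that landed in the "wrong" cluster are visible.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

jitter = np.random.default_rng(0).uniform(-0.15, 0.15, len(X))
plt.scatter(km.labels_ + jitter, dist, c=y, cmap="viridis", s=20)
plt.xticks([0, 1, 2], ["cluster 1", "cluster 2", "cluster 3"])
plt.ylabel("distance from centroid")
plt.show()
```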

Clustering string data with ELKI

≯℡__Kan透↙ submitted on 2019-12-11 02:42:39
Question: I need to cluster a large number of strings using ELKI, based on the edit distance / Levenshtein distance. Since the data set is too large, I'd like to avoid file-based precomputed distance matrices. How can I (a) load string data in ELKI from a file (only "Labels")? and (b) implement a distance function that accesses the labels (extend AbstractDBIDDistanceFunction, but how do I get the labels)? Some code snippets or example input files would be helpful.

Answer 1: It's actually pretty straightforward: A) …
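For the underlying idea only (not ELKI's Java API, and using an in-memory matrix that would not scale to the poster's "too large" data set), here is a sketch of edit-distance clustering in Python:

```python
# Sketch: pairwise Levenshtein distances fed to DBSCAN with a precomputed
# metric. This is the concept the poster wants inside ELKI, where a custom
# distance function would avoid materializing the full matrix.
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ["kitten", "sitting", "mitten", "flask", "flash", "flasks"]
D = np.array([[levenshtein(a, b) for b in words] for a in words])
print(DBSCAN(eps=2, min_samples=2, metric="precomputed").fit_predict(D))
```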