cluster-analysis

Basic clustering with R

若如初见. submitted on 2019-12-11 11:16:36
Question: I'm new to R and data analysis. I'm trying to create a simple custom recommendation system for a website. As input I have the user/session ID, item ID, and item price for each item a user clicked on:
    c165c2ee-81cf-48cf-ba3f-83b70204c00c 161785 124.0
    a886fdd5-7cee-4152-b1b7-77a2702687b0 643339 42.0
    5e5fd670-b104-445b-a36d-b3798cd43279 131332 38.0
    888d736f-99bc-49ca-969d-057e7d4bb8d1 1032763 39.0
I would like to apply cluster analysis to that data. If I try to apply k-means clustering to my data:

Clustering of images to evaluate diversity (Weka?)

五迷三道 submitted on 2019-12-11 10:43:24
Question: For a university course I have some features of images (as text files). I have to rank those images according to their diversity. The idea I have in mind is to run k-means on the image features and then compute the Euclidean distance from each image in a cluster to that cluster's centroid. Then rotate between clusters, always taking the (next) closest image to the centroid. I.e., return closest to centroid 1, then closest to centroid 2, then 3.... then second closest to
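The rotation described in this question can be sketched as follows. This is only an illustration, assuming the cluster labels and centroids have already been produced by some k-means run; the function name `round_robin_rank` and the toy data are hypothetical and not from the original post.

```python
import math
from itertools import zip_longest


def round_robin_rank(features, labels, centroids):
    """Return item indices ranked by cycling through clusters,
    taking the item closest to each centroid first."""
    per_cluster = []
    for c, centroid in enumerate(centroids):
        # Indices of the items assigned to cluster c.
        members = [i for i, lab in enumerate(labels) if lab == c]
        # Closest to the centroid first (Euclidean distance).
        members.sort(key=lambda i: math.dist(features[i], centroid))
        per_cluster.append(members)

    ranking = []
    # Round 1 takes the closest item of every cluster, round 2 the
    # second closest, and so on; exhausted clusters are skipped.
    for round_items in zip_longest(*per_cluster):
        ranking.extend(i for i in round_items if i is not None)
    return ranking


# Toy data (made up for illustration): two clusters in 2-D.
features = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (5.0, 5.0), (5.5, 5.0)]
labels = [0, 0, 0, 1, 1]
centroids = [(0.4, 0.0), (5.2, 5.0)]
print(round_robin_rank(features, labels, centroids))
```

Clusters of unequal size simply drop out of the rotation once they are exhausted, which may or may not be the desired behaviour for a diversity ranking.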

Algorithm to Cluster Similar Strings in Python?

跟風遠走 submitted on 2019-12-11 09:48:29
Question: I'm working on a script that currently contains multiple lists of DNA sequences (each list has a varying number of DNA sequences) and I need to cluster the sequences in each list based on Hamming distance similarity. My current implementation of this (very crude at the moment) extracts the first sequence in the list and calculates the Hamming distance to each subsequent sequence. If a sequence is within a certain Hamming distance, it is appended to a new list which later is used to remove sequences
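A minimal sketch of the greedy procedure described in this question, assuming all sequences have equal length and that the first remaining sequence acts as the seed of each cluster. The names `hamming` and `greedy_cluster` and the threshold are hypothetical; this is not the asker's actual code.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(seqs, max_dist):
    """Repeatedly take the first remaining sequence as a seed and
    group every remaining sequence within max_dist of that seed."""
    remaining = list(seqs)
    clusters = []
    while remaining:
        seed = remaining[0]
        cluster = [s for s in remaining if hamming(seed, s) <= max_dist]
        clusters.append(cluster)
        # Keep only the sequences that did not join this cluster.
        remaining = [s for s in remaining if hamming(seed, s) > max_dist]
    return clusters


# Toy sequences (made up for illustration).
seqs = ["ACGT", "ACGA", "TTTT", "TTTA", "ACCT"]
print(greedy_cluster(seqs, 1))
```

The result depends on the order of the input list, because each seed is simply whichever sequence happens to come first. That is a known limitation of this kind of greedy grouping.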

cluster one-dimensional data using pvclust

自古美人都是妖i submitted on 2019-12-11 09:14:17
Question: Thanks for taking the time to read this question. I have some one-dimensional data to cluster in R. The basic hclust command works fine. The pvclust command, however, does not take one-dimensional data, and keeps saying:
    Error in hclust(distance, method = method.hclust) : must have n >= 2 objects to cluster
I found a workaround: I added some all-zero rows to the data. So the data becomes:
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    [1,] 7.424 14.251 15.957 1.542 2.451 20.836 13.534

Selecting non-overlapping best quality clusters

♀尐吖头ヾ submitted on 2019-12-11 09:14:04
Question: Say I have done clustering on my dataset and have 10 clusters. These clusters are non-overlapping. Now assume I change some feature in all my data points and do the clustering again, which gives me 10 more clusters. If I repeat this, say, 3 more times, I end up with 50 clusters. Each cluster has a score associated with it that is calculated from its constituent data points. These 50 clusters now have overlapping data points. I want to select all possible non-overlapping clusters out of
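The question is cut off before it states the selection criterion. One possible reading, sketched below, is to choose the set of pairwise-disjoint clusters with the highest total score. That objective, the function name `best_disjoint`, and the toy data are all assumptions. The search is exhaustive, so it is only meant for small inputs.

```python
def best_disjoint(clusters):
    """clusters: list of (score, frozenset_of_point_ids).
    Returns (best_total_score, indices_of_chosen_clusters)."""
    best = (0.0, [])

    def search(i, used, total, chosen):
        nonlocal best
        if i == len(clusters):
            if total > best[0]:
                best = (total, list(chosen))
            return
        score, points = clusters[i]
        # Option 1: take cluster i if it shares no point with earlier picks.
        if not (points & used):
            chosen.append(i)
            search(i + 1, used | points, total + score, chosen)
            chosen.pop()
        # Option 2: skip cluster i.
        search(i + 1, used, total, chosen)

    search(0, frozenset(), 0.0, [])
    return best


# Toy clusters (made up): (score, member point ids).
clusters = [
    (3.0, frozenset({1, 2})),
    (2.0, frozenset({2, 3})),
    (2.5, frozenset({3, 4})),
    (1.0, frozenset({5})),
]
print(best_disjoint(clusters))
```

Under this reading the task is a weighted set-packing problem, so the running time of an exact search grows exponentially with the number of clusters in the worst case.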

Automated grouping in SAS with minimizing variance within group

北城以北 submitted on 2019-12-11 09:09:06
Question: I am trying to build an automated grouping. The goal is to select the grouping with the lowest variance. In other words, I want to find natural numbers x and y for the following,
    GROUP 1: 1997 - x
    GROUP 2: x+1 - y
    GROUP 3: y+1 - 1994
such that the sum of (variance(Response in Group 1), variance(Response in Group 2), variance(Response in Group 3)) is minimized.
    data maindat;
    input Year Response ;
    datalines;
    1994 -4.300511714
    1994 -9.646920963
    1994 -15.86956805
    1993 -16
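The search for x and y can be illustrated with a brute-force sketch in Python rather than SAS. It assumes the groups are consecutive year ranges in ascending order (year <= x, x < year <= y, year > y) and uses the population variance so that single-observation groups are allowed. The function name `best_breaks` and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import pvariance


def best_breaks(rows):
    """rows: list of (year, response) pairs.
    Returns (total_variance, x, y) for the best pair of breakpoints."""
    years = sorted({year for year, _ in rows})
    best = None
    # x < y, and y is never the last year, so all three groups are non-empty.
    for x, y in combinations(years[:-1], 2):
        groups = [
            [r for year, r in rows if year <= x],
            [r for year, r in rows if x < year <= y],
            [r for year, r in rows if year > y],
        ]
        total = sum(pvariance(g) for g in groups)
        if best is None or total < best[0]:
            best = (total, x, y)
    return best


# Toy data (made up for illustration, not the asker's data).
rows = [(1990, 1.0), (1991, 1.2), (1992, 5.0),
        (1993, 5.1), (1994, 9.0), (1995, 9.3)]
total, x, y = best_breaks(rows)
print(x, y)
```

With only two breakpoints the number of candidate (x, y) pairs is quadratic in the number of distinct years, so an exhaustive search is usually cheap. Note that minimizing the plain sum of variances can favour very small groups, since a one-observation group contributes zero.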

Cut off point in k-means clustering in sas

好久不见. submitted on 2019-12-11 08:57:51
Question: I want to classify my data into clusters with cut-off points in SAS. The method I use is k-means clustering. (I don't mind which method is used, as long as it gives me 3 groups.) My code for clustering:
    proc fastclus data=maindat outseed=seeds1 maxcluster=3 maxiter=0;
    var value resid;
    run;
I have a problem with the output. I want the cut-off points for Value to be included in the output file. (I don't want the cut-off points for Resid.) So is there any way to do this in SAS? Edit: As
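As a language-neutral illustration (in Python, not SAS): if the clustering were done on Value alone, each point would go to its nearest cluster centre, so the cut-off between two neighbouring clusters would be the midpoint of their centres. That one-variable assumption does not hold for the two-variable call shown in the question, and the helper name `cutoffs` is hypothetical.

```python
def cutoffs(centers):
    """Midpoints between adjacent cluster centres on a single variable.
    For nearest-centre assignment in one dimension, these midpoints
    are the boundaries between clusters."""
    ordered = sorted(centers)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


# Hypothetical centres of three clusters on the Value variable.
print(cutoffs([10.0, 2.0, 6.0]))
```

When two variables are clustered together, the boundary between clusters is a line in the (Value, Resid) plane rather than a single Value threshold, so a midpoint like this is at best an approximation.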

R: igraph, matching members of a “known” cluster to members of observed clusters returning a %match

霸气de小男生 submitted on 2019-12-11 08:19:18
Question: I'm using the Walktrap community detection method to return a number (19 in this case) of clusters. I have a list of members which belong to one or more of these clusters. I need a method to search each cluster for the presence of the members and return the percentage of matches found (e.g. cluster[0] = 0%, cluster[1] = Y%, ..., cluster[18] = Z%), thus selecting the optimum cluster that represents the members on the list. Once the optimum cluster is found, I need a method to count the number of
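The matching step can be sketched in Python rather than R/igraph. This assumes each cluster is available as a plain collection of member names and that "percentage of matches" means the share of the known list found in each cluster; both are assumptions, and `match_percentages` is a hypothetical helper.

```python
def match_percentages(clusters, known):
    """For each cluster, the percentage of known members it contains."""
    known = set(known)
    if not known:
        raise ValueError("known member list must not be empty")
    return [100.0 * len(known & set(c)) / len(known) for c in clusters]


# Toy clusters and known list (made up for illustration).
clusters = [["a", "b", "c"], ["d", "e"], ["f"]]
known = ["a", "b", "e", "z"]

pct = match_percentages(clusters, known)
# Index of the cluster with the highest match percentage.
best = max(range(len(pct)), key=pct.__getitem__)
print(pct, best)
```

If the percentage should instead be relative to the cluster size, the denominator `len(known)` would be replaced by the size of each cluster.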

'numpy.float64' object is not iterable - meanshift clustering

血红的双手。 submitted on 2019-12-11 07:28:52
Question: Python newbie here. I am trying to run this code but I get an error message saying that the object is not iterable. I would appreciate some advice on what I am doing wrong. Thanks.
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    temp = pd.read_csv("file.csv", encoding='latin-1')
    xy = temp.ix[:,2:6]
    X = xy.values
    X
    array([[ nan, nan],
           [ nan, nan],
           [ 3.92144000e+00, nan],
           [ 4.42382000e+00, nan],
           [ 4.18931000e+00, 5.61562775e+02],
           [ nan, nan],
           [ 4.33025000e+00, 6.73123391e+02],
           [

Kmeans clustering using jaccard distance matrix

这一生的挚爱 submitted on 2019-12-11 07:14:39
Question: I'm trying to create a Jaccard distance matrix and perform k-means on it to output cluster IDs and the IDs of the elements in each cluster. The input is Twitter tweets. The following is the code; I couldn't work out how to use initial seeds from a file for k-means.
    install.packages("rjson", dependencies=TRUE)
    library("rjson")
    install.packages("jsonlite", dependencies=TRUE)
    library("jsonlite")
    install.packages("stringdist", dependencies=TRUE)
    library("stringdist")
    data <- fromJSON
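The idea of seeding clusters from known tweet IDs can be illustrated in Python rather than R. The sketch below computes the Jaccard distance on word sets and performs a single assignment step to the nearest seed tweet. It is not a full k-means, and it assumes the seed IDs have already been read from the file. The names `jaccard_distance` and `assign_to_seeds` and the toy tweets are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on the lower-cased word sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def assign_to_seeds(tweets, seed_ids):
    """tweets: dict of tweet id -> text. seed_ids: ids used as initial centres.
    Returns dict of seed id -> list of tweet ids closest to that seed."""
    clusters = {seed: [] for seed in seed_ids}
    for tweet_id, text in tweets.items():
        nearest = min(
            seed_ids,
            key=lambda seed: jaccard_distance(text, tweets[seed]),
        )
        clusters[nearest].append(tweet_id)
    return clusters


# Toy tweets (made up for illustration).
tweets = {
    1: "the cat sat",
    2: "the cat sat down",
    3: "stock markets fall",
    4: "markets fall again",
}
print(assign_to_seeds(tweets, [1, 3]))
```

Because a Jaccard distance matrix has no natural "mean" point, iterating this step typically means choosing a representative tweet per cluster (a medoid) rather than averaging coordinates as standard k-means does.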