cluster-analysis | 易学教程

Basic clustering with R

Question: I'm new to R and data analysis. I'm trying to create a simple custom recommendation system for a website. As input I have the user/session ID, item ID, and item price of each item a user clicked:
c165c2ee-81cf-48cf-ba3f-83b70204c00c 161785 124.0
a886fdd5-7cee-4152-b1b7-77a2702687b0 643339 42.0
5e5fd670-b104-445b-a36d-b3798cd43279 131332 38.0
888d736f-99bc-49ca-969d-057e7d4bb8d1 1032763 39.0
I would like to apply cluster analysis to that data. If I try to apply k-means clustering to my data:

Clustering of images to evaluate diversity (Weka?)

Question: For a university course I have some features of images (as text files). I have to rank those images according to their diversity. The idea I have in mind is to run k-means clustering on the images and then compute the Euclidean distance from each image in a cluster to that cluster's centroid. Then rotate between clusters, always taking the (next) closest image to the centroid. I.e., return the closest to centroid 1, then the closest to centroid 2, then 3.... then the second closest to
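
The rotation described in this question can be sketched as follows. This is only an illustration, not the asker's code: it assumes the clustering has already been done, so the cluster labels and centroids are plain inputs, and the function name `diversity_ranking` and the sample data are hypothetical.

```python
import math
from itertools import zip_longest

def diversity_ranking(features, labels, centroids):
    """Rank item indices by rotating over clusters, nearest-to-centroid first."""
    # Group item indices by their cluster label.
    per_cluster = {}
    for idx, label in enumerate(labels):
        per_cluster.setdefault(label, []).append(idx)

    # Within each cluster, order members by Euclidean distance to the centroid.
    queues = []
    for label in sorted(per_cluster):
        centroid = centroids[label]
        members = sorted(per_cluster[label],
                         key=lambda i: math.dist(features[i], centroid))
        queues.append(members)

    # Round-robin: closest of each cluster, then second closest of each, ...
    ranking = []
    for picks in zip_longest(*queues):
        ranking.extend(i for i in picks if i is not None)
    return ranking

# Hypothetical feature vectors, labels, and centroids (assumed, not from the question).
features = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 6.0), (1.0, 0.0)]
labels = [0, 0, 1, 1, 0]
centroids = {0: (0.0, 0.0), 1: (5.0, 5.2)}
ranking = diversity_ranking(features, labels, centroids)
print(ranking)
```

Clusters of unequal size are handled by skipping exhausted clusters in later rounds; whether that is the desired behaviour for the course task is an open design choice.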

Algorithm to Cluster Similar Strings in Python?

Question: I'm working on a script that currently contains multiple lists of DNA sequences (each list has a varying number of DNA sequences), and I need to cluster the sequences in each list based on Hamming distance similarity. My current implementation (very crude at the moment) extracts the first sequence in the list and calculates its Hamming distance to each subsequent sequence. If a sequence is within a certain Hamming distance, it is appended to a new list which later is used to remove sequences
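
A minimal sketch of the greedy approach the question outlines, assuming all sequences in a list have equal length and that "within a certain Hamming distance" means at most `max_distance` mismatches from the first remaining sequence. The names `hamming` and `greedy_cluster` and the sample sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(sequences, max_distance):
    """Repeatedly take the first remaining sequence as a seed and
    pull in every other sequence within max_distance of that seed."""
    remaining = list(sequences)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster, rest = [seed], []
        for seq in remaining:
            if hamming(seed, seq) <= max_distance:
                cluster.append(seq)
            else:
                rest.append(seq)
        remaining = rest  # only unclustered sequences go to the next pass
        clusters.append(cluster)
    return clusters

# Hypothetical input (assumed, not taken from the question).
seqs = ["ACGT", "ACGA", "TTTT", "TTTA", "ACCT"]
clusters = greedy_cluster(seqs, max_distance=1)
print(clusters)
```

The result depends on the order of the input, because each cluster is defined relative to its seed rather than to all of its members.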

Cluster one-dimensional data using pvclust

Question: Thanks for taking the time to read this question. I have some one-dimensional data to cluster in R. The basic hclust command works fine, but the pvclust command does not accept one-dimensional data and keeps saying:
Error in hclust(distance, method = method.hclust) : must have n >= 2 objects to cluster
I found a workaround: I added some all-zero rows to the data. So the data becomes:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 7.424 14.251 15.957 1.542 2.451 20.836 13.534

Selecting non-overlapping best quality clusters

Question: Say I have done clustering on my dataset and have 10 clusters. These clusters are non-overlapping. Now assume I change some feature in all my data points and cluster again, which gives me 10 more clusters. If I repeat this, say, 3 more times, I end up with 50 clusters. Each cluster has a score associated with it that is calculated from its constituent data points. These 50 clusters now have overlapping data points. I want to select all possible non-overlapping clusters out of

Automated grouping in SAS with minimizing variance within group

Question: I am trying to build an automated grouping. The goal is to select the grouping with the lowest variance. In other words, I want to find x and y, where x and y are natural numbers, for the following:
GROUP 1: 1997 - x
GROUP 2: x+1 - y
GROUP 3: y+1 - 1994
such that the sum of (variance(Response in Group 1), variance(Response in Group 2), variance(Response in Group 3)) is minimized.
data maindat;
input Year Response ;
datalines;
1994 -4.300511714
1994 -9.646920963
1994 -15.86956805
1993 -16
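
The search for x and y can be illustrated with a brute-force sketch. It is written in Python rather than SAS, so it shows the logic only. Assumptions that are not in the question: years are treated in ascending order, each group must contain at least one year, population variance is used, and the function name `best_breakpoints` and the sample records are hypothetical.

```python
from itertools import combinations
from statistics import pvariance

def best_breakpoints(records):
    """Try every pair of breakpoints x < y and return (total_variance, x, y)
    for the split with the smallest sum of within-group variances."""
    years = sorted({year for year, _ in records})
    best = None
    # Excluding the last year keeps the third group non-empty.
    for x, y in combinations(years[:-1], 2):
        groups = [
            [resp for year, resp in records if year <= x],
            [resp for year, resp in records if x < year <= y],
            [resp for year, resp in records if year > y],
        ]
        total = sum(pvariance(g) for g in groups)
        if best is None or total < best[0]:
            best = (total, x, y)
    return best

# Hypothetical (Year, Response) records, not the question's data.
records = [
    (1990, 1.0), (1990, 1.2), (1991, 1.1),
    (1992, 5.0), (1992, 5.2),
    (1993, 9.0), (1994, 9.1),
]
total, x, y = best_breakpoints(records)
print(x, y, round(total, 4))
```

With n distinct years this evaluates on the order of n² splits, which is cheap for a few decades of data.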

Cut off point in k-means clustering in SAS

Question: I want to classify my data into clusters with a cut-off point in SAS. The method I use is k-means clustering. (I don't mind which method, as long as it gives me 3 groups.) My code for clustering:
proc fastclus data=maindat outseed=seeds1 maxcluster =3 maxiter=0;
var value resid;
run;
I have a problem with the output. I want the cut-off point for Value to be included in the output file. (I don't want the cut-off point for Resid.) Is there any way to do this in SAS? Edit: As

R: igraph, matching members of a “known” cluster to members of observed clusters returning a %match

Question: I'm using the Walktrap community detection method to return a number (19 in this case) of clusters. I have a list of members which belong to one or more of these clusters. I need a method to search each cluster for the presence of the members and return the percentage of matches found (e.g. cluster[0] = 0%, cluster[1] = Y%, ..., cluster[18] = Z%), thus selecting the optimum cluster that represents the members on the list. Once the optimum cluster is found, I need a method to count the number of
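
The matching step can be sketched independently of igraph. The example below is in Python and assumes each cluster is available as a plain list of member names, and that "percentage of matches" means the share of the known list found in a given cluster. Both are assumptions, and `match_percentages` and the sample data are hypothetical.

```python
def match_percentages(clusters, known_members):
    """For each cluster, the percentage of known members it contains."""
    known = set(known_members)
    if not known:
        raise ValueError("known_members must not be empty")
    return [100.0 * len(known & set(cluster)) / len(known)
            for cluster in clusters]

# Hypothetical clusters and member list (assumed, not from the question).
clusters = [["a", "b"], ["c", "d", "e"], ["f"]]
known = ["c", "d", "f", "z"]

percentages = match_percentages(clusters, known)
# Index of the cluster with the highest share of known members.
best_cluster = max(range(len(clusters)), key=lambda i: percentages[i])
print(percentages, best_cluster)
```

If the percentage should instead be relative to the cluster size, the denominator would be `len(set(cluster))`; the question excerpt does not say which is intended.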

'numpy.float64' object is not iterable - meanshift clustering

Question: Python newbie here. I am trying to run this code but I get the error message that the object is not iterable. I would appreciate some advice on what I am doing wrong. Thanks.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
temp = pd.read_csv("file.csv", encoding='latin-1')
xy = temp.ix[:,2:6]
X = xy.values
X
array([[ nan, nan],
[ nan, nan],
[ 3.92144000e+00, nan],
[ 4.42382000e+00, nan],
[ 4.18931000e+00, 5.61562775e+02],
[ nan, nan],
[ 4.33025000e+00, 6.73123391e+02],
[

Kmeans clustering using jaccard distance matrix

Question: I'm trying to create a Jaccard distance matrix and perform k-means on it to output cluster IDs and the IDs of the elements in each cluster. The input is Twitter tweets. The following is the code; I couldn't understand how to use initial seeds from a file for k-means.
install.packages("rjson" ,dependencies=TRUE)
library("rjson")
install.packages("jsonlite" ,dependencies=TRUE)
library("jsonlite")
install.packages("stringdist" ,dependencies=TRUE)
library("stringdist")
data <- fromJSON
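
To make the first step concrete, here is a small sketch of a Jaccard distance matrix. It is in Python, not the R of the question, and it assumes tweets are compared as sets of lowercase whitespace-separated tokens. That tokenisation, the function names, and the sample tweets are all assumptions; the clustering step and the seed file are not covered.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:  # two empty texts are treated as identical
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def distance_matrix(texts):
    """Symmetric matrix of pairwise Jaccard distances."""
    n = len(texts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(texts[i], texts[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Hypothetical tweets (assumed, not from the question).
tweets = ["the cat sat", "the cat ran", "dogs bark loudly"]
matrix = distance_matrix(tweets)
print(matrix)
```

Standard k-means works on coordinates and means, so with a precomputed distance matrix like this a medoid-based variant, where each cluster centre is an actual data point, is usually the closer fit.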