cluster-analysis

Return the furthermost outlier in kmeans clustering? [closed]

好久不见. posted on 2019-12-13 09:10:25
Question: Is there an easy way to return the furthest outlier after sklearn k-means clustering? Essentially, I want to make a list of the biggest outliers for a number of clusters. Unfortunately, I need to use sklearn.cluster.KMeans because of the assignment. Answer 1: Sascha basically gives it away in the comments, but if X denotes
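One way to read "furthest outlier" is the point with the largest distance to its own cluster centre. The sketch below works under that assumption on plain Python lists. With scikit-learn, `labels` and `centers` would come from a fitted `KMeans` object's `labels_` and `cluster_centers_` attributes; the function name `furthest_per_cluster` and the toy data are hypothetical.

```python
import math

def furthest_per_cluster(points, labels, centers):
    """Return {cluster: (index, distance)} for the point furthest from its own centre."""
    worst = {}
    for idx, (point, label) in enumerate(zip(points, labels)):
        dist = math.dist(point, centers[label])
        if label not in worst or dist > worst[label][1]:
            worst[label] = (idx, dist)
    return worst

# Toy data: assumed output of a k-means fit with two clusters.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (10.0, 10.0), (10.0, 13.0)]
labels = [0, 0, 0, 1, 1]
centers = [(1.0, 0.0), (10.0, 11.0)]

outliers = furthest_per_cluster(points, labels, centers)
# Sort clusters by how far their worst point lies from the centre.
ranking = sorted(outliers.items(), key=lambda item: item[1][1], reverse=True)
```

Sorting the per-cluster results, as in `ranking`, gives the list of biggest outliers across clusters that the question asks for.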

Similar images: Bag of Features / Visual Word or matching descriptors?

不羁的心 posted on 2019-12-13 05:26:16
Question: I have an application where, given a reasonable number of images (say 20K) and a query image, I want to find the most similar one. A reasonable approximation is acceptable. To represent each image precisely, I'm using SIFT (a parallel version, for fast computation). Now, given the set of n SIFT descriptors (where usually 500 < n < 1000, depending on image size), which can be represented as an n x 128 matrix, from what I've seen in the literature there are two

Hadoop and NLTK: Fails with stopwords

你。 posted on 2019-12-13 02:26:16
Question: I'm trying to run a Python program on Hadoop. The program uses the NLTK library and the Hadoop Streaming API, as described here.
mapper.py:
    #!/usr/bin/env python
    import sys
    import nltk
    from nltk.corpus import stopwords
    #print stopwords.words('english')
    for line in sys.stdin:
        print line,
reducer.py:
    #!/usr/bin/env python
    import sys
    for line in sys.stdin:
        print line,
Console command:
    bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py

How to decide group assignments in Dirichlet process clustering

旧巷老猫 posted on 2019-12-13 01:43:26
Question: In Dirichlet clustering, the Dirichlet process can be represented by any of the following: the Chinese Restaurant Process, the Stick-Breaking Process, or the Pólya Urn Model. For instance, the Chinese Restaurant Process works as follows: Initially the restaurant is empty. The first person to enter (Alice) sits down at a table (selects a group). The second person to enter (Bob) sits down at a table. Which table does he sit at? He sits down at a new table with probability α/(1+α). He sits with
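The seating rule can be sketched as a short simulation. It assumes the standard Chinese Restaurant Process: with n customers already seated, a newcomer opens a new table with probability α/(n+α) and joins an existing table holding n_k customers with probability n_k/(n+α), which reduces to α/(1+α) for Bob. The function name, its parameters, and the seed are hypothetical choices for illustration.

```python
import random

def chinese_restaurant_process(num_customers, alpha, seed=0):
    """Simulate table (group) assignments under a CRP with concentration alpha > 0."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[i] = table index chosen by customer i
    for _ in range(num_customers):
        # Existing tables are weighted by their size, a new table by alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_sizes

assignments, table_sizes = chinese_restaurant_process(num_customers=20, alpha=1.0)
```

Each table index in `assignments` is the group a customer ends up in, so the number of groups is not fixed in advance and tends to grow with larger α.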

Content Based Image Retrieval (CBIR): Bag of Features or Descriptors Matching?

泄露秘密 posted on 2019-12-13 00:19:20
Question: I've read a lot of papers about the Nearest Neighbor problem, and it seems that indexing techniques such as randomized kd-trees or LSH, which can operate in high-dimensional spaces, have been used successfully for Content Based Image Retrieval (CBIR). One really common experiment is: given a SIFT query vector, find the most similar SIFT descriptor in the dataset. If we repeat the process for all the detected SIFT descriptors, we can find the most similar image. However, another popular approach is
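The descriptor-matching procedure in the question can be sketched as brute-force nearest-neighbour voting: each query descriptor votes for the image that owns its closest dataset descriptor. This is only an illustration under stated assumptions. A real system would replace the linear scan with an index such as randomized kd-trees or LSH, and the toy 3-dimensional vectors stand in for 128-dimensional SIFT descriptors. All names are hypothetical.

```python
import math
from collections import Counter

def most_similar_image(query_descriptors, dataset):
    """dataset maps image id -> list of descriptors; returns (best image id, votes)."""
    votes = Counter()
    for q in query_descriptors:
        best_image, best_dist = None, math.inf
        # Linear scan over every descriptor of every image.
        for image_id, descriptors in dataset.items():
            for d in descriptors:
                dist = math.dist(q, d)
                if dist < best_dist:
                    best_image, best_dist = image_id, dist
        votes[best_image] += 1
    return votes.most_common(1)[0][0], votes

dataset = {
    "img_a": [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
    "img_b": [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
}
query = [(0.1, 0.0, 0.9), (0.0, 0.9, 0.1), (5.5, 5.0, 5.1)]
best, votes = most_similar_image(query, dataset)
```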

AttributeError: 'list' object has no attribute 'lower' (in a lowercase dataframe)

只愿长相守 posted on 2019-12-12 18:16:36
Question: I don't understand this error. I've already converted df to lowercase before turning it into a list.
dataframe:
       all_cols
    0  who is your hero and why
    1  what do you do to relax
    2  this is a hero
    4  how many hours of sleep do you get a night
    5  describe the last time you were relax
Code:
    from sklearn.cluster import MeanShift
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    df['all_cols'] = df['all
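The error message itself says that a string method was called on a list. A minimal sketch follows, assuming (the excerpt is cut off before the cause is shown) that each document reaches the vectorizer as a list of tokens rather than a single string. TfidfVectorizer lowercases documents by default, which means calling `.lower()` on each one. The `documents` variable is hypothetical.

```python
documents = [["who", "is", "your", "hero"], ["this", "is", "a", "hero"]]

# Calling a str method on a list reproduces the message from the question.
try:
    documents[0].lower()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Joining each token list back into one string yields objects that have .lower().
joined = [" ".join(tokens) for tokens in documents]
lowered = [text.lower() for text in joined]
```

If this is the cause, lowercasing the dataframe beforehand does not help, because the problem is the type of each document and not its case.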

Clustering for Categorical and Numerical data

核能气质少年 posted on 2019-12-12 18:06:53
Question: I have a collection of alerts and I want to group them by similarity/distance. Since the data is partly non-numeric, how can I cluster it?
    set.seed(42)
    data.frame(Host1 = rep("del",10),
               Host2 = c(rep("cpp",4), rep("sscp",3), rep("portal",3)),
               Host3 = c(rep("web",5), rep("apache",3), rep("app",2)),
               Host4 = c(sample(3,8, replace = TRUE), rep("con",2)),
               Date1 = abs(round(1:10 + rnorm(10),2)))
       Host1 Host2 Host3 Host4 Date1
    1    del   cpp   web     3  1.40
    2    del   cpp   web     3  1.89
    3    del
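One commonly used option for mixed data is a Gower-style dissimilarity: a categorical column contributes 0 when two rows match and 1 otherwise, a numeric column contributes the absolute difference scaled by the column's range, and the contributions are averaged. This is offered as a suggestion and is not taken from the question or its answers. The rows below are hypothetical, loosely modelled on the alert data, and the function name is made up for the sketch.

```python
def gower_distance(row_a, row_b, numeric_ranges):
    """Average per-column dissimilarity; numeric_ranges maps column index -> value range."""
    total = 0.0
    for col, (a, b) in enumerate(zip(row_a, row_b)):
        if col in numeric_ranges:
            span = numeric_ranges[col]
            total += abs(a - b) / span if span else 0.0
        else:
            total += 0.0 if a == b else 1.0
    return total / len(row_a)

rows = [
    ("del", "cpp", "web", 1.0),
    ("del", "cpp", "apache", 2.0),
    ("del", "portal", "app", 5.0),
]
numeric_col = 3
values = [r[numeric_col] for r in rows]
ranges = {numeric_col: max(values) - min(values)}

# Pairwise dissimilarity matrix, usable by methods that accept precomputed distances.
matrix = [[gower_distance(a, b, ranges) for b in rows] for a in rows]
```

The resulting matrix can then be handed to a clustering method that works on precomputed dissimilarities, such as hierarchical clustering.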

How to change dendrogram labels in r

安稳与你 posted on 2019-12-12 08:03:14
Question: I have a dendrogram in R. It is based on hierarchical clustering using hclust. I am colouring labels that differ in different colours, but when I try to change the labels of my dendrogram (to the rows of the dataframe the cluster is based on) using dendrogram = dendrogram %>% set("labels", dataframe$column), the labels are replaced, but in the wrong positions. As an example, my dendrogram looks like this: ___|___ | _|_ | | | | 1 0 2 When I now try changing the labels as specified above, the

k-means clustering on term-term co-occurrence matrix

可紊 posted on 2019-12-12 06:31:57
Question: I derive a term-term co-occurrence matrix, K, from a document-term matrix in R. I am interested in carrying out a k-means clustering analysis on the keyword-by-keyword matrix K. The dimension of K is 8962 terms x 8962 terms. I pass K to the kmeans function as follows:
    for(i in 1:25){
      #Run kmeans for each level of i, allowing up to 100 iterations for convergence
      kmeans<- kmeans(x=K, centers=i, iter.max=100)
      #Combine cluster number and cost together, write to df
      cost_df<- rbind(cost_df, cbind(i
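The loop in the question records a clustering cost for each candidate number of clusters. The same idea can be sketched in Python with a small, deterministic Lloyd's-algorithm k-means. This is an illustrative stand-in and not R's kmeans, whose default algorithm is Hartigan-Wong. Seeding the centres with the first k points is an assumption made only to keep the run reproducible, and the toy points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, k, max_iter=100):
    """Lloyd's algorithm seeded with the first k points; returns total within-cluster SS."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))

points = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (10.0, 1.0)]
# One (k, cost) pair per candidate number of clusters, as in the R loop.
costs = [(k, kmeans_cost(points, k)) for k in range(1, len(points) + 1)]
```

Plotting cost against k and looking for the point where the decrease flattens is the usual way to read such a table.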

Extending Stargazer to multiwaycov

放肆的年华 posted on 2019-12-12 05:19:59
Question: I'm using stargazer to create regression outputs for my bachelor thesis. Due to the structure of my data, I have to use clustered models (code below). I'm using the vcovclust command from the multiwaycov package, which works perfectly. However, stargazer does not support it. Do you know another way to create outputs as nice as stargazer's? Or do you know another package/command for clustering the models that is supported by stargazer? model1.1.2 <- lm(leaflet ~ partisan + as.factor(gender)