cluster-analysis | 易学教程

ELKI - Use List<String> of objects to populate the Database

Question: Sorry for the naive question, but I got stuck while following all the tutorials I could find. Is there a way to populate a Database db from a simple List<String> rather than by reading a file? Basically, what I'm looking for is something similar to:

    List<String> objects = ...
    Database db = ClassGenericsUtil.parameterizeOrAbort(ArrayDatabase.class, params, objects);
    db.initialize();

Thanks in advance.

Answer 1: What are the contents of your Strings? Are they in the same format understood by the ELKI parsers? This

How do I automate the number of clusters? [duplicate]

Question: This question already has answers here: Cluster analysis in R: determine the optimal number of clusters (7 answers). Closed 10 months ago.

I've been playing with the script below:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    import textract
    import os

    folder_to_scan = '/media/sf_Documents/clustering'
    dict_of_docs = {}

    # Gets all the files to scan with textract
    for root, sub, files in os.walk

Unsupervised high dimension clustering

Question: I have a dataset of records where each record has 5 labels, and the importance of each label is different. I know how to order the labels by importance, but I don't know by how much they differ, so the difference between two records looks like: a*dist of label1 + b*dist of label2 + c*dist of label3, such that a + b + c = 1. The dataset contains around 3000 records and I want to cluster it (I don't know the number of clusters) in some way. I thought about DBSCAN but it is not really good with high
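The weighted distance described in the question can be sketched as follows. This is only an illustration under assumptions that are not in the original: the function name `weighted_distance`, the example weights, and the per-label distance (absolute difference of numeric labels) are all hypothetical placeholders.

```python
def weighted_distance(rec_a, rec_b, weights, label_dist=lambda x, y: abs(x - y)):
    """Weighted sum of per-label distances between two records.

    `label_dist` is an assumed per-label distance (absolute difference here);
    swap in whatever fits the actual label types.
    """
    if not (len(rec_a) == len(rec_b) == len(weights)):
        raise ValueError("records and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * label_dist(x, y) for w, x, y in zip(weights, rec_a, rec_b))


# Hypothetical weights a, b, c, ordered by importance and summing to 1.
weights = (0.5, 0.3, 0.2)
d = weighted_distance((1, 2, 3), (2, 2, 5), weights)
```

With about 3000 records, one option is to evaluate this function for every pair and hand the resulting distance matrix to an algorithm that accepts precomputed distances (scikit-learn's DBSCAN does, via `metric='precomputed'`).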

Louvain community detection in R using igraph - format of edges and vertices

Question: I have a correlation matrix of scores on which I would like to run community detection using the Louvain method in igraph, in R. I converted the correlation matrix to a distance matrix using cor2dist, as below:

    distancematrix <- cor2dist(correlationmatrix)

This gives a 400 x 400 matrix of distances ranging from 0 to 2. I then made the list of edges (the distances) and vertices (each of the 400 individuals) using the method below from http://kateto.net/networks-r-igraph (section 3.1).

    library(igraph)

Python number line cluster exercise

Question: I am working through an exercise in my textbook (Ex 4.7) and am implementing the code in Python to practice dynamic programming. I am having some trouble actually executing Algorithm 4.8. I understand what is going on until I get to 'Otherwise range s from 1 to t-1 and set s to minimize f(s)'. Why is the book using s in the for loop as well as setting it to the value that minimizes f(s)? How should one go about implementing that line in Python? [current code at bottom] My current code so far is this: x
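One common reading of "range s from 1 to t-1 and set s to minimize f(s)" is an argmin: try every candidate s in 1..t-1 and keep the one with the smallest f(s). The sketch below shows that reading only; the cost function `f` here is a made-up stand-in, since the textbook's Algorithm 4.8 and its f are not shown in the excerpt.

```python
def argmin_over_range(f, t):
    """Return the s in 1..t-1 with the smallest f(s) (first one on ties).

    Requires t >= 2; with an empty range, min() raises ValueError.
    """
    return min(range(1, t), key=f)


def f(s):
    # Hypothetical cost function, used only to make the example runnable.
    return (s - 3) ** 2 + 1


best_s = argmin_over_range(f, 6)  # candidates are s = 1, 2, 3, 4, 5

# Equivalent explicit loop, closer to the wording of the algorithm step.
loop_s, loop_cost = None, float("inf")
for s in range(1, 6):
    cost = f(s)
    if cost < loop_cost:
        loop_s, loop_cost = s, cost
```

In the loop version, s is the loop variable while the best candidate is stored under a separate name, which avoids the confusion of reusing s for both roles.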

Is it possible to use KDTree with cosine similarity?

Question: It looks like I can't use this similarity metric with sklearn's KDTree, for example, but I need it because I am measuring the similarity of word vectors. What is a fast, robust custom algorithm for this case? I know about Locality-Sensitive Hashing, but it needs a lot of tuning and testing to find good parameters.

Answer 1: The ranking you would get with cosine similarity is equivalent to the rank order of the Euclidean distance when you normalize all the data points first. So you can use a KD tree to the
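The equivalence stated in the answer follows from the identity ||u - v||^2 = 2 - 2*cos(u, v) for unit vectors u and v. The small pure-Python check below illustrates it on made-up 2-D points; the data and helper names are assumptions for the example, not part of the original answer.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up example data.
points = [[1.0, 0.0], [3.0, 3.0], [0.0, 2.0], [-1.0, 0.5]]
query = [2.0, 1.0]

# Rank neighbours by descending cosine similarity on the raw vectors.
by_cosine = sorted(range(len(points)),
                   key=lambda i: -cosine_similarity(query, points[i]))

# Rank neighbours by ascending Euclidean distance on the normalized vectors.
unit_query = normalize(query)
unit_points = [normalize(p) for p in points]
by_euclidean = sorted(range(len(points)),
                      key=lambda i: euclidean(unit_query, unit_points[i]))
```

Both orderings come out the same, so a KD tree built over the normalized vectors with the ordinary Euclidean metric should return neighbours in cosine-similarity order.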

Bag of feature: how to create the query histogram?

Question: I'm trying to implement the Bag of Features model. Given a descriptor matrix (representing an image) that belongs to the initial dataset, computing its histogram is easy, since we already know from k-means which cluster each descriptor vector belongs to. But what if we want to compute the histogram of a query matrix? The only solution that comes to mind is to compute the distance from each descriptor vector to each of the k cluster centroids. This can be inefficient:
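The brute-force approach the question describes can be sketched as below: each query descriptor is assigned to its nearest centroid, and the assignments are counted. The function names and the tiny 2-D data are made up for illustration; real descriptors would be much higher-dimensional.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; enough for finding the nearest centroid."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query_histogram(descriptors, centroids):
    """Count how many descriptors fall nearest to each centroid."""
    histogram = [0] * len(centroids)
    for d in descriptors:
        nearest = min(range(len(centroids)),
                      key=lambda j: squared_distance(d, centroids[j]))
        histogram[nearest] += 1
    return histogram


# Made-up centroids (as if produced by k-means) and query descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
descriptors = [[1.0, 1.0], [9.0, 11.0], [0.5, 9.0], [-1.0, 0.0], [8.0, 8.0]]
histogram = query_histogram(descriptors, centroids)
```

This costs one distance computation per descriptor per centroid, which is the inefficiency the question is worried about. A spatial index over the centroids is one possible way to reduce the number of comparisons.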