cluster-analysis

Clustering by distance in R

南楼画角 submitted on 2019-12-07 15:24:22
Question: I have a vector of integers which I wish to divide into clusters so that the distance between any two clusters is greater than a lower bound, and within any cluster the distance between two elements is less than an upper bound. For example, suppose we have the following vector: 1, 4, 5, 6, 9, 29, 32, 36. If we set the aforementioned lower and upper bounds to 19 and 9 respectively, the two vectors below would be one possible result: 1, 4, 5, 6, 9 and 29, 32, 36. Thanks to @flodel's comments, I…
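The question asks about R, but the idea is language-independent; below is a minimal Python sketch of one greedy approach: sort, then cut wherever the gap between consecutive values exceeds the between-cluster lower bound. The function name and the assert on the upper bound are illustrative assumptions, and the greedy split is not guaranteed to satisfy both constraints for every input.

    def split_by_gap(values, lower_bound, upper_bound):
        vals = sorted(values)
        clusters, current = [], [vals[0]]
        for v in vals[1:]:
            if v - current[-1] > lower_bound:   # gap exceeds the between-cluster bound
                clusters.append(current)
                current = [v]
            else:
                current.append(v)
        clusters.append(current)
        # check the within-cluster spread against the upper bound
        assert all(c[-1] - c[0] < upper_bound for c in clusters)
        return clusters

    print(split_by_gap([1, 4, 5, 6, 9, 29, 32, 36], lower_bound=19, upper_bound=9))
    # [[1, 4, 5, 6, 9], [29, 32, 36]]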

Using StringToWordVector in Weka with internal data structures

只谈情不闲聊 submitted on 2019-12-07 14:52:06
Question: I am trying to obtain document clustering using Weka. The process is part of a larger pipeline, and I really can't afford to write out arff files. I have all the documents and the bag of words in each document as a Map<String, Multiset<String>> structure, where the keys are document names and the Multiset<String> values are the bags of words in the documents. I have two questions, really: (1) the current approach ends up clustering terms, not documents: public final Instances…
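The question targets Weka's Java API, which is not reproduced here; as a hedged analogue in Python with scikit-learn (a swapped-in toolkit, not Weka), the underlying pitfall is orientation: rows must be documents and columns words, otherwise you cluster terms instead of documents. The documents below are made-up stand-ins for the Map<String, Multiset<String>> structure.

    from collections import Counter
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction import DictVectorizer

    # made-up stand-in for the Map<String, Multiset<String>> structure
    docs = {
        "doc1": Counter(["apple", "apple", "pie"]),
        "doc2": Counter(["apple", "cider", "pie"]),
        "doc3": Counter(["car", "engine", "engine"]),
    }

    vec = DictVectorizer()                    # one row per document, one column per word
    X = vec.fit_transform(docs.values())
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(dict(zip(docs, labels)))            # documents, not terms, receive labels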

Clustering Time Series Data of Different Length

人走茶凉 submitted on 2019-12-07 13:46:57
Question: I have time-series data where the series have different lengths. I want to cluster based on DTW distance but could not find any library for it: sklearn gives an outright error, while tslearn's kmeans gave a wrong answer. My problem is solved if I pad the series with zeros, but I am not sure whether zero-padding time series is correct for clustering. Suggestions about other clustering techniques for time-series data are also welcome.
    max_length = 0
    for i in train_1:
        if len(i) > max_length:
            max_length = len(i)
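A hedged sketch of a padding-free route, assuming tslearn is installed: to_time_series_dataset pads ragged input with NaN rather than zeros, and TimeSeriesKMeans with metric="dtw" is meant to handle variable-length series. The sine-wave data is an illustrative stand-in for train_1.

    import numpy as np
    from tslearn.clustering import TimeSeriesKMeans
    from tslearn.utils import to_time_series_dataset

    # illustrative stand-in for train_1: four series of different lengths
    series = [np.sin(np.linspace(0, 6, n)) for n in (40, 55, 70, 90)]

    X = to_time_series_dataset(series)        # (n_series, max_len, 1), NaN-padded
    model = TimeSeriesKMeans(n_clusters=2, metric="dtw", random_state=0)
    print(model.fit_predict(X))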

Trouble with scipy kmeans and kmeans2 clustering in Python

别说谁变了你拦得住时间么 submitted on 2019-12-07 07:47:49
Question: I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points and I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs kmeans2. Can you explain why this is? My code is below. First I load my data and plot the coordinates; it all looks correct.
    import pandas as pd, numpy as np, matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans, kmeans2, whiten
    df = pd.read_csv('data.csv')
    df.head()
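A minimal sketch of how the two calls differ (random coordinates stand in for the 1700 lat-long points, which are not shown; the seed keyword assumes a reasonably recent SciPy): kmeans returns only a codebook and expects whitened input, so labels come from a separate vq step, while kmeans2 returns labels directly and uses a different, configurable initialization — one common source of diverging results.

    import numpy as np
    from scipy.cluster.vq import kmeans, kmeans2, vq, whiten

    rng = np.random.default_rng(0)
    coords = rng.normal(size=(1700, 2))       # stand-in for the lat-long points

    w = whiten(coords)                        # kmeans expects unit-variance columns
    codebook, distortion = kmeans(w, 100)     # returns centroids only...
    labels1, _ = vq(w, codebook)              # ...so labels need a separate vq step

    centroids, labels2 = kmeans2(w, 100, minit="points", seed=0)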

In R, is there an algorithm to create approximately equal sized clusters

女生的网名这么多〃 submitted on 2019-12-07 07:43:19
Question: There seems to be a lot of information about creating either hierarchical or k-means clusters, but I would like to know whether there is a solution in R that would create K clusters of approximately equal sizes. There is some material on doing this in other languages, but I have not been able to find anything online that suggests how to achieve the result in R. An example would be:
    set.seed(123)
    df <- matrix(rnorm(100*5), nrow=100)
    km <- kmeans(df, 10)
    print…
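A hedged sketch (in Python rather than R) of one common heuristic: run ordinary k-means for the centroids, then refill clusters under a hard capacity of ceil(n/k), assigning the most "decided" points first. The capacity rule and ordering are illustrative choices, not a standard algorithm.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(123)
    X = rng.normal(size=(100, 5))             # mirrors the rnorm example
    k = 10

    centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k)
    capacity = np.full(k, int(np.ceil(len(X) / k)))
    labels = np.full(len(X), -1)

    # assign points with the largest gap between best and second-best
    # centroid first, each to its nearest centroid with room left
    gap = np.partition(d, 1, axis=1)[:, 1] - d.min(axis=1)
    for i in np.argsort(gap)[::-1]:
        for c in np.argsort(d[i]):
            if capacity[c] > 0:
                labels[i] = c
                capacity[c] -= 1
                break

    print(np.bincount(labels))                # ten clusters of size 10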

Generate random points distributed like cities?

折月煮酒 submitted on 2019-12-07 06:40:27
Question: How can one generate, say, 1000 random points with a distribution like that of towns and cities in e.g. Ohio? I'm afraid I can't define "distributed like cities" precisely; uniformly distributed centres plus small Gaussian clouds are easy but ad hoc. Added: there must be a family of 2d distributions with a clustering parameter that can be varied to match a given set of points? Answer 1: Maybe you can take a look at Walter Christaller's Theory of Central Places. I guess there must be some generator…
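One standard family with exactly such a clustering knob is the Thomas cluster process from spatial statistics: uniform parent points, offspring scattered around a random parent with Gaussian spread sigma. The sketch below is a fixed-total variant (the textbook process draws Poisson-many offspring per parent); the function name and parameter values are illustrative.

    import numpy as np

    def clustered_points(n_points, n_parents, sigma, rng=None):
        # fixed-total variant of a Thomas cluster process: uniform parents,
        # offspring scattered around a random parent with spread sigma
        rng = rng or np.random.default_rng()
        parents = rng.uniform(0.0, 1.0, size=(n_parents, 2))
        which = rng.integers(0, n_parents, size=n_points)
        return parents[which] + rng.normal(scale=sigma, size=(n_points, 2))

    pts = clustered_points(1000, n_parents=25, sigma=0.02)   # small sigma = tight towns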

Correlating word proximity

♀尐吖头ヾ submitted on 2019-12-07 05:36:16
Question: Let's say I have a text transcript of a dialogue over a period of approximately one hour. I want to know which words occur in close proximity to one another. What type of statistical technique would I use to determine which words cluster together, and how close their proximity to one another is? I suspect some sort of cluster analysis or PCA. Answer 1: To determine word proximity, you will have to build a graph: each word is a vertex (or "node"), and left and right words are edges. So "I like…
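A minimal sketch of the answer's graph idea: each word is a node, and words appearing within a small window of each other share a weighted edge. The window size and function name are illustrative.

    from collections import Counter

    def cooccurrence_edges(tokens, window=2):
        # weighted edges between words appearing within `window` positions
        edges = Counter()
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                edges[tuple(sorted((w, tokens[j])))] += 1
        return edges

    print(cooccurrence_edges("i like cats and i like dogs".split()))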

Python: DBSCAN in 3 dimensional space

蹲街弑〆低调 submitted on 2019-12-07 02:18:27
问题 I have been searching around for an implementation of DBSCAN for 3 dimensional points without much luck. Does anyone know I library that handles this or has any experience with doing this? I am assuming that the DBSCAN algorithm can handle 3 dimensions, by having the e value be a radius metric and the distance between points measured by euclidean separation. If anyone has tried implementing this and would like to share that would also be greatly appreciated, thanks. 回答1: You can use sklearn
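As the answer suggests, sklearn's DBSCAN works on points of any dimension; with the default Euclidean metric, eps is simply a radius in 3D. A minimal sketch on two synthetic 3D blobs:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 3))
                     for c in ([0, 0, 0], [1, 1, 1])])   # two blobs in 3D

    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(pts)
    print(np.unique(labels))                  # cluster ids; -1 would mean noise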

Clustering and Bayes classifiers Matlab

☆樱花仙子☆ submitted on 2019-12-06 21:55:34
Question: So I am at a crossroads on what to do next. I set out to learn and apply some machine learning algorithms on a complicated dataset, and I have now done this. My plan from the very beginning was to combine two possible classifiers in an attempt to make a multi-classification system. But here is where I am stuck. I chose a clustering algorithm, Fuzzy C-Means (after learning some sample k-means material), and Naive Bayes as the two candidates for the MCS (Multi-Classifier System). I can use both…
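A hedged sketch (in Python rather than Matlab, and with k-means distances approximating fuzzy memberships rather than true Fuzzy C-Means) of one way such a combination is often wired up: feed soft cluster memberships to Naive Bayes as extra features. The dataset and all parameter values are illustrative stand-ins.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=300, n_features=6, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Xtr)

    def memberships(Z):
        # crude fuzzy-style memberships from inverse squared centroid distances
        w = 1.0 / (km.transform(Z) ** 2 + 1e-9)
        return w / w.sum(axis=1, keepdims=True)

    nb = GaussianNB().fit(np.hstack([Xtr, memberships(Xtr)]), ytr)
    print(nb.score(np.hstack([Xte, memberships(Xte)]), yte))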

Grid search for hyperparameter evaluation of clustering in scikit-learn

你。 submitted on 2019-12-06 18:20:11
Question: I'm clustering a sample of about 100 records (unlabelled) and trying to use grid search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score, which works fine. My problem here is that I don't need the cross-validation aspect of GridSearchCV / RandomizedSearchCV, but I can't find a simple GridSearch / RandomizedSearch. I could write my own, but the ParameterSampler and ParameterGrid objects are very useful. My next step will be to…
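A minimal sketch of that "write my own" route: ParameterGrid enumerates the grid without any cross-validation, so a plain loop scored by silhouette_score replaces GridSearchCV. DBSCAN and the grid values are illustrative; note this treats DBSCAN's -1 noise label as an ordinary cluster, a simplification.

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score
    from sklearn.model_selection import ParameterGrid

    X, _ = make_blobs(n_samples=100, random_state=0)
    grid = ParameterGrid({"eps": [0.3, 0.5, 1.0], "min_samples": [3, 5, 10]})

    best_score, best_params = -1.0, None
    for params in grid:
        labels = DBSCAN(**params).fit_predict(X)
        if len(set(labels)) < 2:              # silhouette needs at least 2 labels
            continue
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_params = score, params
    print(best_params, best_score)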