cluster-analysis

How to get the largest possible column sequence with the least possible row NAs from a huge matrix?

自闭症网瘾萝莉.ら submitted on 2019-12-06 04:12:56
I want to select columns from a data frame so that the resulting contiguous column sequences are as long as possible, while the number of rows containing NAs is as small as possible, because those rows have to be dropped afterwards. (The reason I want to do this: I want to run TraMineR::seqsubm() to automatically get a matrix of transition costs (by transition probability) and later run cluster::agnes() on it. TraMineR::seqsubm() does not accept NA states, and running cluster::agnes() on a matrix with NA states does not make much sense.) For that purpose I have already written a working function that …
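The trade-off described above (wide contiguous column window vs. few NA rows to drop) can be sketched as a brute-force search over all contiguous column windows. This is a minimal illustrative sketch in Python/numpy, not the asker's R function; the helper name `best_column_window` and the width-times-complete-rows score are assumptions, and for a genuinely huge matrix one would want something smarter than this O(n²)-windows scan.

```python
import numpy as np

def best_column_window(mat, min_width=2):
    """For every contiguous column window, count the rows that are entirely
    free of NaN, and return the window maximizing width * complete_rows
    (a simple stand-in for the 'long window, few dropped rows' trade-off)."""
    n_rows, n_cols = mat.shape
    best = None  # (score, start, stop, complete_row_mask)
    for start in range(n_cols):
        for stop in range(start + min_width, n_cols + 1):
            ok = ~np.isnan(mat[:, start:stop]).any(axis=1)
            score = (stop - start) * ok.sum()
            if best is None or score > best[0]:
                best = (score, start, stop, ok)
    return best

m = np.array([
    [1.0,    2.0, np.nan, 4.0],
    [1.0,    2.0, 3.0,    4.0],
    [np.nan, 2.0, 3.0,    4.0],
    [1.0,    2.0, 3.0,    np.nan],
])
score, start, stop, keep = best_column_window(m)
# columns start..stop-1 form the chosen window; rows where keep is False are dropped
```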

How to programmatically determine the column indices of principal components using FactoMineR package?

ぃ、小莉子 submitted on 2019-12-06 03:40:17
Question: Given a data frame containing mixed variables (i.e. both categorical and continuous), like

```r
digits = 0:9

# set seed for reproducibility
set.seed(17)

# function to create random strings
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID = c(1:10), name = sample(letters[1:10]),
                 studLoc = sample(createRandString(10)),
                 finalmark = sample(c(0:100), 10), …
```
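For the continuous part of such data, "which original columns drive which principal component" can be read off the loadings matrix. This is a hedged numpy sketch of that idea, not FactoMineR's API (FactoMineR/FAMD handles the categorical variables too, which this sketch does not); the correlated-columns setup is an invented toy example.

```python
import numpy as np

rng = np.random.default_rng(17)
X = rng.normal(size=(100, 4))
X[:, 1] = 3.0 * X[:, 0] + 0.1 * X[:, 1]   # make column 1 dominate the variance

Xc = X - X.mean(axis=0)
# SVD of the centered data: rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T   # shape (n_features, n_components); rows index original columns

# index of the original column contributing most to each component
top_cols = np.abs(loadings).argmax(axis=0)
```

Here `top_cols[0]` recovers the column with the largest share of the first component's variance, which is the programmatic "column index of the principal component" the question asks about.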

How to create a binary matrix of inventory per row? (R)

寵の児 submitted on 2019-12-06 02:55:09
Question: I have a data frame of 9 columns consisting of an inventory of factors. Each row can have all 9 columns filled (i.e. that row holds 9 "things"), but most don't (most hold 3-4). The columns aren't position-specific either: if item 200 shows up in columns 1 and 3, it's the same item. I'd like to create a binary matrix with one row per original row and one column per factor. Example (shortened to 4 columns just to get the point across):

R1   3   4   5   8
R2   4   6   7  NA
R3   1   5  NA  NA
R4   2   6   8   9

Should turn into 1 2 …
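The question is about R, but the reshape itself is a generic "melt, then one-hot" operation. A sketch in Python/pandas using the example rows above (the same idea translates to R's table()/tidyr approaches):

```python
import numpy as np
import pandas as pd

# the 4-column example from the question; NA marks empty slots
inv = pd.DataFrame({
    "c1": [3, 4, 1, 2],
    "c2": [4, 6, 5, 6],
    "c3": [5, 7, np.nan, 8],
    "c4": [8, np.nan, np.nan, 9],
}, index=["R1", "R2", "R3", "R4"])

# stack() melts all columns into one long series and drops the NAs;
# position no longer matters, only which items a row contains
long = inv.stack().astype(int)

# one-hot encode the items, then collapse back to one row per original row
binary = pd.get_dummies(long).groupby(level=0).max().astype(int)
```

`binary` has rows R1..R4 and one 0/1 column per item value, regardless of which original column the item appeared in.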

Calculating a Voronoi diagram for planes in 3D

放肆的年华 submitted on 2019-12-06 02:26:12
Is there a code/library that can calculate a Voronoi diagram for planes (parallelograms) in 3D? I checked Qhull and it seems it can only work with points; in its examples, Voro++ works with spheres of different sizes, but I couldn't find anything for polygons. In this image (sample planes in 3D) the parallelograms are drawn with a thickness, but in this case the thickness will be zero. [image]

Voronoi cells are not parallelograms. You are confused here by the image you posted. Voronoi cell borders are parts of the hyperplanes that separate the individual means. Check out this website …
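The separating-hyperplane property mentioned in the answer is easy to verify with scipy's Qhull wrapper for point sites (for planar sites, a common workaround is to sample many points on each plane and compute the point Voronoi diagram of the samples; this sketch only shows the point case).

```python
import numpy as np
from scipy.spatial import Voronoi

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0], [1.0, -2.0]])
vor = Voronoi(pts)

# each ridge separates exactly two input sites; every finite ridge vertex
# lies on the perpendicular bisector (separating hyperplane) of that pair,
# i.e. it is equidistant from both sites
for (p, q), verts in zip(vor.ridge_points, vor.ridge_vertices):
    for vi in verts:
        if vi == -1:          # -1 marks a vertex at infinity
            continue
        v = vor.vertices[vi]
        d_p = np.linalg.norm(v - pts[p])
        d_q = np.linalg.norm(v - pts[q])
        # d_p == d_q up to floating-point error
```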

Cluster unseen points using Spectral Clustering

僤鯓⒐⒋嵵緔 submitted on 2019-12-06 01:23:40
I am using the spectral clustering method to cluster my data. The implementation seems to work properly. However, I have one problem: I have a set of unseen points (not present in the training set) and would like to cluster them based on the centroids derived by k-means (step 5 in the paper). However, k-means is computed on the k eigenvectors, so the centroids are low-dimensional. Does anyone know a method that can be used to map an unseen point into that low-dimensional space and compute the distance between the projected point and the centroids derived by k-means in step 5? Late answer, but …
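One standard answer to this out-of-sample problem is a Nyström-style extension: compute the new point's affinities to the training points, apply the same normalization, and project onto the training eigenvectors (dividing by the eigenvalues). The sketch below is a minimal, self-contained illustration of that idea on a toy 1-D dataset; `rbf` and `embed_new` are illustrative names, and the normalization of the new point's affinity vector is an approximation, not a quote of any particular paper's formula.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# toy training set: two well-separated 1-D blobs
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# normalized affinity matrix D^{-1/2} K D^{-1/2} (Ng-Jordan-Weiss style)
K = np.array([[rbf(x, y) for y in X] for x in X])
d = K.sum(axis=1)
L = K / np.sqrt(np.outer(d, d))
vals, vecs = np.linalg.eigh(L)          # ascending eigenvalues
U, lam = vecs[:, -2:], vals[-2:]        # top-2 eigenvectors = spectral embedding

def embed_new(x):
    # Nystrom-style extension: affinity of x to the training points,
    # normalized analogously, then projected onto the eigenvectors
    k = np.array([rbf(x, xi) for xi in X])
    k = k / np.sqrt(k.sum() * d)
    return (k @ U) / lam

z = embed_new(np.array([0.05]))
```

`z` lands near the embedding of the first blob, so distances from `z` to the k-means centroids in the spectral space are now well-defined.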

Using StringToWordVector in Weka with internal data structures

我的未来我决定 submitted on 2019-12-05 20:58:45
I am trying to obtain document clustering using Weka. The process is part of a larger pipeline, and I really can't afford to write out ARFF files. I have all the documents and the bag of words in each document as a Map&lt;String, Multiset&lt;String&gt;&gt; structure, where the keys are document names and the Multiset&lt;String&gt; values are the bags of words in the documents. I have two questions, really: (1) my current approach ends up clustering terms, not documents:

```java
public final Instances buildDocumentInstances(TreeMap<String, Multiset<String>> docToTermsMap, String encoding) throws IOException {
    int …
```
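The "clustering terms, not documents" symptom is usually an orientation bug: the matrix handed to the clusterer has one row per term instead of one row per document. A Python stand-in for the Java/Weka structure makes the correct orientation concrete (the toy documents are invented; in Weka terms, each document must become one Instance and each term one Attribute):

```python
import numpy as np
from collections import Counter

# stand-in for Map<String, Multiset<String>>: document name -> bag of words
docs = {
    "doc1": Counter({"apple": 2, "banana": 1}),
    "doc2": Counter({"banana": 3, "cherry": 1}),
    "doc3": Counter({"apple": 1, "cherry": 2}),
}

vocab = sorted({t for bag in docs.values() for t in bag})
names = sorted(docs)

# one row per DOCUMENT, one column per term; a clusterer run on X groups
# documents. Transposing X is exactly the bug that clusters terms instead.
X = np.array([[docs[d][t] for t in vocab] for d in names])
```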

Clustering Time Series Data of Different Length

天涯浪子 submitted on 2019-12-05 18:38:33
I have time series data with series of different lengths. I want to cluster based on DTW distance but could not find any library for it. sklearn gives an outright error, while tslearn's k-means gave a wrong answer. My problem is solved if I pad with zeros, but I am not sure whether zero-padding time series is correct when clustering. Suggestions about other clustering techniques for time series data are welcome.

```python
max_length = 0
for i in train_1:
    if len(i) > max_length:
        max_length = len(i)
print(max_length)

train_1 = sequence.pad_sequences(train_1, maxlen=max_length)
km3 = TimeSeriesKMeans(n …
```
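DTW itself needs no padding: it aligns series of different lengths directly, which is exactly why zero-padding (which makes the trailing zeros look like real observations) distorts the distances. A self-contained sketch of the classic dynamic-programming DTW on unequal-length series (the O(n·m) textbook recurrence, not any particular library's implementation):

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-time-warping distance with absolute-difference cost.
    Works on series of different lengths, so no zero-padding is needed."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of insertion, deletion, and match moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

s1 = [1.0, 2.0, 3.0]
s2 = [1.0, 1.0, 2.0, 2.0, 3.0]   # same shape as s1, different length
s3 = [5.0, 6.0, 7.0]
# dtw(s1, s2) is 0: the warping absorbs the length difference entirely
```

If tslearn is used, feeding the raw ragged list through its dataset helper (which pads with NaN rather than zeros) together with `metric="dtw"` avoids the zero-padding distortion.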

Affinity Propagation preferences initialization

爷,独闯天下 submitted on 2019-12-05 16:57:48
Question: I need to perform clustering without knowing the number of clusters in advance. The number of clusters may be from 1 to 5, since I may find cases where all the samples belong to the same group, or to a limited number of groups. I thought affinity propagation could be my choice, since I can control the number of clusters by setting the preference parameter. However, if I have a single artificially generated cluster and I set the preference to the minimal euclidean distance among nodes (to …
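The direction of the preference effect can be demonstrated directly with scikit-learn: a strongly negative preference discourages points from becoming exemplars and drives the cluster count down, while a milder one lets more clusters form. A hedged sketch on an invented single-blob dataset (the specific preference values and damping are assumptions, tuned only for this toy data):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.RandomState(0)
X = rng.randn(30, 2)  # one artificially generated blob: a single "true" cluster

# very negative preference -> few exemplars; milder -> more exemplars
low = AffinityPropagation(preference=-50, damping=0.9,
                          max_iter=1000, random_state=0).fit(X)
high = AffinityPropagation(preference=-1, damping=0.9,
                           max_iter=1000, random_state=0).fit(X)

n_low = len(low.cluster_centers_indices_)
n_high = len(high.cluster_centers_indices_)
```

Note that the default preference is the median similarity, not the minimum; setting it from pairwise distances as the question describes usually needs the negated squared distances that `AffinityPropagation` uses internally as similarities.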

drawing heatmap with dendrogram along with sample labels

大憨熊 submitted on 2019-12-05 15:13:52
Using the heatplot function of made4, I made this heatmap dendrogram from the example file:

```r
data(khan)
heatplot(khan$train[1:30, ], lowcol = "blue", highcol = "red")
```

How can I add a panel of labels for the samples on the edges of the heatmap, as in this figure? The labels in this case are the colored squares adjacent to the heatmap's first column and top row, used to denote a label for each sample so that one can see whether the labels correspond to the clustering shown by the heatmap/dendrogram. In that particular plot the labels were chosen to correspond exactly to the colors of the dendrogram …

Trouble with scipy kmeans and kmeans2 clustering in Python

混江龙づ霸主 submitted on 2019-12-05 12:33:20
I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points. I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs. kmeans2. Can you explain why this is? My code is below. First I load my data and plot the coordinates; it all looks correct.

```python
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, kmeans2, whiten

df = pd.read_csv('data.csv')
df.head()
coordinates = df.as_matrix(columns=['lon', 'lat'])
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter …
```
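Two things commonly explain the divergence: scipy's vq routines expect whitened (unit-variance) features, and the two functions are different algorithms — `kmeans` runs several restarts and returns the codebook with the lowest distortion, while `kmeans2` is a single run from a chosen initialization. A hedged sketch on synthetic blobs (the three-blob data is invented; `data.csv` stays untouched):

```python
import numpy as np
from scipy.cluster.vq import kmeans, kmeans2, whiten

rng = np.random.RandomState(42)
pts = np.vstack([rng.randn(100, 2) + c for c in ([0, 0], [10, 10], [0, 10])])

# whiten rescales each feature to unit variance, as the vq docs require;
# skipping this step is a common source of "drastically different results"
w = whiten(pts)

# kmeans: multiple restarts, returns (codebook, mean distortion)
centroids1, distortion = kmeans(w, 3, seed=42)

# kmeans2: one run; minit='++' uses k-means++ seeding and also returns labels
centroids2, labels = kmeans2(w, 3, seed=42, minit='++')
```

With comparable initialization (e.g. `minit='++'` and a fixed seed) and whitened input, the two should agree far more closely on well-separated data.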