cluster-analysis

What is the difference between a Confusion Matrix and Contingency Table?

Submitted by 自作多情 on 2019-12-04 09:11:13
Question: I'm writing a piece of code to evaluate my clustering algorithm, and I find that every evaluation method needs the basic data from an m*n matrix like A = {aij}, where aij is the number of data points that are members of class ci and elements of cluster kj. But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the Confusion Matrix, the other is the Contingency Table. I do not fully understand the difference between the two. Which
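In scikit-learn such an m*n matrix A = {aij} can be computed directly with contingency_matrix; a minimal sketch with made-up class and cluster labels (this does not settle the Confusion Matrix vs. Contingency Table terminology in Tan et al., it only shows how to build the underlying counts):

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# hypothetical labels: true classes c_i and predicted clusters k_j
classes = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([1, 1, 0, 0, 2, 0])

# A[i, j] = number of data points in class i assigned to cluster j
A = contingency_matrix(classes, clusters)
print(A)
```

Every external evaluation measure (purity, NMI, Rand index, ...) can be derived from this table of counts.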

Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?

Submitted by 与世无争的帅哥 on 2019-12-04 08:32:00
Question: Here's my scenario. Consider a set of events that happen at various places and times; as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purposes, lightning strikes are instantaneous and can only hit certain locations (such as tall buildings). Also imagine each lightning strike has a unique id, so one can reference the strike later. There are about 100,000 such locations in this city (as you guess, this is an analogy, as my current employer is

Algorithm for clustering with minimum size constraints

Submitted by 大兔子大兔子 on 2019-12-04 08:05:44
I have a set of data clustered into k groups, where each cluster has a minimum-size constraint of m. I've done some reclustering of the data, so now I have a set of points, each of which has one or more better clusters it could be in, but which cannot be switched individually because that would violate the size constraint. Goal: minimize the sum of distances from each point to its cluster center. Subject to: minimum cluster size m. I want to find an algorithm that reassigns all points without violating the constraint, while being guaranteed to decrease the objective. I thought of using a graph to represent pairwise
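One way to decrease the objective without ever breaking the size constraint is to restrict moves to pairwise swaps: exchanging two points between clusters leaves every cluster size unchanged, so the minimum-size constraint m can never be violated. A greedy sketch, under the simplifying assumption that cluster centers stay fixed while swapping (in practice you would recompute centers and repeat until nothing improves):

```python
import numpy as np

def improve_by_swaps(X, labels, centers):
    """Greedily swap pairs of points between clusters. A swap preserves all
    cluster sizes, and is accepted only if it strictly lowers the total
    point-to-center distance, so the objective decreases monotonically."""
    n = len(X)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                a, b = labels[i], labels[j]
                if a == b:
                    continue
                cur = np.linalg.norm(X[i] - centers[a]) + np.linalg.norm(X[j] - centers[b])
                new = np.linalg.norm(X[i] - centers[b]) + np.linalg.norm(X[j] - centers[a])
                if new < cur - 1e-12:  # strict improvement only
                    labels[i], labels[j] = b, a
                    improved = True
    return labels

# tiny illustration: points 2 and 3 sit in each other's better cluster
X = np.array([[0.0], [10.0], [0.1], [9.9]])
labels = np.array([0, 1, 1, 0])
centers = np.array([[0.0], [10.0]])
print(improve_by_swaps(X, labels, centers))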

Better text documents clustering than tf/idf and cosine similarity?

Submitted by 血红的双手。 on 2019-12-04 07:53:15
Question: I'm trying to cluster the Twitter stream. I want to assign each tweet to a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad. The main disadvantage of using tf/idf is that it clusters documents that are keyword-similar, so it's only good at identifying near-identical documents. For example, consider the following sentences: 1- The website Stackoverflow is a nice place
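The keyword-overlap limitation described above is easy to reproduce; a small sketch with made-up sentences (the documents are illustrative only): a near-duplicate scores high, while a paraphrase on the same topic with no shared words scores zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The website Stackoverflow is a nice place to ask questions",
    "Stackoverflow is a nice website",          # shares keywords with doc 0
    "Developers get programming help online",   # same topic, no shared words
]
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)
print(sim[0, 1], sim[0, 2])  # near-duplicate vs. paraphrase
```

This is why topic models (e.g. LDA) or dense embeddings are usually suggested over raw tf/idf for topical clustering.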

Plot dendrogram using sklearn.AgglomerativeClustering

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-04 07:48:59
Question: I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster, since the agglomerative clustering provided in scipy lacks some options that are important to me (such as the option to specify the number of clusters). I would be really grateful for any advice.

from sklearn import cluster
clstr = cluster.AgglomerativeClustering(n_clusters=2)
clstr.children_  # only available after calling clstr.fit(X)

Answer 1: Here is a simple function for taking
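A commonly used workaround is to rebuild a scipy-style linkage matrix from the fitted model and hand it to scipy's dendrogram. A sketch, assuming scikit-learn >= 0.22 so that fitting with distance_threshold set populates model.distances_ (the sample data X is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def linkage_from_model(model):
    """Rebuild a scipy linkage matrix [child1, child2, distance, count]
    from a fitted AgglomerativeClustering model (requires model.distances_)."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        c = 0
        for child in merge:
            # indices < n_samples are leaves; others are earlier merges
            c += 1 if child < n_samples else counts[child - n_samples]
        counts[i] = c
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
# distance_threshold=0 with n_clusters=None makes sklearn record merge distances
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)
Z = linkage_from_model(model)
dendrogram(Z, no_plot=True)  # drop no_plot=True to actually draw the tree
```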

How to programmatically determine the column indices of principal components using FactoMineR package?

Submitted by こ雲淡風輕ζ on 2019-12-04 07:29:49
Given a data frame containing mixed variables (i.e. both categorical and continuous) like

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random strings
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID = c(1:10),
                 name = sample(letters[1:10]),
                 studLoc = sample(createRandString(10)),
                 finalmark = sample(c(0:100), 10),
                 subj1mark = sample(c(0:100), 10),
                 subj2mark = sample(c(0:100), 10))

I perform unsupervised feature selection

how to do clustering when the shape of data is (x,y,z)?

Submitted by 落爺英雄遲暮 on 2019-12-04 07:05:48
Question: Suppose I have 10 individual observations, each of size (125, 59). I want to group these 10 observations based on their 2D feature matrices (125, 59). Is this possible without flattening every observation into a 125*59 1D vector? I can't even apply PCA or LDA for feature extraction because the data is highly variant. Please note that I am trying to implement clustering through self-organizing maps or neural networks. Deep learning and neural networks are completely related to the question asked
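One way to avoid engineered 1D features is to define a distance directly between whole matrices and cluster on the resulting 10x10 distance matrix. A sketch using the Frobenius norm of the difference (an illustrative metric choice, with random stand-in data of the stated shape):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# hypothetical stand-in data: 10 observations, each a 125x59 feature matrix
obs = rng.normal(size=(10, 125, 59))

# pairwise Frobenius distance between whole matrices: a matrix-to-matrix
# metric, so no observation is ever reshaped into a 1D feature vector
n = len(obs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = np.linalg.norm(obs[i] - obs[j])

Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Any method that accepts a precomputed distance matrix (hierarchical clustering, spectral clustering, DBSCAN with metric="precomputed") can then be used.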

ELKI Kmeans clustering Task failed error for high dimensional data

Submitted by 只愿长相守 on 2019-12-04 05:51:43
Question: I have 60,000 documents which I processed in gensim, producing a 60000*300 matrix. I exported this as a CSV file. When I import it into the ELKI environment and run k-means clustering, I get the error below.

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types:
  DBID
  DoubleVector,variable,mindim=266,maxdim=300
  LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation
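The mindim=266,maxdim=300 in the error suggests the CSV rows have unequal numbers of columns, so ELKI cannot build a fixed-dimensional vector field. A sketch of one possible fix, padding every row to the full width before import (the file names and the zero fill value are illustrative assumptions):

```python
import csv
import os
import tempfile

def pad_rows(in_path, out_path, dim=300, fill="0"):
    """ELKI's k-means needs a fixed-dimensional vector field. Rows with
    fewer columns (e.g. sparse gensim vectors exported without trailing
    zeros) give mindim < maxdim and the NoSupportedDataTypeException,
    so pad every row to the same width before importing."""
    with open(in_path, newline="") as f_in, open(out_path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            writer.writerow(row + [fill] * (dim - len(row)))

# tiny demo with a ragged two-row file
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "ragged.csv"), os.path.join(d, "fixed.csv")
with open(src, "w") as f:
    f.write("0.1,0.2\n0.1,0.2,0.3\n")
pad_rows(src, dst, dim=3)
print(open(dst).read())
```

If the vectors came from a sparse gensim representation, densifying them (so every vector has exactly 300 entries) before export fixes the problem at the source.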

Scipy.cluster.hierarchy.fclusterdata + distance measure

Submitted by 扶醉桌前 on 2019-12-04 05:47:26
1) I am using scipy's hcluster module, so the variable I have control over is the threshold. How do I know my performance per threshold? That is, in k-means this performance would be the sum of distances from all points to their centroids. Of course, this has to be adjusted, since more clusters generally means less distance. Is there an analogous observation I can make with hcluster? 2) I realize there are tons of metrics available for fclusterdata. I am clustering text documents based on tf-idf of key terms. The deal is, some documents are longer than others, and I think that cosine is a good
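For question 1, one analogous observation is to cut the tree at each candidate threshold with fcluster and evaluate the same k-means-style objective on the resulting flat clustering. A sketch on a small made-up data set (two tight groups of three points):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5]])
Z = linkage(X, method="ward")

def within_cluster_cost(X, labels):
    """k-means-style objective: total distance from each point to the
    centroid of the flat cluster it was assigned to."""
    cost = 0.0
    for lab in np.unique(labels):
        pts = X[labels == lab]
        cost += np.linalg.norm(pts - pts.mean(axis=0), axis=1).sum()
    return cost

# lower thresholds give more clusters and hence a lower raw cost, which is
# the adjustment issue mentioned above (consider an elbow plot over t)
for t in (0.12, 0.5, 20):
    labels = fcluster(Z, t=t, criterion="distance")
    print(t, len(np.unique(labels)), within_cluster_cost(X, labels))
```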

Dendrogram or Other Plot from Distance Matrix

Submitted by 孤者浪人 on 2019-12-04 05:25:07
I have three matrices to compare. Each of them is 5x6. I originally wanted to use hierarchical clustering to cluster the matrices, such that the most similar matrices are grouped together, given a similarity threshold. I could not find any such function in Python, so I implemented the distance measure by hand (the p-norm with p=2). Now I have a 3x3 distance matrix (which I believe is also a similarity matrix in this case). I am now trying to produce a dendrogram. This is my code, and this is what is wrong. I want to produce a graph (a dendrogram if possible) that shows clusters of the matrices that
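A sketch of the dendrogram step, with a hypothetical 3x3 distance matrix standing in for the hand-computed one: the key detail is that scipy's linkage expects a condensed distance vector, so the square matrix must go through squareform first (otherwise the rows are treated as raw observations rather than distances):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# hypothetical symmetric distance matrix between the three 5x6 matrices
D = np.array([[0.0, 1.2, 6.0],
              [1.2, 0.0, 5.8],
              [6.0, 5.8, 0.0]])

# convert the square matrix to the condensed form linkage expects;
# checks=True verifies symmetry and a zero diagonal
condensed = squareform(D, checks=True)
Z = linkage(condensed, method="average")
tree = dendrogram(Z, labels=["A", "B", "C"], no_plot=True)
print(tree["ivl"])  # leaf order of the dendrogram
```

Drop no_plot=True (with matplotlib installed) to draw the tree; matrices A and B, being closest, merge first at height 1.2.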