cluster-analysis | 易学教程

K-means with really large matrix

Question: I have to perform k-means clustering on a really huge matrix (about 300,000 x 100,000 values, which is more than 100 GB). I want to know whether I can use R or Weka to do this. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space. I have enough space for the calculations, but loading such a matrix seems to be a problem in R (I don't think that using the bigmemory package would help me, and a big matrix automatically uses all my RAM, then my swap file if that is not enough
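One way to keep memory use bounded is to avoid loading the matrix at all and stream its rows through an online (sequential) k-means update. The sketch below is in Python rather than R or Weka, and it is only an illustration under assumptions: the function name `sequential_kmeans`, the row iterator, and seeding the centroids from the first k rows are hypothetical choices, not something taken from the question. A single pass gives an approximation, not the result of full Lloyd iterations.

```python
from typing import Iterable, List, Sequence


def sequential_kmeans(rows: Iterable[Sequence[float]], k: int) -> List[List[float]]:
    """Online k-means: one pass over the data, one row in memory at a time."""
    centroids: List[List[float]] = []
    counts: List[int] = []
    for row in rows:
        if len(centroids) < k:
            # Assumption: seed the centroids with the first k rows.
            centroids.append(list(row))
            counts.append(1)
            continue
        # Find the nearest centroid by squared Euclidean distance.
        best = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, centroids[j])),
        )
        counts[best] += 1
        step = 1.0 / counts[best]
        # Move the winning centroid toward the row (running mean).
        centroids[best] = [c + step * (a - c) for a, c in zip(row, centroids[best])]
    return centroids
```

Because `rows` can be any iterable, it could be a generator that parses one line of a file at a time, so only k centroids of 100,000 values each would need to sit in RAM.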

How to calculate clustering entropy? A working example or software code [closed]

Question (closed as off-topic 3 years ago; not currently accepting answers): I would like to calculate the entropy of the example scheme at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html Can anybody please explain it step by step with real values? I know there are an unlimited number of formulas, but I am really bad at understanding formulas :) For example in the
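A step-by-step calculation can be sketched in a few lines of Python. It assumes one common definition of clustering entropy: for each cluster, H = -sum over classes of p * log2(p), where p is the fraction of the cluster's members belonging to a class, and the overall value is the average of the per-cluster entropies weighted by cluster size. The counts used are hypothetical and chosen only for illustration; the function names are likewise made up, and other definitions (for example, a different log base) exist.

```python
import math
from typing import List


def cluster_entropy(class_counts: List[int]) -> float:
    """Entropy (in bits) of the class distribution inside one cluster."""
    total = sum(class_counts)
    entropy = 0.0
    for count in class_counts:
        if count > 0:  # a class with no members contributes 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def clustering_entropy(clusters: List[List[int]]) -> float:
    """Size-weighted average of the per-cluster entropies."""
    n = sum(sum(c) for c in clusters)
    return sum(sum(c) / n * cluster_entropy(c) for c in clusters)


# Hypothetical counts: rows are clusters, columns are three true classes.
clusters = [[5, 1, 0], [1, 4, 1], [2, 0, 3]]
for i, c in enumerate(clusters, start=1):
    print(f"cluster {i}: entropy = {cluster_entropy(c):.3f} bits")
print(f"weighted total = {clustering_entropy(clusters):.3f} bits")
```

A cluster containing a single class has entropy 0, and a cluster split evenly between two classes has entropy 1 bit, so lower values indicate purer clusters.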

How to perform clustering without removing rows where NA is present in R

Question: I have data that contain some NA values. What I want to do is perform clustering without removing the rows where an NA is present. I understand that the Gower distance measure in daisy allows for this. But why doesn't my code below work? I welcome alternatives to 'daisy'.
# plot heat map with dendrogram together.
library("gplots")
library("cluster")
# Arbitrarily assigning NA to some elements
mtcars[2,2] <- "NA"
mtcars[6,7] <- "NA"
mydata <- mtcars
hclustfunc <-
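The idea behind a Gower-style distance with missing values can be sketched as follows: for each pair of rows, only the variables observed in both rows are compared, and the sum of range-normalised absolute differences is divided by the number of variables actually used. The code is a Python illustration for numeric columns only, not the implementation in R's `daisy`; the function names are hypothetical, and counting a constant column as a zero contribution is an assumption. Note that in R the quoted string "NA" is a character value rather than a missing value, which is why the sketch marks missing entries with `None` rather than a string.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def column_ranges(data: List[Row]) -> List[float]:
    """Range (max - min) of each column over its observed values."""
    ranges = []
    for col in zip(*data):
        seen = [v for v in col if v is not None]
        ranges.append(max(seen) - min(seen) if seen else 0.0)
    return ranges


def gower_distance(a: Row, b: Row, ranges: Sequence[float]) -> Optional[float]:
    """Gower dissimilarity for numeric variables; None marks a missing value."""
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip variables missing in either row
        if r > 0:
            total += abs(x - y) / r  # range-normalised absolute difference
        used += 1  # assumption: a constant column counts with contribution 0
    return total / used if used else None


data = [[1.0, 10.0], [3.0, None], [5.0, 30.0]]
ranges = column_ranges(data)
print(gower_distance(data[0], data[1], ranges))  # uses the first column only
```

The pairwise values produced this way could then be handed to any hierarchical clustering routine that accepts a precomputed distance matrix.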

Efficient way of calculating likeness scores of strings when sample size is large?

Question: Let's say that you have a list of 10,000 email addresses, and you'd like to find some of the closest "neighbors" in this list, defined as email addresses that are suspiciously close to other email addresses in your list. I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another. Let's say that I define "suspiciously close to another email
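A minimal Python sketch of the pairwise search is shown here. It relies on one fact that allows cheap pruning: the Levenshtein distance between two strings is never smaller than the difference of their lengths, so pairs whose lengths differ by more than the threshold can be skipped without computing anything. The names `levenshtein`, `close_pairs`, and `max_dist` are hypothetical, and the threshold is an assumption because the question's own definition of "suspiciously close" is cut off. The worst case is still quadratic in the number of strings.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def close_pairs(items: List[str], max_dist: int) -> List[Tuple[str, str, int]]:
    """All pairs within max_dist edits, skipping pairs ruled out by length."""
    ordered = sorted(items, key=len)
    pairs = []
    for i, s in enumerate(ordered):
        for t in ordered[i + 1:]:
            if len(t) - len(s) > max_dist:
                break  # every later string is at least as long as t
            d = levenshtein(s, t)
            if d <= max_dist:
                pairs.append((s, t, d))
    return pairs
```

For example, `close_pairs(addresses, 2)` would return every pair of addresses that differ by at most two single-character edits.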

How to group nearby latitude and longitude locations stored in SQL

Question: I'm trying to analyse data on cycle accidents in the UK to find statistical black spots. Here is an example of the data from another website: http://www.cycleinjury.co.uk/map I am currently using SQLite to store ~100k lat/lon locations. I want to group nearby locations together. This task is called cluster analysis. I would like to simplify the dataset by ignoring isolated incidents and instead showing only the origin of clusters where more than one accident has taken place in a small area.
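The simplest form of this grouping is a fixed grid: bucket each location into a square cell, then keep only the cells that contain more than one point. The Python sketch below illustrates that idea under assumptions; the function name, the default cell size of 0.001 degrees (roughly 111 m in latitude), and the `min_points` threshold are all hypothetical and would need tuning. It is a crude stand-in for a proper clustering algorithm: a group of accidents that straddles a cell boundary is split across cells.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def grid_clusters(points: List[Point], cell_deg: float = 0.001,
                  min_points: int = 2) -> List[Tuple[Point, int]]:
    """Bucket points into square cells; return (centre, count) per busy cell."""
    cells: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for lat, lon in points:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key].append((lat, lon))
    result = []
    for members in cells.values():
        if len(members) >= min_points:  # drop isolated incidents
            lat = sum(p[0] for p in members) / len(members)
            lon = sum(p[1] for p in members) / len(members)
            result.append(((lat, lon), len(members)))
    return result
```

The same bucketing could in principle be expressed in SQL by grouping on rounded latitude and longitude columns and filtering on the group count.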

How can I perform K-means clustering on time series data?

Question: How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster time series of size 1 x M, where M is the data length. In particular, I'm not sure how to update the mean of the cluster for time series data. I have a set of labelled time series, and I want to use the K-means algorithm to check whether I get back similar labels or not. My X matrix will be N x M, where N is the number of time series and M is
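If each series of length M is treated as a single point in M-dimensional space, the mean update is simply the element-wise average: the centroid's value at time step t is the mean of the assigned series' values at step t. The Python sketch below shows this under assumptions: all series have the same length and aligned time steps, distance is plain Euclidean, and the initial centroids are supplied by the caller. The function names are hypothetical, and the approach does not handle series that are shifted or stretched in time.

```python
from typing import List, Sequence

Series = Sequence[float]


def mean_series(members: List[Series]) -> List[float]:
    """Element-wise mean: average the values at each time step."""
    return [sum(step) / len(members) for step in zip(*members)]


def kmeans_series(X: List[Series], centroids: List[List[float]],
                  iterations: int = 10) -> List[int]:
    """Plain Lloyd iterations treating each length-M series as one point."""
    labels = [0] * len(X)
    for _ in range(iterations):
        # Assignment: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
            for x in X
        ]
        # Update: each centroid becomes the mean of its assigned series.
        for j in range(len(centroids)):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # keep the old centroid if the cluster is empty
                centroids[j] = mean_series(members)
    return labels
```

The returned labels can be compared with the known labels to check agreement; note that `centroids` is updated in place, and cluster indices are arbitrary, so the comparison has to allow for permuted label names.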

Choosing eps and minpts for DBSCAN (R)?

Question: I've been searching for an answer to this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:
library(fpc)
ds <- dbscan(USArrests,eps=20)
I chose eps merely by trial and error in this case. However, I am wondering if there is a function or code available to automate the choice of the best eps/minpts. I know some books recommend producing a plot
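One commonly cited heuristic for eps is the sorted k-distance plot: compute each point's distance to its k-th nearest neighbour, sort those distances, and look for a "knee" where the curve bends sharply upward. Whether this is the plot the truncated sentence refers to cannot be told from the excerpt. The Python sketch below computes only the values to plot; it is not part of the fpc package, the function name is hypothetical, and it uses a brute-force search that suits small data sets such as USArrests.

```python
import math
from typing import List, Sequence


def k_distances(points: List[Sequence[float]], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point (requires len(points) > k).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)
```

Plotting the returned list against its index and reading off the distance at the knee gives a candidate eps for the chosen k; picking the knee is still a visual judgement, so this narrows the trial and error rather than removing it.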

Clustering Lat/Longs in a Database

Question: I'm trying to see if anyone knows how to cluster some lat/long results, using a database, to reduce the number of results sent over the wire to the application. There are a number of resources about how to cluster, either on the client side or on the server (application) side, but not on the database side :( This is a similar question, asked by a fellow S.O. member. The solutions are server-side based (i.e. C# code-behind). Has anyone had any luck or experience with solving this, but in a

Which machine learning library to use [closed]

Question (closed as off-topic 6 years ago; not currently accepting answers): I am looking for a library that, ideally, has the following features: implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix); implements support vector machines; is in C++; is somewhat documented (this one seems to be the hardest). I would like this to be in C++, as I am

FCM Clustering numeric data and csv/excel file

Question: Hi, I asked a previous question that got a reasonable answer, and I thought I was back on track: "Fuzzy c-means tcp dump clustering in matlab". The problem is the preprocessing stage of the TCP/UDP data below, which I would like to run through MATLAB's fcm clustering algorithm. My question: 1) How do I convert the text data in the cells to numeric values, or what would be the best method? What should the numeric values be? Edit: My data in Excel now looks like this: 0,tcp,http,SF,239,486,0,0,0,0,0
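One possible conversion is one-hot encoding: each text column is replaced by a set of 0/1 indicator columns, one per distinct value, so that no artificial ordering is imposed on values such as protocol names. The sketch below is in Python rather than MATLAB and is only an illustration; the function name is hypothetical, the choice of one-hot encoding is an assumption rather than something the question settles, and the second data row is made up. Columns on very different scales would probably also need rescaling before clustering.

```python
from typing import Dict, List


def one_hot_encode(rows: List[List[str]], text_columns: List[int]) -> List[List[float]]:
    """Replace each text column with 0/1 indicator columns; parse the rest as floats."""
    # Collect the distinct values seen in each text column, in sorted order.
    categories: Dict[int, List[str]] = {
        c: sorted({row[c] for row in rows}) for c in text_columns
    }
    encoded = []
    for row in rows:
        out: List[float] = []
        for c, value in enumerate(row):
            if c in categories:
                out.extend(1.0 if value == cat else 0.0 for cat in categories[c])
            else:
                out.append(float(value))
        encoded.append(out)
    return encoded


# First row is from the question; the second row is made up for illustration.
rows = [
    "0,tcp,http,SF,239,486,0,0,0,0,0".split(","),
    "0,udp,domain_u,SF,45,44,0,0,0,0,0".split(","),  # hypothetical
]
numeric = one_hot_encode(rows, text_columns=[1, 2, 3])
```

Each output row is then fully numeric and could be written back to a CSV file for a clustering routine to read.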