cluster-analysis

K-means with really large matrix

落爺英雄遲暮 submitted on 2019-12-18 15:48:33
Question: I have to perform k-means clustering on a really huge matrix (about 300,000 x 100,000 values, which is more than 100 GB). I want to know whether I can use R or Weka to do this. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space. I have enough space for the calculations, but loading such a matrix seems to be a problem with R (I don't think that using the bigmemory package would help me, and big matrix automatically uses all my RAM, then my swap file if not enough
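
One option that is sometimes suggested for matrices that do not fit in RAM is to stream the rows from disk and update the centroids incrementally. The sketch below uses sequential (online) k-means, which is a swapped-in technique and not the standard batch algorithm the question mentions. It is written in Python with the standard library only; `fake_chunks` is a hypothetical stand-in for whatever reads blocks of rows from the real file, and the two starting centroids are made up.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def online_kmeans(row_chunks, init_centroids):
    """Sequential k-means: a single pass, holding one chunk in memory at a time."""
    centroids = [list(c) for c in init_centroids]
    counts = [0] * len(centroids)
    for chunk in row_chunks:
        for row in chunk:
            # Assign the row to its nearest centroid.
            j = min(range(len(centroids)), key=lambda i: sq_dist(row, centroids[i]))
            counts[j] += 1
            # Move that centroid toward the row; with step 1/count the centroid
            # is the running mean of every row assigned to it so far.
            eta = 1.0 / counts[j]
            centroids[j] = [c + eta * (x - c) for c, x in zip(centroids[j], row)]
    return centroids, counts

def fake_chunks(n_chunks=20, chunk_size=50, seed=0):
    """Hypothetical data source: yields small blocks of 2-D rows from two blobs."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        chunk = []
        for _ in range(chunk_size):
            center = rng.choice([0.0, 10.0])
            chunk.append([center + rng.gauss(0, 0.5), center + rng.gauss(0, 0.5)])
        yield chunk

centroids, counts = online_kmeans(fake_chunks(), init_centroids=[[1.0, 1.0], [9.0, 9.0]])
print(centroids, counts)
```

Memory use depends on the chunk size and the number of centroids, not on the number of rows. The result does depend on row order and on the starting centroids, so this should be read as an approximation to batch k-means.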

How to calculate clustering entropy? A working example or software code [closed]

六眼飞鱼酱① submitted on 2019-12-18 12:02:41
Question (closed as off-topic): I would like to calculate the entropy of this example scheme: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html Can anybody please explain it step by step with real values? I know there are an unlimited number of formulas, but I am really bad at understanding formulas :) For example in the
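
A step-by-step calculation can be sketched in a few lines of Python. The class counts below are an assumption: they are the counts as I read them from the figure on the linked page (three clusters holding crosses, circles and diamonds), so they should be checked against the figure. For each cluster, the entropy is the sum over classes of -p * log2(p), where p is the share of that class inside the cluster; the overall value weights each cluster by its size.

```python
from math import log2

# Assumed class counts per cluster, read from the linked figure.
clusters = {
    "cluster 1": {"x": 5, "o": 1},
    "cluster 2": {"x": 1, "o": 4, "diamond": 1},
    "cluster 3": {"x": 2, "diamond": 3},
}

def cluster_entropy(counts):
    """Entropy (in bits) of the class distribution inside one cluster."""
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

total = sum(sum(c.values()) for c in clusters.values())

# Step 1: entropy of each cluster on its own.
per_cluster = {name: cluster_entropy(c) for name, c in clusters.items()}

# Step 2: overall entropy, each cluster weighted by its share of the points.
weighted = sum(sum(c.values()) / total * per_cluster[name]
               for name, c in clusters.items())

# Purity, for comparison: share of points that belong to their cluster's majority class.
purity = sum(max(c.values()) for c in clusters.values()) / total

for name, h in per_cluster.items():
    print(name, round(h, 3))
print("weighted entropy:", round(weighted, 3), "purity:", round(purity, 3))
```

Lower entropy means purer clusters; a cluster containing a single class has entropy 0.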

How to perform clustering without removing rows where NA is present in R

二次信任 submitted on 2019-12-18 11:33:52
Question: I have data that contains some NA values. What I want to do is perform clustering without removing the rows where an NA is present. I understand that the Gower distance measure in daisy allows for this. But why doesn't my code below work? I welcome alternatives to 'daisy'.
# plot heat map with dendrogram together.
library("gplots")
library("cluster")
# Arbitrarily assigning NA to some elements
mtcars[2,2] <- "NA"
mtcars[6,7] <- "NA"
mydata <- mtcars
hclustfunc <-
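
One thing that may be worth checking in the snippet above: `"NA"` in quotes is a character string, not the missing-value constant `NA`, so assigning it turns the affected numeric columns into character columns. Separately, the idea behind Gower distance with missing values can be sketched outside R. The Python below is a simplified illustration for numeric columns only and is not the `daisy` implementation; the rows are made-up values and `None` stands in for NA. Each column contributes |x - y| / range, and a column is skipped for a given pair when either value is missing.

```python
def column_ranges(rows):
    """Range (max - min) of each column, ignoring missing values."""
    ranges = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        ranges.append(max(present) - min(present) if present else 0.0)
    return ranges

def gower_distance(a, b, ranges):
    """Mean range-scaled absolute difference over columns present in both rows."""
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # this column carries no information for this pair
        if r > 0:
            total += abs(x - y) / r
        used += 1
    return total / used if used else float("nan")

# Made-up numeric rows; None marks a missing value.
rows = [
    [21.0, 6.0, 160.0],
    [21.0, None, 160.0],
    [22.8, 4.0, 108.0],
    [18.7, 8.0, None],
]
ranges = column_ranges(rows)
dist = [[gower_distance(a, b, ranges) for b in rows] for a in rows]
for line in dist:
    print([round(d, 3) for d in line])
```

The resulting matrix keeps every row, including the incomplete ones, and could be handed to any hierarchical clustering routine that accepts precomputed distances.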

Efficient way of calculating likeness scores of strings when sample size is large?

早过忘川 submitted on 2019-12-18 11:09:44
Question: Let's say that you have a list of 10,000 email addresses, and you'd like to find some of the closest "neighbors" in this list, defined as email addresses that are suspiciously close to other email addresses in your list. I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another. Let's say that I define "suspiciously close to another email
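
A minimal sketch of one way to cut down the all-pairs work, assuming "suspiciously close" means a Levenshtein distance of at most some small threshold (the threshold of 2 and the addresses below are made up). Two standard-library-only tricks are used: pairs whose lengths differ by more than the threshold are skipped without computing anything, and the distance computation stops as soon as a whole row of the table exceeds the threshold.

```python
def levenshtein(a, b, max_dist=None):
    """Edit distance between a and b.

    If max_dist is given, any result above it is reported as max_dist + 1,
    which lets the function give up early on clearly different strings.
    """
    if max_dist is not None and abs(len(a) - len(b)) > max_dist:
        return max_dist + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        if max_dist is not None and min(cur) > max_dist:
            return max_dist + 1  # row minima never decrease, so stop here
        prev = cur
    return prev[-1]

def close_pairs(items, max_dist=2):
    """All pairs of distinct strings within max_dist edits of each other."""
    items = sorted(set(items), key=len)
    pairs = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if len(b) - len(a) > max_dist:
                break  # sorted by length: every later string is too long as well
            d = levenshtein(a, b, max_dist)
            if d <= max_dist:
                pairs.append((a, b, d))
    return pairs

# Made-up addresses for illustration.
emails = ["alice@example.com", "alice@exampel.com",
          "alicee@example.com", "bob@example.org"]
pairs = close_pairs(emails)
for a, b, d in pairs:
    print(a, b, d)
```

This is still quadratic in the worst case, when all strings have similar lengths; for much larger lists, an index built for edit-distance lookups would be the next thing to look at.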

How to group nearby latitude and longitude locations stored in SQL

十年热恋 submitted on 2019-12-18 10:53:10
Question: I'm trying to analyse data from cycle accidents in the UK to find statistical black spots. Here is an example of the data from another website: http://www.cycleinjury.co.uk/map I am currently using SQLite to store ~100k lat/lon locations. I want to group nearby locations together. This task is called cluster analysis. I would like to simplify the dataset by ignoring isolated incidents and instead showing only the origin of clusters where more than one accident has taken place in a small area.
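
A minimal sketch of one simple way to do this inside SQLite itself, assuming that a fixed grid is an acceptable definition of "a small area". This is grid binning, not a full clustering algorithm. The table name `accidents`, the cell size and the coordinates are all made up for illustration. Each point is snapped to a grid cell, cells are counted, and cells holding a single incident are dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")

# Made-up coordinates: a group of three, a group of two, and one isolated point.
conn.executemany(
    "INSERT INTO accidents (lat, lon) VALUES (?, ?)",
    [(51.5010, -0.1410), (51.5011, -0.1411), (51.5012, -0.1409),
     (51.5300, -0.1000), (51.5301, -0.1001),
     (51.6000, -0.2000)],
)

CELL = 0.005  # grid cell size in degrees (an arbitrary choice for this sketch)

query = """
SELECT COUNT(*) AS n, AVG(lat) AS lat, AVG(lon) AS lon
FROM (
    SELECT lat, lon,
           ROUND(lat / :cell) AS gy,   -- grid row of the point
           ROUND(lon / :cell) AS gx    -- grid column of the point
    FROM accidents
)
GROUP BY gy, gx
HAVING COUNT(*) > 1                    -- ignore isolated incidents
ORDER BY n DESC
"""
hotspots = conn.execute(query, {"cell": CELL}).fetchall()
for n, lat, lon in hotspots:
    print(n, round(lat, 4), round(lon, 4))
```

The averaged latitude and longitude give a rough "origin" for each busy cell. A known weakness of this approach is that a tight group straddling a cell boundary gets split across cells; density-based methods such as DBSCAN avoid that at the cost of more computation.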

How can I perform K-means clustering on time series data?

六月ゝ 毕业季﹏ submitted on 2019-12-18 10:37:17
Question: How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster time series of size 1 x M, where M is the data length. In particular, I'm not sure how to update the mean of the cluster for time series data. I have a set of labelled time series, and I want to use the K-means algorithm to check whether I will get back a similar labelling or not. My X matrix will be N x M, where N is the number of time series and M is
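
A minimal sketch of the most direct reading, assuming all series have the same length M and are already aligned in time: each series is treated as one point in M-dimensional space, and the cluster mean is updated element by element, i.e. the t-th value of the centroid is the average of the t-th values of its members. The series and the choice of starting centroids (`init_idx`) below are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(series, init_idx, n_iter=20):
    """Plain Lloyd's k-means on an N x M list of series.

    init_idx picks which series serve as the starting centroids.
    """
    centroids = [list(series[i]) for i in init_idx]
    labels = [0] * len(series)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every series.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(s, centroids[j]))
                  for s in series]
        # Update step: element-wise mean over the members of each cluster.
        for j in range(len(centroids)):
            members = [s for s, lab in zip(series, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up data: three rising and three falling series of length M = 4.
X = [
    [0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.0, 3.2], [-0.1, 0.9, 2.1, 2.9],
    [3.0, 2.0, 1.0, 0.0], [3.1, 2.0, 1.1, 0.1], [2.9, 2.1, 0.9, -0.1],
]
labels, centroids = kmeans(X, init_idx=(0, 3))
print(labels)
print(centroids)
```

The cluster indices are arbitrary, so comparing them with known labels needs a measure that ignores label names, such as purity or the Rand index. If the series are shifted or stretched in time relative to each other, plain Euclidean distance may be a poor fit and an elastic measure such as dynamic time warping is a commonly discussed alternative.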

Choosing eps and minpts for DBSCAN (R)?

自古美人都是妖i submitted on 2019-12-18 10:33:36
Question: I've been searching for an answer to this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:
library(fpc)
ds <- dbscan(USArrests,eps=20)
Choosing eps was merely trial and error in this case. However, I am wondering if there is a function or code available to automate the choice of the best eps/minpts. I know some books recommend producing a plot
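
A common heuristic for eps is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbour, sort those distances, and look for the "knee" where they start to rise sharply. The sketch below only computes the curve, in Python with the standard library; choosing the knee is left to the eye, and the points are made up and are not the USArrests data. Tying k to the intended minimum number of points is a usual convention, treated here as an assumption.

```python
from math import dist

def k_distances(points, k):
    """Sorted distances from every point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)

# Made-up data: two tight groups of five points and one far-away outlier.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
    (50, 50),
]
kd = k_distances(points, k=3)
for rank, value in enumerate(kd):
    print(rank, round(value, 3))
```

In this toy data the curve stays flat for the ten grouped points and jumps for the outlier; an eps just above the flat part would keep both groups and leave the outlier as noise. This brute-force version is quadratic in the number of points, which is fine for a data set the size of USArrests.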

Clustering Lat/Longs in a Database

☆樱花仙子☆ submitted on 2019-12-18 10:29:40
Question: I'm trying to see if anyone knows how to cluster some lat/long results, using a database, to reduce the number of results sent over the wire to the application. There are a number of resources about how to cluster, either on the client side or on the server (application) side, but not on the database side :( This is a similar question, asked by a fellow S.O. member. The solutions are server-side based (i.e. C# code-behind). Has anyone had any luck or experience with solving this, but in a

Which machine learning library to use [closed]

早过忘川 submitted on 2019-12-18 09:56:15
Question (closed as off-topic): I am looking for a library that, ideally, has the following features: implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix); implements support vector machines; is in C++; is somewhat documented (this one seems to be the hardest). I would like this to be in C++, as I am

FCM Clustering numeric data and csv/excel file

好久不见. submitted on 2019-12-18 09:29:17
Question: Hi, I asked a previous question (Fuzzy c-means tcp dump clustering in matlab) that got a reasonable answer, and I thought I was back on track. The problem is the preprocessing stage for the TCP/UDP data below, which I would like to run through MATLAB's fcm clustering algorithm. My question: 1) How do I convert the text data in the cells to numeric values, or what would be the best method? What should the numeric values be? Edit: My data in Excel looks like this now: 0,tcp,http,SF,239,486,0,0,0,0,0
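
One widely used way to turn text columns into numbers for distance-based clustering is one-hot encoding: each distinct text value gets its own 0/1 column, so no artificial ordering is imposed (unlike coding tcp=1, udp=2, and so on). The sketch below illustrates the idea in Python and is not MATLAB code. Only the first data row comes from the question; the other two rows, and the assumption that columns 1 to 3 are the text columns, are made up for illustration.

```python
import csv
import io

# First row is from the question; the other two rows are hypothetical.
RAW = """\
0,tcp,http,SF,239,486,0,0,0,0,0
0,udp,domain_u,SF,105,146,0,0,0,0,0
0,tcp,smtp,REJ,0,0,0,0,0,0,0
"""
CATEGORICAL = (1, 2, 3)  # assumed positions of the text columns

rows = list(csv.reader(io.StringIO(RAW)))

# Vocabulary per text column: every distinct value, in sorted order.
vocab = {c: sorted({r[c] for r in rows}) for c in CATEGORICAL}

def encode(row):
    """One-hot encode the text columns and convert the rest to float."""
    out = []
    for c, value in enumerate(row):
        if c in CATEGORICAL:
            out.extend(1.0 if value == v else 0.0 for v in vocab[c])
        else:
            out.append(float(value))
    return out

X = [encode(r) for r in rows]
for vec in X:
    print(vec)
```

The numeric columns here span very different ranges (0/1 indicators next to values in the hundreds), so rescaling each column before clustering, for example to the 0 to 1 range, is usually worth considering; otherwise the large columns dominate the distance.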