Parallel distance Matrix in R

忘掉有多难 2020-12-08 17:14

Currently I'm using the built-in function dist to calculate my distance matrix in R.

dist(featureVector, method = "manhattan")

This is curr

6 answers
  •  执笔经年
    2020-12-08 17:42

    I am a Windows user looking for an efficient way to compute a distance matrix for use in hierarchical clustering (for example, with the hclust function from the "stats" package). The function Dist does not run in parallel on Windows, so I had to look for something else, and I found Stefan Evert's "wordspace" package, which contains the dist.matrix function. You can try this code:

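    library(wordspace)  # provides dist.matrix

    # 5000 instances (rows) x 1000 binary features (columns)
    X <- data.frame(replicate(1000, sample(0:1, 5000, replace = TRUE)))
    system.time(d <- dist(X, method = "manhattan"))
    system.time(d2 <- as.dist(dist.matrix(as.matrix(X), method = "manhattan")))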
    

    As you can see, computing the distance matrix for a data frame with 1000 binary features and 5000 instances is much faster with dist.matrix (about 20 s versus about 153 s elapsed in the run below).

    These are the results on my laptop (i7-6500U):

    > system.time(d <- dist(X, method = "manhattan"))
       user  system elapsed 
     151.79    0.04  152.59 
    > system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))
       user  system elapsed 
      19.19    0.22   19.56 
    

    This solved my problem. Here you can check the original thread where I found it: http://r.789695.n4.nabble.com/Efficient-distance-calculation-on-big-matrix-td4633598.html

    It doesn't run in parallel, but it is fast enough in many cases.
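If a parallel computation is still needed on Windows, one option is a socket cluster from the base "parallel" package, which does not rely on forking. The following is only a minimal sketch under that assumption, not something from the thread above: `manhattan_block` and `parallel_manhattan` are hypothetical helper names, the number of workers is arbitrary, and the speed has not been measured here.

```r
library(parallel)  # ships with base R

# Hypothetical helper: Manhattan distances from a block of rows to all rows.
# tX is the transposed data matrix (features x instances), so column i is row i.
manhattan_block <- function(rows, tX) {
  t(vapply(rows,
           function(i) colSums(abs(tX - tX[, i])),
           numeric(ncol(tX))))
}

# Hypothetical wrapper: split the rows into chunks, compute one block of the
# distance matrix per worker, then stack the blocks and convert to a dist object.
parallel_manhattan <- function(X, n_workers = 2) {
  X <- as.matrix(X)
  tX <- t(X)
  n_workers <- min(n_workers, nrow(X))
  chunks <- splitIndices(nrow(X), n_workers)
  cl <- makeCluster(n_workers)  # PSOCK cluster by default, usable on Windows
  on.exit(stopCluster(cl))
  blocks <- parLapply(cl, chunks, manhattan_block, tX = tX)
  D <- do.call(rbind, blocks)
  dimnames(D) <- list(rownames(X), rownames(X))
  as.dist(D)
}

# Small usage example on random binary data
X_small <- matrix(sample(0:1, 50 * 10, replace = TRUE), nrow = 50)
d_par <- parallel_manhattan(X_small, n_workers = 2)
```

Each worker receives a full copy of the data, so this trades memory for speed. For small inputs the cost of starting the cluster may outweigh any gain.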
