Parallel distance matrix in R

忘掉有多难 2020-12-08 17:14

Currently I'm using the built-in function dist to calculate my distance matrix in R.

dist(featureVector, method = "manhattan")

This is curr

6 answers
  • I've found parallelDist to be orders of magnitude faster than dist, while using much less virtual memory in the process, on my Mac under Microsoft R Open 3.4.0. A word of warning, though: I've had no luck compiling it on R 3.3.3. The package doesn't list a minimum R version as a dependency, but I suspect it has one.

  • 2020-12-08 17:17

    The R package amap provides robust and parallelized functions for clustering and principal component analysis. Among these, the Dist function offers what you are looking for: it computes and returns the distance matrix in parallel.

    Dist(x, method = "euclidean", nbproc = 8)
    

    The code above computes the Euclidean distance with 8 threads.

  • 2020-12-08 17:25

    Here's the structure for one route you could take. It is not faster than just using the dist() function; in fact, it takes many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using dist().

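    library(parallel)
    
    # Example data: 2000 observations with 100 features each
    vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)
    
    # Manhattan (taxicab) distance from one vector to every row of the matrix
    TaxiDistFun <- function(one.vec, whole.matrix) {
        diff.matrix <- t(t(whole.matrix) - one.vec)
        this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))
        return(this.row)
    }
    
    # Start one worker per core and copy the data and the function to each worker
    cl <- makeCluster(detectCores())
    clusterExport(cl, c("vec.array", "TaxiDistFun"))
    
    # Apply the function to every row in parallel; the result is one long vector
    system.time(dist.array <- parRapply(cl, vec.array,
                            function(x) TaxiDistFun(x, vec.array)))
    
    stopCluster(cl)
    
    # Reshape the vector into a full 2000 x 2000 distance matrix
    dim(dist.array) <- c(2000, 2000)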
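
    As a sanity check on the approach above, here is a minimal sketch that runs the same row-wise function sequentially on a small matrix and compares the result with dist(). The matrix `small` and the names `by.rows` and `reference` are made up for illustration; `apply()` stands in for `parRapply()` so that no cluster is needed.

    ```r
    # Same helper as above: Manhattan distance from one vector to every row.
    TaxiDistFun <- function(one.vec, whole.matrix) {
        diff.matrix <- t(t(whole.matrix) - one.vec)
        apply(diff.matrix, 1, function(x) sum(abs(x)))
    }

    set.seed(1)
    small <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)  # hypothetical test data

    # Sequential stand-in for parRapply: one row of distances per input row.
    # apply() returns the per-row results as columns, hence the transpose.
    by.rows <- t(apply(small, 1, function(x) TaxiDistFun(x, small)))

    # Reference result from the built-in function, expanded to a full matrix.
    reference <- as.matrix(dist(small, method = "manhattan"))

    # Compare values only; as.matrix() adds dimnames that by.rows lacks.
    all.equal(by.rows, reference, check.attributes = FALSE)
    ```

    If the two agree on a small case like this, any remaining difference between the parallel version and dist() is about run time, not about the result.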
    
  • 2020-12-08 17:27

    You can also use the parDist function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see parDist).

    Performance comparison for Manhattan distance (system spec: Mac OS; Intel Core i7 with 4 cores @ 2.5 GHz and hyperthreading enabled):

    library(parallelDist)
    library(amap)
    library(wordspace)
    library(microbenchmark)
    
    set.seed(123)
    x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)
    
    microbenchmark(parDist(x, method = "manhattan"),
                   Dist(x, method = "manhattan", nbproc = 8),
                   dist.matrix(x, method = "manhattan"),
                   times = 10)
    
    Unit: milliseconds
                                          expr      min       lq     mean   median       uq      max neval
              parDist(x, method = "manhattan") 210.9478 214.3557 225.5894 221.3705 237.9829 247.0844    10
     Dist(x, method = "manhattan", nbproc = 8) 749.9397 755.7351 797.6349 812.6109 824.4075 844.1090    10
          dist.matrix(x, method = "manhattan") 256.0831 263.3273 279.0864 275.1882 296.3256 311.3821    10
    

    With a larger matrix:

    x <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)
    microbenchmark(parDist(x, method = "manhattan"),
                   Dist(x, method = "manhattan", nbproc = 8),
                   dist.matrix(x, method = "manhattan"),
                   times = 10)
    Unit: seconds
                                          expr       min        lq      mean    median        uq       max neval
              parDist(x, method = "manhattan")  6.298234  6.388501  6.737168  6.894203  6.947981  7.221661    10
     Dist(x, method = "manhattan", nbproc = 8) 22.722947 24.113681 24.326157 24.477034 24.658145 25.301353    10
          dist.matrix(x, method = "manhattan")  7.156861  7.505229  7.544352  7.567980  7.655624  7.800530    10
    

    Further performance comparisons can be found in parallelDist's vignette.
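
    For the call in the question, the swap might look like the sketch below. The data here is a made-up stand-in for `featureVector`, and the `threads` value is only an example (as far as I know, parDist chooses a thread count itself when it is left out). The comparison is skipped if parallelDist is not installed.

    ```r
    set.seed(123)
    # Hypothetical stand-in for the featureVector from the question.
    featureVector <- matrix(rnorm(300 * 10), nrow = 300, ncol = 10)

    # Baseline from the built-in function.
    d.base <- dist(featureVector, method = "manhattan")

    if (requireNamespace("parallelDist", quietly = TRUE)) {
        # Same arguments as dist(); threads = 2 is an arbitrary example value.
        d.par <- parallelDist::parDist(featureVector, method = "manhattan",
                                       threads = 2)
        # Both results are "dist" objects, so compare the stored values.
        same <- isTRUE(all.equal(as.vector(d.base), as.vector(d.par)))
    } else {
        same <- NA  # package not installed; nothing to compare
    }
    same
    ```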

  • 2020-12-08 17:36

    I am also working with somewhat large distance matrices and trying to speed up the computation. Will Benson above is likely correct when he says that "the time to start up the function and export the variables to the cluster would probably be longer than just using dist()".

    However, I think this applies to distance matrices of small to moderate size. See the example below using the function Dist from the package amap with 10 processors, dist from the package stats, and rdist from the package fields, which calls a Fortran function. The first example creates a 400 x 400 distance matrix. The second creates a 3103 x 3103 distance matrix.

    require(sp)
    require(fields)
    require(amap)
    data(meuse.grid)
    meuse.gridA <- meuse.grid[1:400, 1:2]
    meuse.gridB <- meuse.grid[, 1:2]
    
    # small distance matrix
    a <- Sys.time()
    invisible(dist(meuse.gridA, diag = TRUE, upper = TRUE))
    Sys.time() - a
    # Time difference of 0.002138376 secs
    a <- Sys.time()
    invisible(Dist(meuse.gridA, nbproc = 10, diag = TRUE, upper = TRUE))
    Sys.time() - a
    # Time difference of 0.005409241 secs
    a <- Sys.time()
    invisible(rdist(meuse.gridA))
    Sys.time() - a
    # Time difference of 0.02312016 secs
    
    # large distance matrix
    a <- Sys.time()
    invisible(dist(meuse.gridB, diag = TRUE, upper = TRUE))
    Sys.time() - a
    # Time difference of 0.09845328 secs
    a <- Sys.time()
    invisible(Dist(meuse.gridB, nbproc = 10, diag = TRUE, upper = TRUE))
    Sys.time() - a
    # Time difference of 0.05900002 secs
    a <- Sys.time()
    invisible(rdist(meuse.gridB))
    Sys.time() - a
    # Time difference of 0.8928168 secs
    

    Note how the computation time dropped from 0.09845328 secs with dist to 0.05900002 secs with Dist when the distance matrix was large (3103 x 3103). As such, I would suggest using the function Dist from the amap package, provided you have several processors available.

  • 2020-12-08 17:42

    I am a Windows user looking for an efficient way to compute a distance matrix for use in hierarchical clustering (using the function hclust from the "stats" package, for example). The function Dist doesn't work in parallel on Windows, so I had to look for something different, and I found the "wordspace" package by Stefan Evert, which contains the dist.matrix function. You can try this code:

    library(wordspace)

    # 5000 instances (rows) with 1000 binary features (columns)
    X <- data.frame(replicate(1000, sample(0:1, 5000, replace = TRUE)))
    system.time(d <- dist(X, method = "manhattan"))
    system.time(d2 <- as.dist(dist.matrix(as.matrix(X), method = "manhattan")))
    

    As you can see, computing the distance matrix for a data frame with 1000 binary features and 5000 instances is much faster with dist.matrix.

    These are the results on my laptop (i7-6500U):

    > system.time(d <- dist(X, method = "manhattan"))
       user  system elapsed 
     151.79    0.04  152.59 
    > system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))
       user  system elapsed 
      19.19    0.22   19.56 
    

    This solved my problem. Here you can check the original thread where I found it: http://r.789695.n4.nabble.com/Efficient-distance-calculation-on-big-matrix-td4633598.html

    It doesn't run in parallel, but it is fast enough in many cases.
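
    Since the goal here is hierarchical clustering, this is a rough sketch of how the resulting "dist" object might be passed on to hclust. It uses a small made-up binary matrix and stats::dist as a stand-in so that it runs without wordspace; the object returned by `as.dist(dist.matrix(...))` above should be usable in the same place. The linkage method and the number of clusters are arbitrary example choices.

    ```r
    set.seed(42)
    # Hypothetical small data set: 200 instances with 20 binary features.
    X <- matrix(sample(0:1, 200 * 20, replace = TRUE), nrow = 200)

    # Stand-in for as.dist(dist.matrix(X, method = "manhattan")).
    d <- dist(X, method = "manhattan")

    # Hierarchical clustering on the distance object.
    hc <- hclust(d, method = "average")

    # Cut the tree into 4 groups and count the members of each group.
    groups <- cutree(hc, k = 4)
    table(groups)
    ```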
