Question
I have two sets of points with x, y, z coordinates: dtmT (113k observations) and ptmT (200k observations). For every point in dtmT I want to calculate the shortest distance to any point in ptmT.

I'm very new to R and have no other programming background, so I wrote nested for loops: for each point of dtmT, the inner loop calculates the distance to every point in ptmT and stores it in a matrix (ptmTDistM). After the loop I use apply to get the minimum of each column of the matrix as a vector, then use cbind to attach it back to dtmT, so that the final product is dtmT with x, y, z and Dist, the shortest distance from that dtmT point to any point in ptmT.

This works fine for 5 observations and even 500, but it hangs when I try 5,000, and the full datasets are 113k observations in dtmT and 200k in ptmT. I originally wrote this with data frames, but reading some questions and answers led me to try matrices. I have also read that vectors and the lapply family would be best, but I'm not sure how to turn the nested for loops into lapply calls, especially since the indexing is so important to how I have it now. I have also seen dist(), but wasn't sure how to apply it to get what I need here.
The first 5 observations from each dataset are provided as well as what I've done so far.
Thanks very much for any help!
# first 5 observations of the ptmT dataset
ptmT <- c(621019.2, 621024.2, 621023.7, 621018.3, 621019.2,
          2701229.1, 2701231.2, 2701231.9, 2701230.2, 2701229.1,
          2071.5, 2080.0, 2080.0, 2071.5, 2071.5)
dim(ptmT) <- c(5, 3)
colnames(ptmT) <- c("XP", "YP", "ZP")
# first 5 observations of the dtmT dataset
dtmT <- c(621757.360, 621757.360, 621757.419, 621757.536, 621757.540,
          2701071.810, 2701071.810, 2701071.814, 2701071.843, 2701071.844,
          2089.210, 2088.110, 2070.435, 2053.536, 2052.951)
dim(dtmT) <- c(5, 3)
colnames(dtmT) <- c("X", "Y", "Z")
dtmTDist <- 0
ptmTDist <- 0
ptmTDistM <- matrix(data = NA, nrow = nrow(ptmT), ncol = nrow(dtmT))
require(svMisc)
for (row in 1:nrow(dtmT)) {
  progress(row)
  X <- dtmT[row, "X"]
  Y <- dtmT[row, "Y"]
  Z <- dtmT[row, "Z"]
  for (i in 1:nrow(ptmT)) {
    X2 <- ptmT[i, "XP"]
    Y2 <- ptmT[i, "YP"]
    Z2 <- ptmT[i, "ZP"]
    D <- sqrt((X - X2)^2 + (Y - Y2)^2 + (Z - Z2)^2)
    ptmTDistM[i, row] <- D
  }
}
Dist <- apply(ptmTDistM, 2, min)
dtmT2 <- cbind(dtmT, Dist)
Answer 1:
You can use a nearest-neighbour searching package such as https://github.com/jefferis/RANN, which returns, for each query point, the nearest reference point and its distance, using efficient spatial indexing:
# simulated data at the full problem size
P <- 200000
ptmT <- data.frame(x = runif(P), y = runif(P), z = runif(P))
D <- 113000
dtmT <- data.frame(x = runif(D), y = runif(D), z = runif(D))

library(RANN)
# for each row of dtmT (query), find its single nearest neighbour (k = 1) in ptmT (reference)
res <- nn2(ptmT, dtmT, 1)
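nn2 returns a list: res$nn.idx holds, for each row of dtmT, the row index of its nearest neighbour in ptmT, and res$nn.dists the corresponding Euclidean distance (one column each, since k = 1). So attaching the shortest distance back to dtmT, as the question asks, takes one more line:

# res$nn.dists[, 1] is the distance from each dtmT point to its nearest ptmT point
dtmT2 <- cbind(dtmT, Dist = res$nn.dists[, 1])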
Answer 2:
You may be able to get some performance improvements by taking advantage of features like R's vector arithmetic. But any method that requires checking every point from set A against every point from set B will become very demanding as both sets become large, since the number of comparisons to be made scales with O(m*n) where m and n are the size of the two sets.
One trick that sometimes helps with this problem is to chunk your sets by geography, and use that chunking to determine which pairs you actually test.
For example, in 2D:
- Randomly pick 100 points from A. For each of them, find the distance to their nearest neighbour in B, by comparing to every point in B. (Total: 100*n comparisons.)
- Let h = maximum value from above.
- Divide your space into blocks, size 2h x 2h. For any point in A, you can then be almost certain that its nearest neighbour in B will lie in its own block, or within one of the 8 neighbouring blocks.
- For each point in B, determine which block it lies in, and set up an index or a list of vectors so you can easily reference "all points in B that lie in block [x,y]" (see the sketch below).
- For each point P in A, find which block it lies in, and note how far it is from the nearest boundary of that block (call this d), then test it against all points in B that lie in the same block. (This is where you can take advantage of vector arithmetic.)
- If you find a point in B that is closer than or equal to d, then this is definitely the nearest neighbour, and you can stop.
- Otherwise, if the closest point you found was further than d, or there were no points at all from B in your search zone, expand the search to neighbouring blocks, and set d <- d + 2h.
- Repeat until you find a closest point, then go to next P until done.
This means that for each point in A, you're only testing against a small number of nearby points in B, rather than testing everything on the map. Even though the search method is more complex, for large m & n you should see much better search time.
If your data points have a very uneven distribution, you may need to play with the grid shape; ideally, the "blocks" are designed so that each contains only a few members of B.
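A rough sketch of the block-indexing step in 2D, assuming A and B are two-column coordinate matrices and h has already been estimated from the random sample (blockKey, bIndex and candidates are just illustrative names; the expanding-search fallback from the later steps is omitted):

# key each point by its 2h x 2h block, as "bx,by"
blockKey <- function(pts, h) {
  paste(floor(pts[, 1] / (2 * h)), floor(pts[, 2] / (2 * h)), sep = ",")
}
# index into B: row indices of B grouped by the block they fall in
bIndex <- split(seq_len(nrow(B)), blockKey(B, h))
# candidate rows of B for one query point p: its own block plus the 8 neighbours
candidates <- function(p, h) {
  bx <- floor(p[1] / (2 * h))
  by <- floor(p[2] / (2 * h))
  keys <- as.vector(outer(bx + (-1:1), by + (-1:1), paste, sep = ","))
  unlist(bIndex[keys], use.names = FALSE)  # missing blocks drop out as NULL
}
# vectorized distances from one point of A to the (few) candidates only
idx <- candidates(A[1, ], h)
d <- sqrt((A[1, 1] - B[idx, 1])^2 + (A[1, 2] - B[idx, 2])^2)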
Also, a minor economy: note that minimising distance-squared also minimises distance. So rather than finding min(dist) you can do sqrt(min(dist^2)) which will save you a lot of square-root operations, for what that's worth.
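In the style of the question's loop (X, Y, Z being the coordinates of one query point), that trick looks like:

d2 <- (X - ptmT[, "XP"])^2 + (Y - ptmT[, "YP"])^2 + (Z - ptmT[, "ZP"])^2
Dist <- sqrt(min(d2))   # one square root instead of nrow(ptmT) of them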
Answer 3:
Since we cannot avoid calculating the distance between two points (unless that exact same point pair has been calculated before), you have to do 113,000 × 200,000 distance calculations.
The only way to speed this up is by trying to make the computations as parallel as possible.
You should definitely try out the parallel packages suggested in the comments.
Here is my solution using the apply function in R, which vectorizes as much of the computation as possible.
# Function to calculate the Euclidean distance. We can simply use matrix algebra here.
computeDistance <- function(P, Q) {
  D <- sqrt(sum((P - Q)^2))
  return(D)
}

# Use apply row-wise on dtmT and, for each row, compute the distance to every row of ptmT.
# Since this is a perfectly parallel process, it is also easy to parallelize later.
distMat <- apply(dtmT, MARGIN = 1, function(p) {
  apply(ptmT, MARGIN = 1, FUN = function(q) computeDistance(p, q))
})

# Calculate the minimum of each column to get the minimum distance
minDist <- apply(distMat, 2, min)

# Attach to dtmT
dtmTFinal <- cbind(dtmT, "Minimum_Distance" = minDist)
I tried this on a 5,000 × 5,000 problem and it took around a minute on an average laptop.
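To follow the parallelization advice above, the outer apply can be swapped for parApply from the base parallel package. A minimal sketch (the worker count of 4 is an assumption; adjust it to your machine):

library(parallel)
cl <- makeCluster(4)  # assumption: 4 workers
clusterExport(cl, c("ptmT", "computeDistance"))  # ship the data and helper to the workers
distMat <- parApply(cl, dtmT, MARGIN = 1, function(p) {
  apply(ptmT, MARGIN = 1, FUN = function(q) computeDistance(p, q))
})
stopCluster(cl)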
Hope this helps.
Answer 4:
One major problem here is memory: your 113k × 200k matrix would take about 170 GB (113,000 × 200,000 doubles at 8 bytes each). However, you never need the full matrix; you only need the minimum distance for each point of dtmT. In addition, you can calculate this minimum in a vectorized fashion, leaving you with only one loop:
Dist <- vector(length = nrow(dtmT), mode = "numeric")
for (row in 1:nrow(dtmT)) {
  X <- dtmT[row, "X"]
  Y <- dtmT[row, "Y"]
  Z <- dtmT[row, "Z"]
  Dist[row] <- sqrt(min((X - ptmT[, "XP"])^2 + (Y - ptmT[, "YP"])^2 + (Z - ptmT[, "ZP"])^2))
}
cbind(dtmT, Dist)
Now this loop is "embarrassingly parallel", which you can parallelize, for example, using foreach:
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)
Dist <- foreach(row = 1:nrow(dtmT), .combine = c) %dopar% {
  X <- dtmT[row, "X"]
  Y <- dtmT[row, "Y"]
  Z <- dtmT[row, "Z"]
  sqrt(min((X - ptmT[, "XP"])^2 + (Y - ptmT[, "YP"])^2 + (Z - ptmT[, "ZP"])^2))
}
cbind(dtmT, Dist)
An alternative to using a for loop would be apply. Combining this with a more compact notation we get:
apply(dtmT, 1, function(x) sqrt(min(colSums((x-t(ptmT))^2))))
Again, apply can easily be parallelized. Applying this to a problem ten times smaller gives, on a dual-core machine:
library(parallel)
cl <- makeForkCluster(2)
dtmT <- matrix(runif(3 * 11300), ncol = 3)
ptmT <- matrix(runif(3 * 200000), ncol = 3)
system.time(Dist <- parApply(cl, dtmT, 1, function(x) sqrt(min(colSums((x-t(ptmT))^2)))))
#>    user  system elapsed
#>   0.021   0.004  34.474
head(cbind(dtmT, Dist))
#>                                             Dist
#> [1,] 0.9111543 0.5971182 0.8725145 0.010714792
#> [2,] 0.4893960 0.3321890 0.7440035 0.008545801
#> [3,] 0.3637524 0.6051168 0.7955850 0.003792442
#> [4,] 0.6684364 0.1819622 0.2487011 0.017937629
#> [5,] 0.6761877 0.1731773 0.3214378 0.011912805
#> [6,] 0.8060648 0.7789117 0.1673685 0.012680877
Source: https://stackoverflow.com/questions/49849645/r-improving-efficiency-of-nested-for-loops-for-simple-distance-calculations-wi