问题
I want to compute the distance among all points in a very large matrix using distm from geosphere.
See a minimal example:
library(geosphere)
library(data.table)
coords <- data.table(coordX=c(1,2,5,9), coordY=c(2,2,0,1))
distances <- distm(coords, coords, fun = distGeo)
The issue is that due to the nature of the distances I am computing, distm gives me back a symmetric matrix, therefore, I could avoid to calculate more than half of the distances:
structure(c(0, 111252.129800202, 497091.059564718, 897081.91986428,
111252.129800202, 0, 400487.621661164, 786770.053508848, 497091.059564718,
400487.621661164, 0, 458780.072878927, 897081.91986428, 786770.053508848,
458780.072878927, 0), .Dim = c(4L, 4L))
May you help me to find a more efficient way to compute all those distances avoiding doing twice each one?
回答1:
If you want to compute all pairwise distances for points x, it is better to use distm(x) rather than distm(x,x). The distm function returns the same symmetric matrix in both cases but when you pass it a single argument it knows that the matrix is symmetric, so it won't do unnecessary computations.
You can time it.
library("geosphere")
n <- 500
xy <- matrix(runif(n*2, -90, 90), n, 2)
system.time( replicate(100, distm(xy, xy) ) )
# user system elapsed
# 61.44 0.23 62.79
system.time( replicate(100, distm(xy) ) )
# user system elapsed
# 36.27 0.39 38.05
You can also look at the R code for geosphere::distm to check that it treats the two cases differently.
Aside: Quick google search finds parallelDist: Parallel Distance Matrix Computation on CRAN. The geodesic distance is an option.
回答2:
You can prepare a data frame of possible combinations without repetitions (with gtools packages). Then to compute distances for those pairs. Here is the code:
library(gtools)
library(geosphere)
library(data.table)
coords <- data.table(coordX = c(1, 2, 5, 9), coordY = c(2, 2, 0, 1))
pairs <- combinations(n = nrow(coords), r = 2, repeats.allowed = F, v = c(1:nrow(coords)))
distances <- apply(pairs, 1, function(x) {
distm(coords[x[1], ], coords[x[2], ], fun = distGeo)
})
# Construct distances matrix
dist_mat <- matrix(NA, nrow = nrow(coords), ncol = nrow(coords))
dist_mat[upper.tri(dist_mat)] <- distances
dist_mat[lower.tri(dist_mat)] <- distances
dist_mat[is.na(dist_mat)] <- 0
print(dist_mat)
The results:
[,1] [,2] [,3] [,4]
[1,] 0.0 111252.1 497091.1 400487.6
[2,] 111252.1 0.0 897081.9 786770.1
[3,] 497091.1 400487.6 0.0 458780.1
[4,] 897081.9 786770.1 458780.1 0.0
回答3:
Using combn() from base R might be slightly simpler and probably faster than loading additional packages. Then, distm() uses distGeo() as a source, so using the latter should be even faster.
coords <- as.data.frame(coords) # this won't work with data.tables though
cbind(t(combn(1:4, 2)), unique(geosphere::distGeo(coords[combn(1:4, 2), ])))
# [,1] [,2] [,3]
# [1,] 1 2 111252.1
# [2,] 1 3 497091.1
# [3,] 1 4 897081.9
# [4,] 2 3 786770.1
# [5,] 2 4 400487.6
# [6,] 3 4 458780.1
We could check it out with a benchmark.
Unit: microseconds
expr min lq mean median uq max neval cld
distm 555.690 575.846 597.7672 582.352 596.1295 904.718 100 b
distGeo 426.335 434.372 450.0196 441.516 451.8490 609.524 100 a
Looks good.
来源:https://stackoverflow.com/questions/54971461/matrix-of-distances-with-geosphere-avoid-repeat-calculus