I want to calculate the average geographical distance between a number of houses per province.
Suppose I have the following data.
df1 <- data.fram
Solution:
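library(geosphere)  # provides distHaversine()
lapply(split(df1, df1$province), function(df) {
  # pair every house in the province with every house in the province
  df <- Expand.Grid(df[, c("lat", "lon")], df[, c("lat", "lon")])
  # caution: distHaversine() reads column 1 as longitude and column 2 as
  # latitude, so check that your columns hold the values in that order
  mean(distHaversine(df[, 1:2], df[, 3:4]))
})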
where Expand.Grid() is taken from https://stackoverflow.com/a/30085602/3502164.
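For readers who cannot follow the link, the idea behind such a helper can be sketched roughly as below. This is not the code from the linked answer: the name cross_rows() and its suffix argument are hypothetical, and the sketch only assumes that the helper should pair every row of one data.frame with every row of another.

```r
# Hypothetical stand-in for the linked Expand.Grid(): pairs every row of `a`
# with every row of `b` (a row-wise Cartesian product).
cross_rows <- function(a, b, suffix = c(".1", ".2")) {
  # expand.grid() on the row indices yields all (i, j) index pairs
  idx <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
  left <- a[idx$i, , drop = FALSE]
  right <- b[idx$j, , drop = FALSE]
  # disambiguate the column names so both copies fit in one data.frame
  names(left) <- paste0(names(a), suffix[1])
  names(right) <- paste0(names(b), suffix[2])
  out <- cbind(left, right)
  rownames(out) <- NULL
  out
}

# made-up coordinates, for illustration only
pts <- data.frame(lat = c(1, 2, 3), lon = c(10, 20, 30))
pts_pairs <- cross_rows(pts, pts)
dim(pts_pairs)  # 3 x 3 = 9 rows, 4 columns
```

With a helper of this shape, columns 1:2 hold the first point of each pair and columns 3:4 the second, which is the layout the solution relies on.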
Explanation:
1. Performance
I would avoid using distm(), as it turns the vectorised function distHaversine() into an unvectorised one.
If you look at the source code you see:
function (x, y, fun = distHaversine)
{
    [...]
    for (i in 1:n) {
        dm[i, ] = fun(x[i, ], y)
    }
    return(dm)
}
While distHaversine() sends the "whole object" to C, distm() sends the data "row-wise" to distHaversine() and therefore forces distHaversine() to work row by row as well when executing the code in C. Therefore, distm() should not be used here. In terms of performance, I see more harm than benefit in using the wrapper function distm().
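The difference between the two calling patterns can be sketched as follows. To keep the example runnable without geosphere, it uses a hypothetical base-R haversine() as a stand-in for distHaversine(); the function, the random points, and the sample size are all assumptions made for illustration.

```r
# Hypothetical base-R haversine, only a stand-in for geosphere::distHaversine().
# Takes points with columns (lon, lat) in degrees and returns metres.
haversine <- function(p1, p2, r = 6378137) {
  to_rad <- pi / 180
  p1 <- matrix(p1, ncol = 2)  # also accepts a single point given as a vector
  p2 <- matrix(p2, ncol = 2)
  dlat <- (p2[, 2] - p1[, 2]) * to_rad
  dlon <- (p2[, 1] - p1[, 1]) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(p1[, 2] * to_rad) * cos(p2[, 2] * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

set.seed(1)
n <- 200
x <- cbind(lon = runif(n, -10, 10), lat = runif(n, 40, 50))  # made-up points

# row-wise, in the style of the distm() loop: one call per row
row_wise <- matrix(0, n, n)
for (i in seq_len(n)) {
  row_wise[i, ] <- haversine(x[i, ], x)
}

# vectorised: build all index pairs once, then make a single call
idx <- expand.grid(i = seq_len(n), j = seq_len(n))
vectorised <- haversine(x[idx$i, ], x[idx$j, ])

# both approaches should produce the same distances
same_result <- isTRUE(all.equal(as.vector(row_wise), vectorised))
same_result
```

Wrapping each variant in system.time() is one way to compare their cost on your own data.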
2. Explaining the code in "solution":
a) Splitting in groups:
You want to analyse the data per group: province.
Splitting into groups can be done with split(df1, df1$province).
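As a small illustration of what split() returns, here is a sketch on made-up data. The data.frame houses and its values are hypothetical and are not the data from the question.

```r
# made-up data for illustration; not the data from the question
houses <- data.frame(
  province = c(1, 1, 2, 2, 2),
  house    = 1:5,
  lat      = c(52.1, 52.3, 48.9, 48.7, 49.0),
  lon      = c(5.1, 4.9, 2.3, 2.4, 2.2)
)

# split() returns a named list with one data.frame per province
groups <- split(houses, houses$province)
names(groups)         # the province labels
sapply(groups, nrow)  # number of houses in each province
```

Because the result is a list, lapply() can then apply the same function to each province in turn.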
b) Grouping "clumps of columns"
You want to find all unique combinations of lat/lon. A first guess might be expand.grid(), but that does not work for multiple columns. Luckily, Mr. Flick took care of this with an expand.grid function for data.frames in R.
Then you have a data.frame() of all possible combinations and just have to use mean(distHaversine(...)).
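One point that may be worth checking: a full grid pairs every house with itself (distance 0) and counts each pair twice, so its mean is smaller than the mean over distinct pairs. The sketch below illustrates this on made-up points. It uses the Euclidean dist() from base R only as a stand-in for distHaversine(), and which of the two averages you want is an assumption you need to settle for your own analysis.

```r
# made-up coordinates; Euclidean dist() is only a stand-in for distHaversine()
pts <- data.frame(lat = c(0, 0, 3), lon = c(0, 4, 0))
n <- nrow(pts)

d <- as.matrix(dist(pts))  # full n x n matrix, zeros on the diagonal

# averaging over the full grid: self-pairs included, each pair counted twice
mean_all_pairs <- mean(d)

# averaging over distinct pairs only: each unordered pair once
mean_distinct <- mean(d[upper.tri(d)])

# the two means differ by a factor of (n - 1) / n
factor_holds <- isTRUE(all.equal(mean_all_pairs, mean_distinct * (n - 1) / n))
factor_holds
```

If the distinct-pair average is the one you need, multiplying the full-grid mean by n / (n - 1) for a province with n houses recovers it without changing the rest of the code.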