Geographical distance by group - Applying a function on each pair of rows

清歌不尽 2020-12-21 05:04

I want to calculate the average geographical distance between a number of houses per province.

Suppose I have the following data.

df1 <- data.fram         


        
7 Answers
  •  攒了一身酷
    2020-12-21 05:39

    Solution:

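    library(geosphere)  # provides distHaversine()

    lapply(split(df1, df1$province), function(df){
      # distHaversine() expects coordinates as (longitude, latitude)
      df <- Expand.Grid(df[, c("lon", "lat")], df[, c("lon", "lat")])
      mean(distHaversine(df[, 1:2], df[, 3:4]))
    })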
    

    where Expand.Grid() is taken from https://stackoverflow.com/a/30085602/3502164.
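
    Expand.Grid() is not defined in this answer, so here is a minimal sketch of a helper that plays the same role. It is an assumption, not necessarily the code behind the link: the name cross_rows() and the toy coordinates are made up for illustration. It relies on base R's merge(), which returns the Cartesian product of two data frames when by = NULL.

```r
# Hypothetical stand-in for Expand.Grid(): pair every row of `a`
# with every row of `b` (Cartesian product of the rows).
cross_rows <- function(a, b) {
  merge(a, b, by = NULL)
}

# Toy coordinates, made up for illustration only.
pts <- data.frame(lon = c(4.9, 5.1, 4.5), lat = c(52.4, 52.1, 51.9))
pairs <- cross_rows(pts, pts)

nrow(pairs)   # 3 * 3 = 9 rows
names(pairs)  # "lon.x" "lat.x" "lon.y" "lat.y"
```

    Because both inputs share column names, merge() adds the suffixes .x and .y, so columns 1:2 hold the first point and columns 3:4 the second.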

    Explanation:

    1. Performance

    I would avoid distm(), because it wraps the vectorised function distHaversine() in a row-by-row loop and so gives up the vectorisation. If you look at the source code you see:

    function (x, y, fun = distHaversine) 
    {
       [...]
       for (i in 1:n) {
            dm[i, ] = fun(x[i, ], y)
        }
        return(dm)
    }
    

    While distHaversine() sends the "whole object" to C, distm() passes the data to distHaversine() row by row and therefore forces distHaversine() to work row by row when executing the code in C. Therefore, distm() should not be used. In terms of performance, I see more harm than benefit in using the wrapper function distm().
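
    The difference between the two calling patterns can be sketched in base R. This is only an illustration under assumptions: haversine_m() is a hand-written haversine formula (not geosphere's implementation), the Earth radius of 6378137 m is an assumed default, and the coordinates are random toy data.

```r
# Hand-written haversine distance in metres (illustrative stand-in,
# vectorised over its arguments like distHaversine()).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

set.seed(1)
n <- 200
lon <- runif(n, 3, 7)
lat <- runif(n, 50, 54)

# Pattern 1: build all n * n index pairs, then make ONE vectorised call.
grid <- expand.grid(i = seq_len(n), j = seq_len(n))
d_vec <- haversine_m(lon[grid$i], lat[grid$i], lon[grid$j], lat[grid$j])

# Pattern 2: the distm()-style loop, one call per row.
d_loop <- matrix(0, n, n)
for (i in seq_len(n)) {
  d_loop[i, ] <- haversine_m(lon[i], lat[i], lon, lat)
}

# Both patterns produce the same distances.
all.equal(matrix(d_vec, n, n), d_loop)
```

    Wrapping each pattern in system.time() is one way to check how much the loop costs for a given data size.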

    2. Explaining the code in "solution":

    a) Splitting in groups:

    You want to analyse the data per group: province. Splitting into groups can be done by: split(df1, df1$province).

    b) Grouping "clumps of columns"

    You want all combinations of lat/lon pairs. A first guess might be expand.grid(), but it cannot keep multiple columns together as one unit. Luckily, Mr. Flick wrote a version of expand.grid for data.frames in R (the answer linked above).

    Then you have a data.frame() of all possible combinations and just have to use mean(distHaversine(...)).
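
    One detail is worth checking when averaging over all combinations: the full grid also pairs each house with itself (distance 0) and contains every pair in both orders. The sketch below, again under assumptions (a hand-written haversine_m() in place of distHaversine(), merge(by = NULL) in place of Expand.Grid(), and a made-up data frame called toy), compares the mean over the full grid with the mean over distinct pairs only.

```r
# Hand-written haversine distance in metres (illustrative stand-in).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up data: 3 houses in province 1, 2 houses in province 2.
toy <- data.frame(
  province = c(1, 1, 1, 2, 2),
  lon = c(4.9, 5.1, 4.5, 6.0, 6.2),
  lat = c(52.4, 52.1, 51.9, 53.0, 53.2)
)

# All pairwise distances within one group, self-pairs included.
pair_dist <- function(df) {
  p <- merge(df[, c("lon", "lat")], df[, c("lon", "lat")], by = NULL)
  haversine_m(p$lon.x, p$lat.x, p$lon.y, p$lat.y)
}

groups <- split(toy, toy$province)

# Mean over the full n * n grid (includes n zero-distance self-pairs).
avg_full <- sapply(groups, function(df) mean(pair_dist(df)))

# Mean over distinct pairs only: same sum, but n * (n - 1) terms.
avg_distinct <- sapply(groups, function(df) {
  n <- nrow(df)
  sum(pair_dist(df)) / (n * (n - 1))
})

avg_full / avg_distinct  # equals (n - 1) / n per group
```

    For a group of n houses the full-grid mean is (n - 1) / n times the distinct-pair mean, so the gap matters most for small groups.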
