Geographical distance by group - Applying a function on each pair of rows


清歌不尽 2020-12-21 05:04

I want to calculate the average geographical distance between a number of houses per province.

Suppose I have the following data.

df1 <- data.fram


      
      
        
7 answers

攒了一身酷 (original poster) 2020-12-21 05:39

Solution:
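library(geosphere)  # provides distHaversine()

# For each province: pair every house with every house, then average the distances.
# distHaversine() expects points in (longitude, latitude) column order,
# so check that the columns selected from df1 match that order.
lapply(split(df1, df1$province), function(df){
  df <- Expand.Grid(df[, c("lat", "lon")], df[, c("lat", "lon")])
  mean(distHaversine(df[, 1:2], df[, 3:4]))
})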

where Expand.Grid() is taken from https://stackoverflow.com/a/30085602/3502164.
Explanation:
1. Performance
I would avoid using distm(), because it wraps the vectorised function distHaversine() in an R-level for loop over the rows and so gives up most of the benefit of vectorisation.
If you look at the source code you see:
function (x, y, fun = distHaversine) 
{
   [...]
   for (i in 1:n) {
        dm[i, ] = fun(x[i, ], y)
    }
    return(dm)
}

While distHaversine() sends the "whole object" to C at once, distm() passes the data to distHaversine() row by row, which forces distHaversine() to process it row by row in C as well. distm() should therefore be avoided here: in terms of performance, I see more harm than benefit in using the wrapper function.
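To make the two calling patterns concrete, here is a small sketch in base R. It does not use geosphere: haversine_m() is a hypothetical stand-in for distHaversine() (same default radius of 6378137 m), and the coordinates are made up for illustration. It only shows that the row-wise loop and the single vectorised call produce the same distances; it is not a benchmark.

```r
# Hypothetical base-R stand-in for geosphere::distHaversine().
# Points are given as (longitude, latitude) in degrees.
haversine_m <- function(p1, p2, r = 6378137) {
  p1 <- matrix(as.matrix(p1), ncol = 2) * pi / 180
  p2 <- matrix(as.matrix(p2), ncol = 2) * pi / 180
  dlon <- p2[, 1] - p1[, 1]
  dlat <- p2[, 2] - p1[, 2]
  a <- sin(dlat / 2)^2 + cos(p1[, 2]) * cos(p2[, 2]) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up coordinates (longitude, latitude), for illustration only.
pts <- cbind(lon = c(4.90, 4.48, 5.12, 6.57),
             lat = c(52.37, 51.92, 52.09, 53.22))
n <- nrow(pts)

# Pattern 1: loop over rows in R, one call per row (as in the distm() excerpt).
dm_loop <- matrix(NA_real_, n, n)
for (i in seq_len(n)) dm_loop[i, ] <- haversine_m(pts[i, ], pts)

# Pattern 2: build all index pairs first, then make one vectorised call.
idx <- expand.grid(i = seq_len(n), j = seq_len(n))
dm_vec <- matrix(haversine_m(pts[idx$i, ], pts[idx$j, ]), n, n)

# Both patterns yield the same distance matrix.
all.equal(dm_loop, dm_vec)
```

With pattern 1 the function is called n times; with pattern 2 it is called once on n * n rows, which is the shape the solution above relies on.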
2. Explaining the code in "solution":
a) Splitting in groups:
You want to analyse the data per group: province.
Splitting into groups can be done by: split(df1, df1$province).
b) Grouping "clumps of columns"
You want to build every pairwise combination of lat/lon points within a group. A first guess might be expand.grid(), but that does not work for multiple columns at once. Luckily, Mr. Flick wrote an expand.grid-style function for data.frames in R: the Expand.Grid() linked above.
Then you have a data.frame() of all possible combinations and just have to use
mean(distHaversine(...)).
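Steps a) and b) can be sketched end to end in base R. Everything named here is an assumption for illustration: houses is made-up data (the original df1 is not reproduced), cross_pairs() is a hypothetical stand-in for Expand.Grid() built on merge(by = NULL), and haversine_m() is a hypothetical stand-in for distHaversine().

```r
# Hypothetical stand-in for geosphere::distHaversine();
# points are (longitude, latitude) in degrees.
haversine_m <- function(p1, p2, r = 6378137) {
  p1 <- as.matrix(p1) * pi / 180
  p2 <- as.matrix(p2) * pi / 180
  a <- sin((p2[, 2] - p1[, 2]) / 2)^2 +
    cos(p1[, 2]) * cos(p2[, 2]) * sin((p2[, 1] - p1[, 1]) / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical stand-in for Expand.Grid(): the cross join of two data.frames.
# merge() with by = NULL returns the Cartesian product of the rows.
cross_pairs <- function(a, b) merge(a, b, by = NULL)

# Made-up data, for illustration only.
houses <- data.frame(
  province = c(1, 1, 1, 2, 2),
  lon = c(4.90, 4.48, 5.12, 6.57, 5.80),
  lat = c(52.37, 51.92, 52.09, 53.22, 53.20)
)

# a) split by province, b) pair every house with every house, then average.
avg_dist <- lapply(split(houses, houses$province), function(df) {
  pairs <- cross_pairs(df[, c("lon", "lat")], df[, c("lon", "lat")])
  mean(haversine_m(pairs[, 1:2], pairs[, 3:4]))
})
avg_dist
```

In this sketch, pairing a group with itself also pairs each house with itself (distance 0) and lists every other pair twice, so the mean is taken over all n * n rows. Whether that is the average you want, or whether self-pairs should be dropped first, depends on the question being asked.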
    
             
                                                        
            
            
              
                