R and spark: Compare distance between different geographical points

Submitted by 限于喜欢 on 2020-01-24 12:21:22

Question


I am working with the New York City taxi data set. The data set has columns including datetime, pickup lat/lon, dropoff lat/lon etc. Now I want to reverse geocode the lat/lon to find the borough/neighborhood.

I have two data frames. 1) The first data frame contains all the points I want to classify with the name of the nearest New York neighborhood. 2) The second data frame contains the neighborhood names and their centroids.

Here is a small example.

df_points_to_classify:

     longitude   latitude     
         <dbl>      <dbl>
1    -73.99037   40.73470
2    -73.98078   40.72991
3    -73.98455   40.67957 
4    -73.99347   40.71899 

df_neighborhood_names_and_their_centroids:

            longitude           latitude  neighborhood
                <dbl>              <dbl>         <chr>
1   -73.8472005205491  40.89470517661004     Wakefield 
2  -73.82993910812405  40.87429419303015    Co-op City
3  -73.82780644716419  40.88755567735082   Eastchester 
4  -73.90564259591689 40.895437426903875     Fieldston 

To assign a point to a neighborhood I have to calculate the distance from the point to the centroid of each neighborhood; the point then belongs to the neighborhood with the shortest distance (a minimal sketch of this rule follows the expected output below).

The expected output adds a column to the data frame of points to be classified, containing the neighborhood to which each point belongs.

expected output:

     longitude   latitude  neighborhood
         <dbl>      <dbl>         <chr>
1    -73.99037   40.73470     Fieldston
2    -73.98078   40.72991    Co-op City
3    -73.98455   40.67957        etc...
4    -73.99347   40.71899        etc...
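
To make the rule concrete, here is a minimal local sketch of it (using the geosphere package) for the first example point; the nearest centroid's neighborhood wins:

library(geosphere)

# distance in metres from one point to every centroid
d <- distHaversine(c(-73.99037, 40.73470),
                   df_neighborhood_names_and_their_centroids[, c("longitude", "latitude")])
# the neighborhood of the closest centroid
df_neighborhood_names_and_their_centroids$neighborhood[which.min(d)]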

I would like to use a computationally efficient method because the set of points to classify is very large (more than one gigabyte). For this reason I'm using Spark from R. The file has been loaded this way:

library(sparklyr)
sc <- spark_connect(master = "local")
df_points_to_classify <- spark_read_csv(sc, name = "df_points_to_classify", path = "D:/df_points_to_classify.csv")

Is it possible to use dplyr to solve this problem?

EDIT: this solution isn't applicable when using Spark, because column extraction such as df_points_to_classify$any_variable returns NULL on a Spark DataFrame (a pure-Spark sketch follows the snippet below):

library(spatialrisk)
ans <- purrr::map2_dfr(df_points_to_classify$longitude, 
                       df_points_to_classify$latitude, 
                       ~spatialrisk::points_in_circle(df_neighborhood_names_and_their_centroids, .x, .y, 
                                                      lon = longitude, 
                                                      lat = latitude, 
                                                      radius = 2000000)[1,])
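
For reference, here is a minimal sketch of what a pure-Spark approach could look like, using only dplyr verbs that sparklyr can translate to Spark SQL. The centroid table name and path, the c_lon/c_lat renames, and the dummy join key are all illustrative; the literal 0.0174533 approximates pi/180 so the haversine arithmetic stays translatable:

library(dplyr)

# illustrative name/path for the centroid table
df_centroids <- spark_read_csv(sc, "df_centroids",
                               "D:/df_neighborhood_names_and_their_centroids.csv")

classified <- df_points_to_classify %>%
  mutate(dummy = 1L) %>%
  # cross join every point with every centroid via a constant key
  inner_join(df_centroids %>%
               rename(c_lon = longitude, c_lat = latitude) %>%
               mutate(dummy = 1L),
             by = "dummy") %>%
  # haversine distance in metres (0.0174533 = pi / 180, degrees to radians)
  mutate(dist_m = 2 * 6371000 * asin(sqrt(
    sin((c_lat - latitude) * 0.0174533 / 2) *
      sin((c_lat - latitude) * 0.0174533 / 2) +
      cos(latitude * 0.0174533) * cos(c_lat * 0.0174533) *
      sin((c_lon - longitude) * 0.0174533 / 2) *
      sin((c_lon - longitude) * 0.0174533 / 2)))) %>%
  # keep, for each point, only the row with the smallest distance
  group_by(longitude, latitude) %>%
  filter(dist_m == min(dist_m, na.rm = TRUE)) %>%
  ungroup() %>%
  select(longitude, latitude, neighborhood)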

Answer 1:


Below is a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp) and are therefore very fast.

First, load the data:

df1 <- data.frame(longitude = c(-73.99037, -73.98078, -73.98455, -73.99347), 
                  latitude = c(40.73470, 40.72991, 40.67957, 40.71899))

df2 <- data.frame(longitude = c(-73.8472005205491, -73.82993910812405, -73.82780644716419, -73.90564259591689), 
                  latitude = c(40.89470517661004, 40.87429419303015, 40.88755567735082, 40.895437426903875), 
                  neighborhood = c("Wakefield", "Co-op City", "Eastchester", "Fieldston"))

The function spatialrisk::points_in_circle() returns the observations within radius of a center point, closest first, which is why only the first row of each result is kept. Note that distances are calculated using the Haversine formula. Since each element of the output is a data frame, purrr::map2_dfr is used to row-bind them together:

ans <- purrr::map2_dfr(df1$longitude, 
                       df1$latitude, 
                       ~spatialrisk::points_in_circle(df2, .x, .y, 
                                                      lon = longitude, 
                                                      lat = latitude, 
                                                      radius = 2000000)[1,])


cbind(df1, ans)

 longitude latitude longitude latitude neighborhood distance_m
1 -73.99037 40.73470 -73.90564 40.89544    Fieldston   19264.50
2 -73.98078 40.72991 -73.90564 40.89544    Fieldston   19483.54
3 -73.98455 40.67957 -73.90564 40.89544    Fieldston   24933.59
4 -73.99347 40.71899 -73.90564 40.89544    Fieldston   20989.84
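
(All four sample points land on Fieldston here only because the four example centroids all lie close together in the Bronx, roughly 19-25 km from the sample points; with the full centroid file each point would get its actual nearest neighborhood.)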



Answer 2:


Here is a complete solution. It is not necessarily the most efficient, but on my machine it is estimated to take about 90 minutes for 12 million starting locations.
Yes, this could be made more efficient, but if this is a one-time run: set it, forget it, and come back later for the results. One option to make it more efficient is to round the locations to 3 or 4 decimal places, find the neighborhood only for the unique locations, and then join the results back to the original data frame, as sketched below.
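
A minimal sketch of that deduplication idea (the round-to-4-decimals choice and the lon_r/lat_r names are illustrative; the "classify" step is the distGeo loop shown further down, run on unique_pts instead of point_class):

library(dplyr)

# round the coordinates so nearby pickups collapse to one location
taxi_r <- taxi %>%
  mutate(lon_r = round(pickup_longitude, 4),
         lat_r = round(pickup_latitude, 4))

# classify only the distinct rounded locations ...
unique_pts <- taxi_r %>% distinct(lon_r, lat_r)
# ... (run the nearest-centroid loop below on unique_pts, adding a
# unique_pts$neighborhood column) ...

# ... then join the labels back to the full data
taxi_r <- taxi_r %>% left_join(unique_pts, by = c("lon_r", "lat_r"))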

library(readr)
library(dplyr)
library(stringr)

#read taxi data in
taxi <- read_csv("yellow.csv")
#Remove unneeded columns (reduces memory requirements and improves speed)
taxi <- taxi %>% select(c(2:7, 10, 11, 13, 16))
#filter out rows that have bad data (far outside expected area)
taxi <- taxi %>% filter(pickup_longitude  > -75 & pickup_longitude  < -70)
taxi <- taxi %>% filter(dropoff_longitude  > -75 & dropoff_longitude  < -70)
taxi <- taxi %>% filter(pickup_latitude  > 35 & pickup_latitude  < 45)
taxi <- taxi %>% filter(dropoff_latitude  > 35 & dropoff_latitude  < 45)

point_class <- taxi[1:200000, ]  #reduce the size of the starting data for testing

#read neighborhood data and clean up data
df_neighborhood<-read.csv("NHoodNameCentroids.csv", stringsAsFactors = FALSE)
location<-str_extract(df_neighborhood$the_geom, "[-0-9.]+ [-0-9.]+")
location<-matrix(as.numeric(unlist(strsplit(location, " "))), ncol=2, byrow=TRUE)
df_neighborhood$longitude<- location[,1]
df_neighborhood$latitude <- location[,2]
df_neighborhood<-df_neighborhood[, c("OBJECTID", "Name", "Borough", "longitude", "latitude")]

#find closest neighbor to starting location
library(geosphere)
start<-Sys.time()
#preallocate the memory to store the result
neighborhood <- integer(nrow(point_class))
for (i in 1:nrow(point_class)) {
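  # distGeo() returns the distance in metres from point i (cols 5:6 =
  # pickup lon/lat) to every centroid (cols 4:5 = centroid lon/lat);
  # which.min() picks the index of the nearest centroid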
  distance <- distGeo(point_class[i, 5:6], df_neighborhood[, 4:5])
  neighborhood[i]<-which.min(distance)
}

point_class$neighborhood <- df_neighborhood$Name[neighborhood]
point_class
print(Sys.time()-start)


Source: https://stackoverflow.com/questions/58540031/r-and-spark-compare-distance-between-different-geographical-points
