Question
I am working with the New York City taxi data set. The data set has columns including the datetime, pickup lat/lon, dropoff lat/lon, etc. Now I want to reverse geocode the lat/lon to find the borough/neighborhood.
I have two data frames: 1) the first contains all the points I want to classify, to be labelled with the name of the nearest New York neighborhood; 2) the second contains the neighborhood names and their centroids.
Here is a small example.
df_points_to_classify:
longitude latitude
<dbl> <dbl>
1 -73.99037 40.73470
2 -73.98078 40.72991
3 -73.98455 40.67957
4 -73.99347 40.71899
df_neighborhood_names_and_their_centroids:
longitude latitude neighborhood
<dbl> <dbl> <chr>
1 -73.8472005205491 40.89470517661004 Wakefield
2 -73.82993910812405 40.87429419303015 Co-op City
3 -73.82780644716419 40.88755567735082 Eastchester
4 -73.90564259591689 40.895437426903875 Fieldston
To assign a single point to a neighborhood I have to calculate the distance from the point to the centroid of each neighborhood; the point belongs to the neighborhood with the shortest distance (a small in-memory sketch of this rule follows the expected output).
The expected output consists of adding a column, containing the neighborhood each point belongs to, to the data frame of points to be classified.
expected output:
longitude latitude neighborhood
<dbl> <dbl> <chr>
1 -73.99037 40.73470 Fieldston
2 -73.98078 40.72991 Co-op City
3 -73.98455 40.67957 etc...
4 -73.99347 40.71899 etc...
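For illustration, the rule above can be sketched in plain R on the small example. This assumes the two example data frames are ordinary in-memory data frames and that the geosphere package is available; it is not the Spark approach asked about below.
library(geosphere)
# minimal in-memory illustration of the nearest-centroid rule:
# for each point, compute the haversine distance to every centroid
# and keep the name of the closest one
df_points_to_classify$neighborhood <- apply(
  df_points_to_classify[, c("longitude", "latitude")], 1,
  function(p) {
    d <- distHaversine(p, df_neighborhood_names_and_their_centroids[, c("longitude", "latitude")])
    df_neighborhood_names_and_their_centroids$neighborhood[which.min(d)]
  })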
I would like to use a computationally efficient method because the database of points to classify is very big (more than one gigabyte). For this reason I'm using Spark from R. The file has been loaded this way:
library(sparklyr)
sc <- spark_connect(master = "local")
df_points_to_classify <- spark_read_csv(sc, "D:\\df_points_to_classify.csv")
Is it possible to use dplyr to solve this problem?
EDIT:
The solution below isn't applicable when using Spark, because extracting a column from a Spark DataFrame with df_points_to_classify$any_variable returns NULL.
library(spatialrisk)
ans <- purrr::map2_dfr(df_points_to_classify$longitude,
                       df_points_to_classify$latitude,
                       ~spatialrisk::points_in_circle(df_neighborhood_names_and_their_centroids, .x, .y,
                                                      lon = longitude,
                                                      lat = latitude,
                                                      radius = 2000000)[1, ])
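A possible workaround is to avoid $ extraction altogether and express the nearest-centroid logic with dplyr verbs, which sparklyr translates to Spark SQL. The following is only a sketch: it assumes the centroid table is small enough to copy into Spark, and the table name "centroids", the point_id column, the dummy join key and the 6,371 km earth radius are illustrative choices, not part of the original code.
library(sparklyr)
library(dplyr)

# copy the small centroid table into Spark
centroids_tbl <- copy_to(sc, df_neighborhood_names_and_their_centroids,
                         "centroids", overwrite = TRUE)

# give every point an id so its closest centroid can be picked later
points_tbl <- sdf_with_unique_id(df_points_to_classify, id = "point_id")

classified <- points_tbl %>%
  mutate(dummy = 1L) %>%
  # cross join: pair every point with every centroid via the dummy key
  inner_join(centroids_tbl %>%
               rename(c_lon = longitude, c_lat = latitude) %>%
               mutate(dummy = 1L),
             by = "dummy") %>%
  # haversine distance in metres, written with functions Spark SQL understands
  mutate(dlat = radians(c_lat - latitude),
         dlon = radians(c_lon - longitude),
         a = sin(dlat / 2) * sin(dlat / 2) +
             cos(radians(latitude)) * cos(radians(c_lat)) *
             sin(dlon / 2) * sin(dlon / 2),
         dist_m = 2 * 6371000 * asin(sqrt(a))) %>%
  # keep, for each point, the centroid with the smallest distance
  group_by(point_id) %>%
  filter(dist_m == min(dist_m, na.rm = TRUE)) %>%
  ungroup() %>%
  select(longitude, latitude, neighborhood)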
Answer 1:
Below is a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp) and are therefore very fast.
First, load the data:
df1 <- data.frame(longitude = c(-73.99037, -73.98078, -73.98455, -73.99347),
                  latitude = c(40.73470, 40.72991, 40.67957, 40.71899))
df2 <- data.frame(longitude = c(-73.8472005205491, -73.82993910812405, -73.82780644716419, -73.90564259591689),
                  latitude = c(40.89470517661004, 40.87429419303015, 40.88755567735082, 40.895437426903875),
                  neighborhood = c("Wakefield", "Co-op City", "Eastchester", "Fieldston"))
The function spatialrisk::points_in_circle() returns the observations within a given radius of a center point, ordered by distance, so taking [1, ] gives the nearest centroid. Note that distances are calculated using the Haversine formula. Since each element of the output is a data frame, purrr::map2_dfr is used to row-bind them together:
ans <- purrr::map2_dfr(df1$longitude,
                       df1$latitude,
                       ~spatialrisk::points_in_circle(df2, .x, .y,
                                                      lon = longitude,
                                                      lat = latitude,
                                                      radius = 2000000)[1, ])
cbind(df1, ans)
longitude latitude longitude latitude neighborhood distance_m
1 -73.99037 40.73470 -73.90564 40.89544 Fieldston 19264.50
2 -73.98078 40.72991 -73.90564 40.89544 Fieldston 19483.54
3 -73.98455 40.67957 -73.90564 40.89544 Fieldston 24933.59
4 -73.99347 40.71899 -73.90564 40.89544 Fieldston 20989.84
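Since the question loads the points with sparklyr, one way to reuse this answer is to bring the points back into R memory first. This is a sketch, only feasible when the collected points fit in driver memory; df2 stands for the centroid table as above.
# assumption: the collected points fit in local memory
df_points_local <- dplyr::collect(df_points_to_classify)

ans <- purrr::map2_dfr(df_points_local$longitude,
                       df_points_local$latitude,
                       ~spatialrisk::points_in_circle(df2, .x, .y,
                                                      lon = longitude,
                                                      lat = latitude,
                                                      radius = 2000000)[1, ])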
Answer 2:
Here is a complete solution. It is not necessarily the most efficient, but on my machine it is estimated to take about 90 minutes for 12 million starting locations.
Yes, this could be made more efficient, but if this is a one-time run: set it, forget it, and come back later for the results. One possible way to make it more efficient is to round the locations to 3 or 4 decimal places, find the neighborhood only for the unique locations, and then join the results back to the original data frame (a sketch of this idea follows the code below).
library(readr)
library(dplyr)
library(stringr)
# read the taxi data in
taxi<-read_csv("yellow.csv")
# remove unneeded columns (reduces memory requirements and improves speed)
taxi <- taxi %>% select( c(2:7, 10, 11, 13, 16 ))
#filter out rows that have bad data (far outside expected area)
taxi <- taxi %>% filter(pickup_longitude > -75 & pickup_longitude < -70)
taxi <- taxi %>% filter(dropoff_longitude > -75 & dropoff_longitude < -70)
taxi <- taxi %>% filter(pickup_latitude > 35 & pickup_latitude < 45)
taxi <- taxi %>% filter(dropoff_latitude > 35 & dropoff_latitude < 45)
point_class <- taxi[1:200000, ] # reduce the size of the starting data for testing
# read the neighborhood data and clean it up
df_neighborhood<-read.csv("NHoodNameCentroids.csv", stringsAsFactors = FALSE)
location<-str_extract(df_neighborhood$the_geom, "[-0-9.]+ [-0-9.]+")
location<-matrix(as.numeric(unlist(strsplit(location, " "))), ncol=2, byrow=TRUE)
df_neighborhood$longitude<- location[,1]
df_neighborhood$latitude <- location[,2]
df_neighborhood<-df_neighborhood[, c("OBJECTID", "Name", "Borough", "longitude", "latitude")]
# find the closest neighborhood centroid for each starting location
library(geosphere)
start<-Sys.time()
#preallocate the memory to store the result
neighborhood<-vector(length=nrow(point_class))
for (i in 1:nrow(point_class)) {
  # columns 5:6 of point_class are the pickup longitude/latitude;
  # columns 4:5 of df_neighborhood are the centroid longitude/latitude
  distance <- distGeo(point_class[i, 5:6], df_neighborhood[, 4:5])
  neighborhood[i] <- which.min(distance)
}
point_class$neighborhood <- df_neighborhood$Name[neighborhood]
point_class
print(Sys.time()-start)
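Below is a sketch of the speed-up mentioned above: round the coordinates, classify only the unique rounded locations, then join the labels back. The 4-decimal rounding and the use of the pickup coordinates are illustrative choices.
# round the pickup coordinates and keep only the unique locations
taxi_rounded <- taxi %>%
  mutate(lon_r = round(pickup_longitude, 4),
         lat_r = round(pickup_latitude, 4))
unique_locs <- taxi_rounded %>% distinct(lon_r, lat_r)

# nearest neighborhood for each unique rounded location (same loop as above)
idx <- integer(nrow(unique_locs))
for (i in seq_len(nrow(unique_locs))) {
  d <- distGeo(c(unique_locs$lon_r[i], unique_locs$lat_r[i]),
               df_neighborhood[, c("longitude", "latitude")])
  idx[i] <- which.min(d)
}
unique_locs$neighborhood <- df_neighborhood$Name[idx]

# join the labels back to the full data set
taxi_labeled <- left_join(taxi_rounded, unique_locs, by = c("lon_r", "lat_r"))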
Source: https://stackoverflow.com/questions/58540031/r-and-spark-compare-distance-between-different-geographical-points