Fuzzy matching of coordinates

本小妞迷上赌 提交于 2019-12-05 07:18:42

问题


I have two datasets, one of them containing the coordinates of people's addresses (addresses), and the other one containing the coordinates of rainfall in certain locations (rain). The coordinates are standard lat and lon. I would like to merge these two sets together, by matching each address to the nearest rainfall location, using the spherical distance between two coordinates to determine the "nearest". The naive way is to compute all pairwise distances between each address and each rainfall location and keep the minimum, but since my datasets are quite big, I was wondering if there was another computationally efficient way to do this.

I'm using the geosphere package to calculate distance.

Here's a subset of the data.

rain <- structure(list(lat = c(-179.75, -179.75, -179.75, -179.75, -179.75, 
-179.75, -179.75, -179.75, -179.75, -179.75), lon = c(71.25, 
68.75, 68.25, 67.75, 67.25, 66.75, 66.25, 65.75, 65.25, -16.75
), rainfall = c(0, 4.9, 4.6, 4.9, 8.9, 15.2, 24.2, 16.3, 12.2, 
365.4)), .Names = c("lat", "lon", "rainfall"), class = "data.frame", row.names = c(NA, 
-10L))


addresses <- structure(list(address_lat = c(-175.33, -175.20, -177.65, -174.10, -175.80, 
-179.50, -179.23, -179.12, -178.75, -174.77), address_lon = c(70.25, 
69.75, 62.23, 60.50, 66.25, 61.75, 62.54, 63.70, 61.45, -15.80),
person_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), .Names = c("address_lat", "address_lon",     
"person_id"), class = "data.frame", row.names = c(NA, -10L))

I've got 300,000 unique coordinate pairs in one set, and over 80,000 in the other. The only idea I have is to use two for loops, one to run over a list of address coordinate pairs, then another nested one to calculate the distance from each address to all rainfall locations, then keeping the smallest.


回答1:


First I should mention that I think that the column labels for latitude and longitude should be reversed... otherwise you end up with latitudes that are less than -90. :-) I have done this for my solution below.

library(geosphere)

D = distm(addresses[, 1:2], rain[, 1:2])
#
cbind(addresses, rain[apply(D, 1, which.min),])

First you form the distance matrix. Each row in this matrix gives the distances from one of the addresses to each of the rainfall observations. We use which.min to pick out the smallest entry in each row and then use this to index into the rainfall data.



来源:https://stackoverflow.com/questions/20590119/fuzzy-matching-of-coordinates

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!