Question
I have a large list of longitude and latitude data corresponding to fast food places in the U.S. For each fast food place, I want to know how many other fast food places are within 5 miles. I could calculate this in Pandas using Geopy like so (each row in the DataFrame is a different fast food place):
import pandas as pd
import geopy.distance

df = pd.DataFrame({'Fast Food Place':[1,2,3], 'Lat':[33,34,35], 'Lon':[42,43,44]})

for index1, row1 in df.iterrows():
    num_fastfood = 0
    for index2, row2 in df.iterrows():
        # calculate distance in miles between the two (lat, lon) points
        # (geodesic replaces VincentyDistance, which was removed in geopy 2.0)
        dist = geopy.distance.geodesic((row1['Lat'], row1['Lon']),
                                       (row2['Lat'], row2['Lon'])).miles
        # if fast food is within 5 miles, increment num_fastfood
        if dist < 5:
            num_fastfood = num_fastfood + 1
    df.loc[index1, 'num_fastfood_5miles'] = num_fastfood - 1  # (subtract 1 to exclude self)
But this is extremely slow on very large data sets (e.g. 50,000 rows). I considered using a KDTree for the search, but I am curious whether anyone has a much quicker method.
Answer 1:
Implementation with scipy.spatial.cKDTree:
from scipy.spatial import cKDTree

def find_neighbours_within_radius(xy, radius):
    tree = cKDTree(xy)
    within_radius = tree.query_ball_tree(tree, r=radius)
    return within_radius

def flatten_nested_list(nested_list):
    return [item for sublist in nested_list for item in sublist]

def total_neighbours_within_radius(xy, radius):
    neighbours = find_neighbours_within_radius(xy, radius)
    return len(flatten_nested_list(neighbours))
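
Note that the functions above measure distance in the units of the input coordinates, and total_neighbours_within_radius returns a single total (with every point also counted as its own neighbour) rather than a count per row. The following is a minimal sketch of one way to adapt the same cKDTree idea to the question, not part of the original answer. It assumes a spherical Earth with a mean radius of 3958.8 miles, so results can differ slightly from geopy's ellipsoidal distances near the 5-mile boundary. The helper name count_within_radius and the constant EARTH_RADIUS_MILES are hypothetical. Points are mapped to 3D Cartesian coordinates, and the 5-mile great-circle radius is converted to the equivalent straight-line chord length so that a Euclidean tree query is valid.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Assumption: spherical Earth with mean radius in miles (not an ellipsoid).
EARTH_RADIUS_MILES = 3958.8

def count_within_radius(df, radius_miles=5.0, lat_col='Lat', lon_col='Lon'):
    """Return, for each row, how many OTHER rows lie within radius_miles."""
    lat = np.radians(df[lat_col].to_numpy(dtype=float))
    lon = np.radians(df[lon_col].to_numpy(dtype=float))

    # Map (lat, lon) to 3D Cartesian points on the sphere's surface.
    xyz = EARTH_RADIUS_MILES * np.column_stack((
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ))

    # Convert the great-circle radius to the matching chord length,
    # because the tree compares straight-line (Euclidean) distances.
    chord = 2.0 * EARTH_RADIUS_MILES * np.sin(radius_miles / (2.0 * EARTH_RADIUS_MILES))

    tree = cKDTree(xyz)
    neighbours = tree.query_ball_point(xyz, r=chord)

    # Each point is returned as its own neighbour, so subtract 1.
    return np.array([len(n) - 1 for n in neighbours])

df = pd.DataFrame({'Fast Food Place': [1, 2, 3],
                   'Lat': [33, 34, 35],
                   'Lon': [42, 43, 44]})
df['num_fastfood_5miles'] = count_within_radius(df)
```

One difference from the loop in the question: query_ball_point includes points at exactly the radius, whereas the loop uses a strict dist < 5 comparison.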
Source: https://stackoverflow.com/questions/43592094/efficient-way-to-calculate-geographic-density-in-pandas